About The Company:
iLink Digital is a global software solution provider and systems integrator that delivers next-generation technology solutions to help clients solve complex business challenges, improve organizational effectiveness, increase business productivity, realize sustainable enterprise value, and transform their business inside-out. iLink integrates software systems and develops custom applications, components, and frameworks on the latest platforms for IT departments, commercial accounts, application service providers (ASPs), and independent software vendors (ISVs). iLink solutions are used in a broad range of industries and functions, including healthcare, telecom, government, oil and gas, education, and life sciences. iLink's expertise includes Cloud Computing & Application Modernization, Data Management & Analytics, Enterprise Mobility, Portal, Collaboration & Social Employee Engagement, Embedded Systems, and User Experience Design.
What makes iLink's offerings unique is our use of pre-built frameworks designed to accelerate software development and the implementation of business processes for our clients. iLink has over 60 frameworks (solution accelerators), both industry-specific and horizontal, that can be easily customized and enhanced to meet your current business challenges.
Requirements
- 6–10 years of experience in SRE, DevOps, infrastructure, and production support engineering roles.
- Proven experience managing multi-cloud environments (AWS + Azure).
- Demonstrated experience handling P1/P2 production incidents in cloud environments.
- Familiarity with Prometheus, Grafana, Datadog, or Splunk.
- Design, deploy, and manage Kubernetes clusters for production workloads at scale.
- Architect and maintain PostgreSQL databases, including performance tuning, HA setup, and backup/restore strategies.
- Build and manage cloud infrastructure on AWS and Azure using Terraform and Ansible.
- Lead vulnerability management programs: identify, prioritize, and remediate security risks across the stack.
- Define and enforce SLOs, SLIs, and error budgets; drive reliability improvements across services.
- Implement IaC best practices, automate provisioning pipelines, and reduce manual toil.
- Collaborate with development teams on capacity planning, disaster recovery, and incident post-mortems.
- Build and maintain monitoring, alerting, and observability frameworks (Prometheus, Grafana, ELK, etc.).
- Lead end-to-end incident management: detection, triage, escalation, resolution, and communication.
- Serve as an on-call engineer; manage and respond to alerts and production incidents effectively.
- Conduct blameless post-mortems and implement action items to prevent recurrence.
- Monitor system health using dashboards and alerting tools; proactively identify degradation risks.
- Collaborate with Dev, QA, and infrastructure teams to identify and reduce toil and failure points.
- Support Kubernetes workloads and assist in troubleshooting cluster-level issues.
- Work across AWS and Azure environments for incident containment and recovery.
- Maintain and improve runbooks, playbooks, and incident response documentation.
- Strong understanding of networking, security, and distributed systems.
- Excellent communication skills for cross-team collaboration and post-mortem documentation.
- Experience with Helm, ArgoCD, or GitOps workflows.
Benefits
- Competitive Salaries
- Medical Insurance
- Employee Referral Bonuses
- Performance-Based Bonuses
- Flexible Work Options & Fun Culture
- Robust Learning & Development Programs
- In-House Technology Training