Job description

About The Company:

iLink Digital is a Global Software Solution Provider and Systems Integrator that delivers next-generation technology solutions to help clients solve complex business challenges, improve organizational effectiveness, increase business productivity, realize sustainable enterprise value, and transform their business inside-out. iLink integrates software systems and develops custom applications, components, and frameworks on the latest platforms for IT departments, commercial accounts, application service providers (ASPs), and independent software vendors (ISVs). iLink solutions are used in a broad range of industries and functions, including healthcare, telecom, government, oil and gas, education, and life sciences. iLink’s expertise includes Cloud Computing & Application Modernization, Data Management & Analytics, Enterprise Mobility, Portal, Collaboration & Social Employee Engagement, Embedded Systems, and User Experience Design.

What makes iLink's offerings unique is our use of pre-created frameworks designed to accelerate software development and the implementation of business processes for our clients. iLink has over 60 frameworks (solution accelerators), both industry-specific and horizontal, that can be easily customized and enhanced to meet your current business challenges.

Requirements

6–10 years of experience in SRE, DevOps, infrastructure, and production support engineering roles.
Proven experience managing multi-cloud environments (AWS + Azure).
Demonstrated experience handling P1/P2 production incidents in cloud environments.
Familiarity with Prometheus, Grafana, Datadog, or Splunk.

Design, deploy, and manage Kubernetes clusters for production workloads at scale.
Architect and maintain PostgreSQL databases — performance tuning, HA setup, backup/restore strategies.
Build and manage cloud infrastructure on AWS and Azure using Terraform and Ansible.
Lead vulnerability management programs — identify, prioritize, and remediate security risks across the stack.
Define and enforce SLOs, SLIs, and error budgets; drive reliability improvements across services.
Implement IaC best practices, automate provisioning pipelines, and reduce manual toil.
Collaborate with development teams on capacity planning, disaster recovery, and incident post-mortems.
Build and maintain monitoring, alerting, and observability frameworks (Prometheus, Grafana, ELK, etc.).

Lead end-to-end incident management — detection, triage, escalation, resolution, and communication.
Serve as an on-call engineer; manage and respond to alerts and production incidents effectively.
Conduct blameless post-mortems and implement action items to prevent recurrence.
Monitor system health using dashboards and alerting tools; proactively identify degradation risks.
Collaborate with Dev, QA, and infrastructure teams to identify and reduce toil and failure points.
Support Kubernetes workloads and assist in troubleshooting cluster-level issues.
Work across AWS and Azure environments for incident containment and recovery.
Maintain and improve runbooks, playbooks, and incident response documentation.

Strong understanding of networking, security, and distributed systems.
Excellent communication skills for cross-team collaboration and post-mortem documentation.
Experience with Helm, ArgoCD, or GitOps workflows.

Benefits

Competitive salaries
Medical Insurance
Employee Referral Bonuses
Performance Based Bonuses
Flexible Work Options & Fun Culture
Robust Learning & Development Programs
In-House Technology Training

Senior Site Reliability Engineer
