Job description

AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best Place to Work awards.

WHY JOIN US

If you're looking for a place to grow, make an impact, and work with people who care, we'd love to meet you!

ABOUT THE ROLE

We are looking for an SRE Operations Engineer to maintain reliability across a cloud-based SaaS platform. You’ll handle live incidents, improve observability, and reduce toil through automation using Kubernetes, Terraform, Grafana, and AWS. The role is hands-on and execution-focused, with real ownership across CI/CD pipelines, GitOps workflows, and on-call rotations.

WHAT YOU WILL DO

- Monitor and support production and staging environments to ensure availability, performance, and stability;
- Respond to incidents, perform triage and root cause analysis, and contribute to remediation efforts;
- Participate in on-call rotations with defined SLAs;
- Handle operational requests from internal teams;
- Maintain and improve monitoring, alerting, dashboards, logs, and metrics;
- Support CI/CD pipelines, production releases, and GitOps workflows;
- Contribute to automation initiatives to reduce operational overhead;
- Maintain and improve Kubernetes-based infrastructure and containerized workloads;
- Support Infrastructure as Code practices and environment improvements.
MUST HAVES

- 2+ years of experience in Site Reliability Engineering, DevOps, or Production Operations;
- Experience with AWS supporting production environments;
- Experience supporting production SaaS applications;
- Strong understanding of CI/CD systems (GitHub Actions, Jenkins, CircleCI);
- Experience with GitOps and Git fundamentals;
- Experience using GitHub, Jira, and Confluence;
- Experience with Kubernetes (EKS, kOps or similar);
- Experience with Docker and containerization;
- Experience with observability tools (Grafana, Prometheus, Loki, PagerDuty);
- Proficiency in scripting (Bash, Python, or Go);
- Experience with Infrastructure as Code (Terraform, Helm);
- Ability to work within structured operational processes and SLAs;
- Strong written and verbal English communication skills;
- Self-driven with a growth mindset.

NICE TO HAVES

- AWS certifications such as Solutions Architect, DevOps Engineer, or SysOps Administrator;
- Experience with multi-tenant SaaS environments;
- Experience working in globally distributed teams;
- Familiarity with ChatOps practices;
- Experience improving monitoring quality and reducing alert fatigue.

PERKS AND BENEFITS

- Professional growth: Mentorship, TechTalks, and personalized growth roadmaps.
- Competitive compensation: USD-based pay with education, fitness, and team activity budgets.
- Exciting projects: Modern solutions with Fortune 500 and top product companies.
- Flextime: Flexible schedule with remote and office options.

Meet Our Recruitment Process

The process includes these main stages: Application → Coding Challenge → Video Interview → Technical Interview or Interview with the Hiring Manager(s). Each step helps us understand your skills and overall fit. If it’s a match, you’ll receive an offer.

Site Reliability Engineer ID53670
