Manager, Site Reliability Engineering at Monotype (Noida) · CodeHub Jobs

Job description

Are you our “TYPE”?

Monotype Global

Named "One of the Most Innovative Companies in Design" by Fast Company, Monotype brings brands to life through type and technology that consumers engage with every day.

The company's rich legacy includes a library that can be traced back hundreds of years, featuring famed typefaces like Helvetica, Futura, Times New Roman and more.

Monotype also provides a first-of-its-kind service that makes fonts more accessible for creative professionals to discover, license, and use in our increasingly digital world. We work with the biggest global brands, and with individual creatives, offering a wide set of solutions that make it easier for them to do what they do best: design beautiful brand experiences.

Monotype Solutions India

Monotype Solutions India is a strategic center of excellence for Monotype and has been certified as a Great Place to Work® three years in a row. The focus of this fast-growing center spans Product Development, Product Management, Experience Design, User Research, Market Intelligence, Research in Artificial Intelligence and Machine Learning, Innovation, Customer Success, Enterprise Business Solutions, and Sales.

Headquartered in the Boston area of the United States, with offices across four continents, Monotype is the world’s leading company in fonts and a trusted partner to the world’s top brands.

We are looking for an experienced and hands-on Site Reliability Engineering (SRE) Manager to lead the reliability, stability, and operational excellence of our enterprise platforms. This role will own both 24x7 incident management operations and SRE engineering efforts, ensuring high system availability, fast incident response, and continuous improvement of platform reliability.

You will lead a team responsible for maintaining uptime, reducing incidents, improving response times, and building a more proactive and self-sufficient SRE function. The role requires a balance of hands-on technical depth and people leadership, with a strong focus on automation, observability, release stability, and team maturity.

As we expand into AI-driven workloads, you will also support reliability, monitoring, and scalability of these systems.

What you’ll be doing:

Reliability & Incident Management

·Own end-to-end reliability of production systems, ensuring uptime within defined SLAs

·Lead and govern a 24x7x365 incident management team, ensuring quick response and resolution

·Act as escalation point during critical incidents and drive coordination across teams

·Ensure proper incident tracking, communication, and status page updates

Incident Improvement & RCA

·Drive a strong blameless RCA culture across the team

·Ensure all customer-impacting incidents are analysed with clear root causes

·Track and drive closure of RCA action items to prevent repeat issues

·Identify recurring patterns and push for permanent fixes

Observability & Monitoring

·Own and improve observability using tools like Datadog, CloudWatch, ELK, Prometheus

·Guide teams on effective logging, alerting, and monitoring practices

·Reduce alert noise and improve signal-to-noise ratio

·Drive proactive monitoring and early detection of issues

Automation & Operational Efficiency

·Drive automation to reduce manual effort and operational toil

·Identify repetitive issues and build solutions to eliminate them

·Ensure runbooks and playbooks are created and followed for recurring incidents

Release Stability & Production Readiness

·Work with Product, Engineering & Platform teams to improve release quality and stability

·Ensure proper readiness checks before production deployments (monitoring, rollback, alerts)

·Reduce production issues caused by releases

AI Workload Reliability

·Support reliability and monitoring of AI/ML workloads in production and experimentation environments

·Ensure visibility, stability, and cost awareness for AI-driven systems

·Bring structure and best practices as AI adoption grows

Team Leadership & Development

·Lead and mentor a team of ~14 engineers across operations and SRE excellence

·Build team maturity and reduce dependency on senior members

·Develop strong ownership and accountability within the team

Cross-team Collaboration

·Work closely with Engineering, Product and Platform teams

·Ensure smooth coordination during incidents and releases

·Communicate effectively with stakeholders during high-severity situations

·Collaborate with stakeholders to align reliability and platform strategies with business goals

Cost & Efficiency

·Partner with teams to optimize cloud usage and reduce unnecessary spend

·Balance reliability improvements with cost efficiency

·Ensure security best practices are followed across infrastructure and applications in collaboration with security teams

What we’re looking for:

Bachelor’s degree in Computer Science, Engineering, or a related field
Previous experience in a leadership or mentoring role, guiding and supporting junior team members
10+ years of experience in SRE with proven experience managing production systems and 24x7 operations teams
Strong hands-on experience with AWS and Kubernetes (EKS preferred)
Strong understanding of incident management, RCA, and production support models
Experience with monitoring/observability tools (Datadog, CloudWatch, ELK, Prometheus, Grafana)
Experience driving automation and reducing operational toil
Understanding of microservices-based architectures
Strong knowledge of release processes and production readiness practices
Strong understanding of SLAs, SLIs, SLOs, and reliability metrics
Good understanding of cloud cost optimization (FinOps basics)
Exposure to or experience supporting AI/ML workloads
Strong leadership skills with experience managing and mentoring teams
Ability to stay calm and lead during high-severity incidents
Strong communication and stakeholder management skills
Structured problem-solving and decision-making ability
Certification in relevant technologies (e.g., AWS, Kubernetes) is a plus
Strategic mindset with ability to align reliability initiatives with business goals
Strong analytical and problem-solving skills for handling complex production issues
Understanding of security best practices across infrastructure and applications
Ability to standardize processes and improve operational consistency

What’s in it for you:

Hybrid work arrangements and competitive paid time off programs.
Comprehensive medical insurance coverage to meet all your healthcare needs.
Competitive compensation with a corporate bonus program and uncapped commission for quota-carrying Sales roles.
A creative, innovative, and global working environment in the creative and software technology industry.
Highly engaged Events Committee to keep work enjoyable.
Reward & Recognition Programs (including President's Club for all functions).
Professional onboarding program, including robust targeted training for the Sales function.
Development and advancement opportunities (high internal mobility across the organization).
Retirement planning options to save for your future, and so much more!
