Job Purpose
As a Senior DevOps Engineer at Tarjama&, you will own the design, reliability, and scalability of our cloud-based production systems. You will lead the architecture, automation, and operation of our Azure/AWS infrastructure, drive security, performance, and cost efficiency across our platforms, and set DevOps standards and best practices for the team. You will act as a technical authority on cloud operations, mentor junior engineers, and partner with engineering leadership to shape the platform roadmap.
Duties & Responsibilities
Cloud System Operations
- Lead the design, deployment, and operation of production systems on Azure and AWS.
- Own system performance, incident response, and root-cause analysis across production applications.
- Define release engineering and pre-production validation standards to ensure system quality and functionality.
- Architect and enforce backup, disaster recovery, and cost optimization (FinOps) strategies across cloud environments.
- Lead container orchestration and workload management on AKS/Kubernetes clusters, including upgrades, scaling, and hardening.
Automation and Scripting
- Design and maintain enterprise-grade automation frameworks for operational and platform processes.
- Build reusable tooling and scripts (e.g., Python, Bash, PowerShell) for automation, observability, and incident response.
- Lead GitOps adoption and continuous-delivery practices using ArgoCD or Flux.
Security and Compliance
- Define and enforce cloud security best practices, IAM policies, and secrets management across environments.
- Establish and maintain security protocols and compliance posture (e.g., the ISO 27001 and SOC 2 controls relevant to infrastructure).
Monitoring and Metrics
- Architect and operate observability platforms (metrics, logging, tracing) across Azure/AWS, defining SLOs, SLIs, and alerting strategy.
- Drive operational excellence by analyzing reliability metrics and leading post-incident reviews and improvement initiatives.
Research and Evaluation
- Evaluate and recommend emerging technologies, tools, and architectural patterns for adoption.
- Lead vendor and product evaluations, including proofs-of-concept and total-cost-of-ownership analysis.
Communication and Collaboration
- Mentor junior and mid-level DevOps engineers through code reviews, pairing, and technical guidance.
- Partner with engineering, security, and product stakeholders to define technical requirements and influence platform direction.
- Communicate effectively with executive and technical audiences on cloud strategy, risk, and roadmap.
Education, Experience & Qualifications
- Bachelor’s degree in Computer Science, Information Systems, or a related field.
- 6+ years of hands-on experience in DevOps, Cloud Engineering, or SRE roles, including 3+ years with a primary focus on Microsoft Azure (required).
- Expert-level Kubernetes administration, including cluster lifecycle management, upgrades, networking, and security hardening.
- Production experience operating Azure Kubernetes Service (AKS) at scale.
- Strong experience designing and maintaining Infrastructure as Code with Terraform, including module design and state management.
- Deep experience designing and operating CI/CD pipelines (e.g., GitHub Actions, Azure DevOps, GitLab CI).
- Hands-on experience with observability stacks (Prometheus, Grafana, Azure Monitor, ELK, or similar), including dashboard and alert design.
- Strong Linux system administration knowledge.
- Experience working with GitOps tools such as ArgoCD or Flux.
- Working knowledge of database administration in production (backups, performance tuning, HA/DR, and troubleshooting).
- Strong scripting and automation skills in Python, Bash, and/or PowerShell.
- Strong analytical and problem-solving abilities.
- Ability to collaborate effectively within cross-functional teams.
- Clear and precise documentation and communication skills.
- Fluency in both English and Arabic (spoken and written).
Behavioral Competencies
- Ability to Work Under Pressure
Technical Competencies
- Cloud Computing Fundamentals
- Networking Protocols and Topologies
- Monitoring and Logging Tools
- Backup and Disaster Recovery Concepts
- Container Orchestration (AKS / Kubernetes)