Monday, October 27, 2025

Job Description

We are seeking an experienced Site Reliability Engineering (SRE) Tech Lead to enhance the reliability, scalability, and performance of our systems. You will lead the SRE team, drive infrastructure improvements across public and private clouds, and ensure operational excellence. This role requires both technical depth and strong leadership to guide our growing engineering organization.

Key Responsibilities:

  • Define and execute SRE strategy and roadmap for service reliability and scalability.
  • Lead monitoring, alerting, and observability platform improvements.
  • Establish and track SLOs/SLAs, managing Error Budgets effectively.
  • Identify and resolve performance and latency issues across systems.
  • Act as incident commander during outages and lead Root Cause Analysis (RCA).
  • Drive automation of operational processes to improve efficiency.
  • Mentor and develop SRE engineers, ensuring high team performance.
  • Collaborate with cross-functional teams (Dev, Infra, Security) to foster a DevOps culture.

Must-Haves:

  • 5+ years in SRE or infrastructure engineering, with 2+ years in a lead role.
  • Strong experience with Kubernetes, CI/CD tools, and cloud platforms (AWS, GCP, Azure).
  • Expertise in monitoring, alerting, and logging tools (Prometheus, Grafana, ELK, Datadog).
  • Solid understanding of UNIX systems, networking (TCP/IP, HTTP), and automation (Shell, Python).
  • Proven leadership, communication, and problem-solving skills.

Nice-to-Haves:

  • Experience in web application development or test automation.
  • Hands-on experience with observability improvements and Error Budgets.
  • Familiarity with cross-cultural global teams.

Job Application Tips

  • Tailor your resume to highlight relevant experience for this position
  • Write a compelling cover letter that addresses the specific requirements
  • Research the company culture and values before applying
  • Prepare examples of your work that demonstrate your skills
  • Follow up on your application after a reasonable time period
