Job Description

About us:
Adex International is an IT services company with expertise in Cloud and DevOps. We help companies of all verticals and sizes get the best out of their technology systems through our expertise and approach.

Our team is our biggest strength, and we value our members above anything else. We pride ourselves on our team, work environment, and employee-first approach, which strongly emphasizes employee wellbeing and growth.

Job Responsibilities:
We are seeking a highly skilled, proactive, and motivated Site Reliability Engineer (SRE) to join our Ops Level 2 Team. You will be a key player in resolving complex incidents, driving automation, optimizing operational processes, and collaborating with various engineering and support teams to maintain a high-quality service for our customers.

If you thrive in challenging, fast-paced environments, possess deep expertise in cloud-native technologies (especially Kubernetes), and are passionate about solving complex system issues, we encourage you to apply.

As Site Reliability Engineer (SRE) - Ops Level 2, your responsibilities would typically be to:

1. Incident Management & Resolution (70%):
- Serve as an escalation point for complex incidents that cannot be resolved by Level 1 support, driving efficient resolution to minimize downtime and prevent further escalations.
- Utilize advanced monitoring, observability (e.g., Prometheus, Grafana, ELK Stack), and automation tools to rapidly diagnose, troubleshoot, and resolve critical issues across cloud (AWS, Azure, GCP), Linux, and Kubernetes environments.
- Implement and maintain automated remediation workflows and playbooks to accelerate incident resolution and reduce manual toil.
- Participate in a periodic on-call rotation to respond to critical incidents outside of business hours.
- Contribute to achieving a high resolution rate for Level 2 incidents with minimal escalations to Level 3.

2. Operational Excellence & Automation (20%):
- Proactively identify and implement automation opportunities using scripting (Python, Go, Bash), configuration management (Ansible), and orchestration platforms (e.g., Jenkins, ArgoCD) to enhance operational efficiency and reduce manual workloads.
- Develop, refine, and maintain comprehensive Standard Operating Procedures (SOPs), runbooks, and troubleshooting guides for common and complex operational issues.
- Collaborate with engineering teams to integrate operational insights into system design, contributing to more resilient and observable systems (Shift-Left SRE).

3. Incident Analysis & Collaboration (10%):
- Conduct deep-dive analysis of incident trends to identify root causes, recurring problems, and systemic weaknesses. Propose and implement preventative measures and long-term solutions.
- Facilitate strong collaboration and communication with NOC, Engineering, Product, and other support teams to ensure alignment, effective knowledge transfer, and continuous improvement.
- Contribute to post-incident reviews (PIRs/RCAs) to extract learnings and drive actionable improvements.

Key Technical Skills & Experience:

1. Deep Expertise in Kubernetes (5+ years of experience):
- Extensive experience troubleshooting and resolving complex issues within Kubernetes clusters (e.g., pod connectivity, OOMKilled errors, DaemonSets, StatefulSets).
- Proficiency in Kubernetes administration, including managing namespaces, deployments, services, ingress, and persistent volumes (PVs/PVCs).
- Hands-on experience with Kubernetes autoscaling (HPA, VPA, Cluster Autoscaler) and familiarity with modern cluster management tools like Karpenter.
- Understanding of Kubernetes control plane components (API Server, etcd, Scheduler, Controller Manager) and their purpose.
- Ability to diagnose and resolve inter-pod communication issues within the same or different namespaces.
- Must have hands-on experience creating Kubernetes clusters from scratch and extensive experience troubleshooting production issues with a customer-defined RTO of 15 minutes.

2. Cloud Infrastructure & Administration (AWS/Azure/GCP):
- 5-10 years of hands-on experience troubleshooting and managing resources in public cloud environments (AWS strongly preferred).
- Proficiency in diagnosing and resolving issues related to EC2, VPC, Load Balancers (ALB/NLB), Route 53, S3, RDS, Lambda, and other core cloud services.
- Experience with Infrastructure as Code (IaC) tools like CloudFormation or Terraform, including performing rollbacks.
- Understanding of backend Lambda communication patterns with global services.

3. Linux System Administration (Expert Level):
- In-depth knowledge of Linux/Unix operating systems, including process management, file systems (e.g., inodes), networking, and troubleshooting tools (strace, tcpdump, lsof, top, vmstat, iostat).
- Strong understanding of memory management in Linux and ability to diagnose related issues.
- Experience diagnosing and resolving issues on remote servers across different regions.

4. Networking & Load Balancing:
- Solid understanding of OSI, TCP/IP, DNS, HTTP/S, ARP, and network troubleshooting.
- Experience with web server management (Nginx, Apache) and global traffic management configurations.
- Ability to develop capabilities in troubleshooting network switches, firewalls, and VPNs (e.g., during datacenter outages).

5. Monitoring & Observability:
- Expertise in setting up, configuring, and utilizing monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack, Splunk, Datadog) to identify, investigate, and resolve infrastructure issues.

6. Automation & CI/CD:
- Proficiency in scripting languages (Python, Go, Bash) and automation platforms (Ansible, Chef, Puppet).
- Experience with CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions) for automated deployments and infrastructure changes.
- Strong version control skills with Git (GitHub).

7. Security & Compliance:
- Familiarity with secrets management (e.g., HashiCorp Vault) and certificate management (SSL/TLS).

Core Soft Skills:
- Exceptional Problem-Solving: Ability to rapidly analyze complex technical incidents under pressure, identify root causes, and implement effective solutions.
- Strong Communication & Collaboration: Excellent verbal and written communication skills to articulate complex technical issues clearly to diverse audiences (technical and non-technical). Proven ability to work effectively with cross-functional teams.
- Adaptability & Resilience: Thrive in a fast-paced, dynamic environment, handling multiple priorities and quickly adapting to new technologies and challenges.
- Time Management & Initiative: Proactive approach to identifying and addressing operational inefficiencies, prioritizing critical escalations, and driving continuous improvement.
- Customer/Stakeholder Focus: A deep commitment to maintaining system uptime, optimizing resolution efficiency, and ensuring a positive experience for internal and external stakeholders.

Education and Experience:
- Bachelor’s degree in Computer Science, IT, Data Science, Engineering, or a related field.
- 5+ years of professional experience in a related field.

Why Join us?
Adex is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. We carefully select candidates, test them for technical competency, and emphasize communication skills.
- Great Learning & Development Opportunities
- Industry Leading People and Policies
- Work-life Balance
- Fun Fridays
- Employee wellbeing
- Lunch and snacks at Office
- Interesting compensation and benefits
- Stellar opportunity to work with a rising company
- A fast-paced tech environment
- Weekends off (Saturday & Sunday)
- Attractive Fringe benefits

How to apply?
We're constantly seeking exceptional individuals eager to bring their impressive talents to our team. If you're a dynamic force of skill and enthusiasm, join us in shaping our team's success! To apply, simply send your updated resume to careers@adex.ltd.

Site Reliability Engineer

This Job Has Expired