Job Description

Site Reliability Engineer - Level 1 (MSP – Frontline Operations)

The Site Reliability Engineer (SRE) is the first responder for MSP customers, ensuring incident management, triage, and resolution within SLA-defined timelines. As the front-line escalation point, this role diagnoses, troubleshoots, and resolves cloud infrastructure issues, escalating complex incidents to specialized teams (Senior SREs, SecOps, FinOps, or Cloud Engineering) when required.

Key Responsibilities

  • Act as the first responder for MSP client incidents, providing rapid troubleshooting, diagnosis, and resolution within SLA timelines.
  • Triage incoming issues to determine whether they can be resolved directly or require escalation to specialized teams.
  • Monitor cloud environments using AWS CloudWatch, Datadog, and FreshService to detect performance, security, and availability issues proactively.
  • Maintain incident records, documenting resolution steps, escalation timelines, and root causes.
  • Collaborate with Senior SREs and internal teams to ensure seamless incident resolution and escalation workflows.
  • Participate in post-incident reviews, contributing insights to improve operational efficiency and prevent future incidents.
  • Provide 24/7 on-call support for one full week per month, ensuring availability for high-priority incidents and urgent operational needs.

Required Skills & Experience

  • 2+ years of experience in cloud operations, incident management, or a similar role, with hands-on AWS experience.
  • Strong knowledge of AWS core services (compute, networking, storage, IAM).
  • Hands-on experience with monitoring and logging tools (e.g., AWS CloudWatch, Datadog, ELK stack, Prometheus, Grafana).
  • Understanding of Kubernetes (ability to support containerized applications) is a plus.
  • Familiarity with incident management systems (e.g., FreshService) for tracking and responding to alerts.
  • Excellent troubleshooting and problem-solving skills, able to quickly determine root causes of issues.
  • Strong organizational and multitasking abilities, especially under pressure in fast-paced operational environments.
  • Good communication skills, ensuring clear documentation and effective collaboration with technical teams.
  • Familiarity with ITIL-based service management workflows is a plus.

Required Qualifications

  • AWS Certified SysOps Administrator or AWS Solutions Architect Associate is a bonus.
  • Terraform Associate Certification or equivalent hands-on experience (preferred, but not required for front-line troubleshooting).
  • Experience working in an MSP or cloud operations team (preferred).
  • Ability to work independently, prioritize incidents effectively, and escalate when necessary.
