Sunday, October 26, 2025
Qlay

[Remote] Site Reliability Engineer

Posted: 17 hours ago

Job Description

IMPORTANT NOTE: If you have experience with GPUs, please mention it clearly in your projects/work experience (e.g., years of experience, responsibilities, or achievements in previous work).

As a Site Reliability Engineer, your key responsibilities include:

  • Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.
  • Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.
  • CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.
  • GPU fleet management – run NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs when they appear.
  • Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
  • Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.
  • Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.

Requirements:

  • Preferably a graduate of a top university worldwide.
  • 4+ years of experience as a Site Reliability Engineer.
  • Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
  • Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.
  • Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.
  • Deep understanding of Linux and networking fundamentals.
  • Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.
  • Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.
  • Solid background with Prometheus and Grafana at scale.
  • Language: Working-level proficiency in English.

Benefits:

  • Paid vacations
  • Annual bonus: 1-month salary

This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.

Devices: You will be expected to use your own computer to perform the work.

Sole Employment: No second job is permitted.

