EPAM Systems

Site Reliability Engineer/Architect (SRE)

Posted: 1 minute ago

Job Description

We are seeking an experienced and accomplished Site Reliability Engineer/Architect (SRE) to join our dynamic, fast-paced team. In this pivotal leadership role, you will be entrusted with architecting and implementing advanced SRE practices to ensure the reliability, scalability, and efficiency of our Generative AI (GenAI) enablement platform for enterprise use cases. The position offers a unique opportunity to work with cutting-edge technologies, collaborate with peers to drive technical excellence, and shape the operational strategy for an enterprise-grade, multi-cloud platform.

Responsibilities

  • Define and implement SRE principles, frameworks, and methodologies to ensure platform reliability and stability
  • Establish Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to create measurable reliability goals aligned with business objectives
  • Collaborate effectively with stakeholders, including senior leadership, to align the SRE vision with the overall technical and organizational strategy
  • Architect resilient systems by adopting innovative practices such as canary deployments, shadow traffic, and testing in production environments
  • Ensure uninterrupted operational reliability for a multi-cloud, multi-tenant enterprise platform
  • Optimize incident response practices and tools to ensure efficiency and effectiveness, implementing automated solutions where appropriate
  • Implement robust logging, tracing, and monitoring systems to provide real-time insight, detect faults, and optimize performance proactively
  • Collaborate with engineering teams to integrate observability frameworks into platform components, improving deployment and runtime confidence
  • Spearhead automation initiatives to reduce manual operational tasks and improve system scalability
  • Foster a strong culture of operational excellence through thought leadership and mentorship, promoting an SRE-first mindset within all teams
  • Collaborate with engineering and product teams to craft scalable designs with reliability embedded throughout the software development lifecycle
  • Build partnerships with Director-level leadership to conceptualize, prioritize, and deliver on long-term SRE goals

Requirements

  • A minimum of 7 years of professional experience in site reliability engineering, software engineering, or DevOps roles
  • Strong coding skills in languages such as Python, Go, or Java, with the ability to implement solutions to algorithmic challenges
  • Proven expertise in designing and managing multi-cloud environments (e.g., AWS, Azure, GCP), distributed systems, and multi-tenant architectures
  • Knowledge of CI/CD pipelines, microservices, and containerization technologies like Kubernetes, Docker, and Helm
  • Background in monitoring and observability tools like Prometheus, Grafana, OpenTelemetry, or Dynatrace
  • Competency in incident management and production troubleshooting using tools like PagerDuty or similar
  • Solid understanding of modern SRE concepts, including SLIs, SLOs, fault injection, and canary releases
  • Familiarity with security best practices for cloud-native architectures and multi-tenant platforms

Nice to have

  • Knowledge of cloud platforms such as AWS, Azure, or GCP, with experience applying multi-cloud strategies
  • Background in the fundamentals of Generative AI technologies and related workflows

We offer

We gather like-minded people:
  • Engineering community of industry professionals
  • Friendly team and enjoyable working environment
  • Flexible schedule and opportunity to work remotely within Poland
  • Chance to work abroad for up to 60 days annually
  • Business-driven relocation opportunities

We provide growth opportunities:
  • Outstanding career roadmap
  • Leadership development, career advising, soft skills, and well-being programs
  • Certification (GCP, Azure, AWS)
  • Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru
  • English classes

We cover it all:
  • Stable income (Employment Contract or B2B)
  • Participation in the Employee Stock Purchase Plan
  • Benefits package (health insurance, multisport, shopping vouchers)
  • Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
  • Referral bonuses
  • Corporate, social and well-being events

Please, note:
  • The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview.
  • We will reach out to selected candidates exclusively.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

