Winspire Group

Site Reliability Engineer (SRE)

Posted: 1 hour ago

Job Description

🎮 Welcome to Winspire Group! 🌍

We are a leading online entertainment company at the forefront of the digital gaming industry, offering both real money and free-to-play casino experiences to players worldwide. Since 2018, we’ve grown into a global force with 200+ team members, operations in 24+ countries, and multiple gaming licenses.

✨ About Us

At Winspire Group, we combine cutting-edge technology with immersive entertainment to create unforgettable gaming experiences. Our story is built on innovation, player satisfaction, and responsible gaming practices.

🚀 Our Mission

Our mission is to leverage expertise and passion to provide unmatched value to our customers and stakeholders. We are dedicated to creating a safe, engaging, and secure gaming environment while setting new standards of excellence in online entertainment.

💡 Innovation & Technology

Through continuous innovation and strategic partnerships, our platform integrates advanced technology with gameplay, delivering an experience that goes beyond geographical boundaries. We embrace change and push boundaries to create solutions that exceed expectations.

🙌 Join Our Team

At Winspire Group, we believe in collaboration, growth, and inclusivity. Our values (integrity, innovation, excellence, customer focus, continuous improvement, and diversity) guide everything we do. Join our global team and be part of a company shaping the future of online entertainment.

🎯 Role Summary

As a Site Reliability Engineer (SRE), you will play a critical role in maintaining the reliability, performance, and scalability of our systems and applications. You will design and manage monitoring infrastructure, respond to incidents, automate processes, and support system improvements, all with the aim of ensuring exceptional service for our users worldwide.

💼 Key Responsibilities

🔍 Monitoring & Observability
  • Design and implement proactive monitoring and alerting solutions using tools like Prometheus, Grafana, Loki, CloudWatch, and Datadog
  • Analyze telemetry data to identify anomalies and root causes before they affect end users
  • Maintain high visibility into system health and performance through dashboards and SLO/SLI tracking

🚨 Incident & Problem Management
  • Respond to incidents, perform thorough analysis, and contribute to postmortems
  • Collaborate with Engineering, DevOps, and QA teams to reduce MTTR and eliminate recurring issues
  • Use tools like Opsgenie, Jira, and Slack for efficient alerting, coordination, and documentation

🤖 Automation & Efficiency
  • Automate deployment, configuration, and monitoring tasks using Bash, Python, and Ansible
  • Monitor and maintain self-healing infrastructure using Kubernetes, Docker, and IaC tooling

🤝 Collaboration & Communication
  • Serve as a reliability champion across technical teams
  • Lead and document incident response practices and cross-functional drills
  • Partner with Data, Product, and Engineering teams to optimize end-to-end delivery
  • Fluent English; Russian is a plus

🧩 Key Requirements

✔️ Must-Have Skills
  • Proficiency in observability tools: Prometheus, Grafana, Loki, LogQL, cloud-native monitoring (AWS/GCP), Jaeger, Datadog, Checkly
  • Strong scripting abilities in Bash, Python, and/or PowerShell
  • Hands-on experience with Kubernetes, Docker, IaC tools (e.g., Ansible, Git), and CI/CD flows
  • Fluency in incident lifecycle management (ITIL familiarity is a plus)
  • Knowledge of SLIs/SLOs/SLAs and how to manage them effectively

🌐 Preferred Skills
  • Familiarity with Opsgenie, Playwright (for automated testing), and Checkly
  • Working knowledge of databases: SQL Server, BigQuery, MySQL
  • Linux system administration experience
  • Security awareness (basic understanding of tools like Kali Linux, Wireshark, nmap, etc.)
  • Experience with monitoring as code, specifically using Checkly’s CLI, JavaScript/TypeScript SDKs, or Terraform
  • Strong ability to debug complex user flows with Playwright trace viewer, DOM snapshots, and network waterfalls for failing browser checks in distributed environments

🎓 Courses and Certifications (a plus)
  • AWS Cloud Practitioner
  • Google Cloud Professional or Associate certifications, or Azure Fundamentals
  • LPIC-1 Linux Administrator
  • Fundamentals of Infrastructure as Code (IaC)
  • Datadog Fundamentals, APM & Distributed Tracing Fundamentals, and Log Management Fundamentals
  • Getting Started with Synthetic Monitoring and Browser Testing by Datadog Learning

🌟 What We Offer
  • Corporate events and team building activities
  • Participation in corporate sports events and team challenges: stay active and bond with colleagues outside the office!
  • Lunch allowance
  • Hybrid work format as per internal WFH policy
  • Health insurance from Day 1
  • Birthday and anniversary gifts
  • Inclusive, dynamic, and innovative work culture

