Job Description
About SiFi
SiFi is a rapidly growing B2B Fin-Tech company transforming expense management for businesses in Saudi Arabia. As an EMI licensed by the Saudi Central Bank, we empower companies with innovative tools to simplify finance management.

Position Overview
As a Senior Site Reliability Engineer (SRE), you will provide scalable, reliable, durable, and secure global database services for our clients’ cloud infrastructure hosted on AWS or GCP. The Senior SRE will help build highly reliable cloud services using a customer-first approach while innovating technically. You will understand our customers' needs and how we can meet them.

Primary Responsibilities
- Identify significant projects that result in substantial improvements in reliability, cost savings, and/or revenue.
- Identify changes in the product architecture from the reliability, performance, and availability perspectives with a data-driven approach.
- Influence the product roadmap and work with engineering and product counterparts to improve the resiliency and reliability of the GitLab product.
- Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage, making GitLab cheaper to run for all our customers.
- Identify parts of the system that do not scale, provide immediate palliative measures, and drive long-term resolution of these incidents.
- Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication:
- Lead initiatives, including problem definition and scoping, design, and planning, through epics and blueprints.
- Maintain deep domain knowledge and share it through recorded demos, technical presentations, discussions, and Incident Reviews.
- Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.
- For stable counterpart assignments, maintain awareness of and actively influence stage group plans and priorities through participation in stage group meetings and async discussions. Act as a champion for reliability.

Influence and Maturity:
- Set an example for a team of SREs with positive and inclusive leadership and discussion on work.
- Show ownership of a significant part of the infrastructure.
- Be trusted to de-escalate conflicts inside the team.

Qualifications
- 5+ years of related experience.
- Experience with application-specific production support, incident management, problem management, RCAs, and service restoration to quickly respond to and resolve production issues.
- Experience collaborating with engineering and development teams to evaluate and identify optimal cloud solutions.
- Ability to plan and achieve high availability and performance of the product service.
- Development/coding experience and skills for writing custom automation solutions.
- Strong understanding of web hosting infrastructure and high availability architecture.
- Demonstrated knowledge of fundamental cloud security (e.g., Identity and Access Management, ACLs, firewalls).
- Deep understanding of AWS cloud services and how to leverage them for computing, storage, and managed services including, but not limited to, databases, managed Kubernetes, ECS, and Python/Django application services.
- Strong experience in Infrastructure as Code (IaC) technologies like Terraform.
- Familiarity with Kubernetes-specific platform components, such as ingress controllers, cluster DNS, autoscalers, and others.