Tekgence Inc

Site Reliability Engineer

Posted: 14 hours ago

Job Description

Job Title: Site Reliability Engineer (SRE)
Duration: 6+ months
Location: Haifa, Haifa District, Israel (Hybrid, 2-3 days a week)
Experience: 5 years

On-prem infrastructure management

  • Manage on-prem infrastructure.
  • Maintain uptime, reliability and readiness of the on-prem engineering cloud spread across multiple data centers.

Guard SLAs

  • Guard service level agreements (SLAs) for critical engineering services.
  • Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets.
  • Perform root cause analysis and post-mortems of incidents for any threshold breaches.

Observability

  • Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance.
  • Maintain KPI pipelines using Jenkins, Python and ELK.
  • Improve monitoring systems by adding custom alerts based on business needs.

Automation & Optimization

  • Help with capacity planning, optimization, and efforts to improve utilization.

Day-to-Day Support

  • Support user-reported issues.
  • Monitor alerts and take necessary action.
  • Actively participate in war rooms for critical issues.

Collaboration & Documentation

  • Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.

Tech stack

  • Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.
  • Automation using Jenkins, Python, Go, and Bash.
  • Infrastructure tools such as Kubernetes, MySQL, Prometheus, Grafana, and ELK.
  • Familiarity with hardware such as GPUs and Tegra is a plus.

