




Summary: Join our platform team as a Middle Site Observability Engineer to enhance observability, provide operational support for Kubernetes, and improve reliability for AI research on Azure Stack.

Highlights:

1. Enhance observability for Kubernetes production services for AI research
2. Deliver operational support and improve reliability practices
3. Collaborate with engineering and research teams to raise observability standards

We are strengthening our platform team with a **Middle Site Observability Engineer** to keep Kubernetes production services stable for AI research on Azure Stack. You will enhance observability, handle business-hours operational support, and work closely with engineering and research partners to improve reliability and processes. Apply now.

**Responsibilities**

* Develop, operate, and enhance observability capabilities, including dashboards and visualizations in Grafana or similar tools
* Establish and maintain metrics, SLIs, SLOs, and alerting approaches for production platforms
* Deliver business-hours operational support for Kubernetes-based environments through troubleshooting, log analysis, and metrics-driven investigations
* Assist with production operations for SQL-based systems by diagnosing issues and supporting performance investigations
* Investigate incidents and system behavior to identify root causes, participate in post-incident reviews, and propose improvements to monitoring and reliability practices
* Partner with engineering, platform, and research teams to raise observability standards, refine operational processes, and increase system reliability
* Create and maintain documentation, share knowledge across the team, and drive ongoing improvement activities

**Requirements**

* Hands-on experience of 2+ years in Site Reliability Engineering, DevOps, or Production Support for live production systems
* Practical knowledge of observability and monitoring stacks such as Grafana, Prometheus, Elastic Stack, or Datadog
* Solid understanding of Linux systems with strong troubleshooting abilities and log analysis skills
* Background supporting Kubernetes-based production environments
* Working experience with SQL production support, including query troubleshooting and basic performance analysis
* Proficiency in automation scripting using Python, Bash, or similar languages
* Ability to assess incidents, determine root causes, and contribute to continuous improvement efforts
* Effective communication skills and comfort collaborating with distributed, cross-functional teams
* English proficiency at an intermediate to advanced level (B1–C1)


