




* Lead the architecture, implementation, and maintenance of observability solutions (monitoring, logging, tracing) for distributed systems.
* Develop and implement automation tools and processes for infrastructure provisioning, configuration, and management (IaC).
* Collaborate closely with development teams to integrate observability from the early stages of the software lifecycle.
* Design and build effective dashboards and alerts that provide actionable insights into the health and performance of our services.
* Act as an evangelist for DevOps and observability practices, mentoring team members and fostering a culture of continuous improvement.
* Actively participate in incident resolution, leveraging observability data to rapidly diagnose and mitigate issues.
* Identify bottlenecks and propose performance and scalability improvements across our entire technology stack.
* Stay up-to-date with the latest trends and technologies in DevOps and observability.

**Essential Requirements**

* Solid experience as a DevOps Engineer or Site Reliability Engineer (SRE), with proven focus on observability.
* Proficiency in monitoring tools and platforms (e.g., Prometheus, Grafana, Datadog, New Relic, Zabbix).
* Experience with centralized logging systems (e.g., ELK Stack: Elasticsearch, Logstash, Kibana; Loki; Splunk).
* Practical knowledge of distributed tracing tools (e.g., Jaeger, Zipkin, OpenTelemetry).
* Experience with infrastructure-as-code (IaC) using tools such as Terraform, Ansible, CloudFormation, or Pulumi.
* Strong expertise in container orchestration platforms, especially Kubernetes.
* Proficiency in at least one scripting language (e.g., Python, Go, Bash, Ruby).
* Experience with cloud providers (AWS).
* Solid understanding of networking, operating systems, and security concepts.
* Ability to work in an agile environment and effectively collaborate with cross-functional teams.


