




Job Summary: Ensure platform and application reliability, availability, and performance through automation, observability, and reliability engineering, collaborating with development and infrastructure teams.

Key Highlights:

1. Proactively identify and mitigate issues
2. Improve application observability, performance, and resilience
3. Automate operational processes to increase efficiency

**The Mission**

Ensure the reliability, availability, and performance of the company's platforms and applications by automating operational processes, continuously improving observability, and advancing reliability engineering practices. Collaborate closely with development and infrastructure teams to build resilient, scalable, and efficient systems, ensuring appropriate service cost, quality, and availability levels.

**Your Challenges Will Include**

* Ensuring production system stability and reliability by proactively identifying and mitigating issues;
* Collaborating with development teams to improve application observability, performance, and resilience;
* Automating operational processes and infrastructure routines to increase efficiency and reduce human errors;
* Monitoring environments and applications, analyzing metrics, logs, and alerts to identify issues before they impact users;
* Analyzing and resolving incidents, contributing to continuous service improvement and recurrence prevention;
* Supporting the evolution of deployment, CI/CD, and infrastructure platforms;
* Participating in defining and improving system architectures alongside development squads;
* Promoting best practices in reliability, observability, and automation across the technology area;
* Contributing to continuous improvement initiatives for the platform and engineering processes.

**Above all, you must align with our purpose: valuing people so each can build their own story.**

**It Is Also Desirable That You Have**

* A completed bachelor's degree in Computer Science, Systems Analysis and Development, or a related field;
* Experience in technology, preferably in operations, infrastructure, or system reliability;
* Knowledge of Linux operating systems;
* Experience with containers and orchestration (Docker and Kubernetes);
* Experience with cloud environments, preferably AWS;
* Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK, or similar);
* Experience with automation and infrastructure-as-code (Terraform, CloudFormation, or similar);
* Knowledge of version control (Git);
* Experience with CI/CD pipelines;
* Knowledge of scripting or automation (Shell, Python, or similar);
* Familiarity with deployment tools and GitOps (considered a plus).


