




**Job Summary:**

A technically experienced professional responsible for leading Site Reliability Engineering (SRE) initiatives, with a focus on ensuring the availability, scalability, resilience, and observability of the company's critical applications and services. Acts as a technical reference within the team, promoting reliability best practices, fostering the DevOps culture, and supporting strategic decision-making in collaboration with architecture, security, and development teams.

**Key Highlights:**

1. Lead Site Reliability Engineering (SRE) initiatives
2. Serve as a technical reference and foster the DevOps culture
3. Collaborate with development, infrastructure, and security teams

**Responsibilities and Duties:**

* Collaborate with development, infrastructure, and security teams to design, build, and maintain reliable and scalable systems;
* Participate in planning and executing load, chaos, and failover tests to identify bottlenecks and mitigate risks;
* Develop and maintain automation tools for monitoring, deployment, rollback, and incident response;
* Monitor and respond to critical incidents, conducting root cause analysis (RCA) and proposing preventive actions;
* Support the evolution of CI/CD processes, infrastructure-as-code (IaC), and security;
* Lead automation, observability, and performance initiatives for critical systems;
* Design, implement, and evolve monitoring, metrics, distributed tracing, and logging solutions;
* Conduct incident reviews (postmortems), including root cause analysis and structured action plans;
* Identify and apply continuous improvements to SLOs, SLIs, and SLAs;
* Serve as the focal point for failure mitigation, recovery, and business continuity planning;
* Lead the reliability and resilience culture across the entire organization;
* Provide infrastructure support when necessary, ensuring operational continuity of environments;
* Document solutions, architectures, technical standards, dashboards, and operational procedures;
* Mentor junior and mid-level professionals, promoting technical upskilling and best practices.

**Requirements and Qualifications:**

* Operating Systems (Linux): Advanced;
* Cloud (AWS, GCP, or Azure): Advanced;
* Networking (TCP/IP, DNS, HTTP): Advanced;
* Git and Version Control: Advanced;
* Docker (Containers): Advanced;
* Kubernetes / Orchestration: Advanced;
* Logging (ELK, Loki): Advanced;
* Monitoring and APM: Advanced;
* CI/CD: Advanced;
* Infrastructure-as-Code (IaC): Advanced;
* Security (DevSecOps principles): Advanced;
* Experience with on-premises environments, including:
  * Active Directory (AD)
  * Office 365 / Microsoft 365
  * Firewalls (Fortinet, Palo Alto, or similar)
  * Access Points and corporate networks

**Benefits:**

* Meal Voucher (R$ 997.70/month);
* Food Voucher (R$ 771.13/month);
* SulAmérica Health Insurance (private room);
* SulAmérica Dental Insurance;
* Profit Sharing (PLR);
* Christmas Bonus Card (Alelo) R$ 771.13;
* Life Insurance;
* Daycare or Babysitter Allowance (R$ 502.29 until age 6);
* 6-month maternity leave and 20-day paternity leave;
* Birthday day off;
* Early 13th salary payment in May;
* TotalPass;
* OnHappy.

**Employment Type:** CLT

***All our positions are inclusive of Persons with Disabilities (PwD) and all forms of diversity.***


