Senior Site Reliability Engineer (SRE) - Observability

Indeed

Full-time

Onsite

No experience limit

No degree limit

Description

Job Summary: The Asaas team is seeking a Senior SRE with a focus on Observability to evolve our monitoring strategy, ensuring full and proactive visibility into our platform.

Key Highlights:

1. Work in an agile, collaborative, and challenging environment.
2. Lead the observability strategy with metrics, logs, and traces.
3. Promote observability culture and SRE best practices.

If you’re passionate about innovation and seek to work in an agile, collaborative, and challenging environment, this could be your opportunity!

The **Cloud** team at **Asaas** is looking for an expert in **Observability** to ensure full and proactive visibility across our platform. You will play a pivotal role in building and evolving our observability strategy, working across all three pillars: metrics, logs, and traces.

As a **Senior SRE** focused on **Observability**, you will be responsible for implementing and enhancing our monitoring solutions, ensuring that our teams have the necessary information to make fast and accurate decisions. Your expertise in tools such as Prometheus, Grafana, OpenTelemetry, and SRE practices will be essential to guaranteeing the reliability and performance of our platform.

Quality and observability are fundamental to serving over 230,000 customers! If you share this vision, join our team!

Do you reside outside Joinville? No problem! This opportunity is open to remote/home office work.

**Responsibilities and Duties**

* Design, implement, and evolve the company’s observability platform covering the three pillars: metrics, logs, and traces;
* Implement and maintain observability stacks;
* Define and implement instrumentation standards for applications and infrastructure;
* Create strategic and operational dashboards that provide actionable insights for teams;
* Define, monitor, and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs), managing error budgets;
* Implement intelligent alerting systems, reducing noise and focusing on actionable alerts;
* Collaborate with development teams to improve application observability, promoting instrumentation practices;
* Lead incident response from an observability perspective, ensuring rapid root cause identification;
* Conduct detailed post-mortem analyses and propose data-driven improvements based on observability insights;
* Promote and disseminate observability culture and SRE best practices across the organization;
* Plan and execute capacity management strategies based on metrics;
* Optimize costs and performance of observability solutions at scale;
* Automate processes for collection, processing, and visualization of observability data;
* Document architectures, runbooks, and observability-related procedures.

**Requirements and Qualifications**

* Proven experience implementing and managing observability platforms at scale;
* In-depth knowledge of Prometheus, including PromQL, service discovery, federation, and remote write;
* Advanced experience with Grafana for dashboard creation, alerting, and data source management;
* Knowledge of distributed tracing (Jaeger, Tempo, X-Ray) and correlation among metrics, logs, and traces;
* Experience with OpenTelemetry for application instrumentation;
* Knowledge of scalable logging solutions (Loki, ELK Stack, CloudWatch Logs);
* Experience with Cloud Computing, especially AWS;
* Experience with containers (Docker) and orchestration (Kubernetes, ECS);
* Hands-on experience with Infrastructure as Code (IaC) (AWS CDK, Terraform);
* Knowledge of SRE practices, including SLIs, SLOs, Error Budgets, and Toil Reduction;
* Proficiency in scripting languages (Python, Bash) and at least one programming language (Go, Java);
* Understanding of Linux systems and diagnostic tools;
* Experience managing incidents and post-mortem processes.

**Nice-to-Haves**

* AWS certifications (DevOps Engineer, Solutions Architect);
* Experience with Grafana Mimir for large-scale metrics;
* Knowledge of Thanos for high-availability Prometheus deployments;
* Experience with APM tools (Datadog, New Relic, Dynatrace);
* Knowledge of eBPF for low-level observability;
* Experience in fintechs or regulated environments;
* Knowledge of Machine Learning applied to AIOps and anomaly detection;
* Experience with Chaos Engineering and resilience testing;
* In-depth knowledge of networking and protocols (TCP/IP, DNS, HTTP/S);
* Proficiency with Git, GitHub, and GitFlow;
* Practical experience with agile methodologies (Scrum, Kanban);
* Experience with relational databases (PostgreSQL, MySQL) and NoSQL databases (MongoDB, DynamoDB, Redis).

**Additional Information**

* Flexible working hours: 8 hours per day (Monday to Friday; Saturdays are not compensated);
* CLT employment contract.

**We are a Fintech**, a Payment Institution accredited by the Central Bank of Brazil, and **our purpose is to maximize business productivity through technology.** We offer a comprehensive solution for billing, payments, and receivables anticipation, and serve over 200,000 customers, including freelancers, micro-entrepreneurs (MEI), and large enterprises.

Our dream began in 2010 in Joinville/SC, and we believe the sky is not the limit for our growth. That’s why our team is now spread across Brazil! **Over 1,000 people dream together with Asaas: collaboratively, innovatively, efficiently, with autonomy and freedom to soar high.**

Soaring high demands resources to live and work better, plus the freedom to manage them. That’s why we welcome and care for our team by offering benefits that support personal and professional growth:

**For health and well-being:** We offer comprehensive medical and dental plans (no co-pay), life insurance, medication purchase assistance, and support for physical activities. Additionally, Neon is our partner for financial health, and Zenklub supports physical and mental health (we offer four free monthly therapy or nutritionist sessions). At our headquarters, we also offer *quick massage*.

**For meals and family:** Our flexible meal benefit is provided via a Visa-branded credit card. The balance can be used however each person prefers. At our headquarters, we offer *free food*, and for families, we provide daycare assistance, parental support programs, and extended maternity and paternity leave.

**For education and growth:** Beyond a challenging and highly developmental environment, we offer in-house training platforms and an education assistance program covering 70% of tuition fees for undergraduate degrees and language courses, as well as course and book purchases, so our team never stops learning.

**For high-quality remote work:** We offer home office allowance, work equipment, furniture allowance, and partner with WOBA so our employees can use coworking spaces across Brazil whenever they wish. Explore our headquarters in Joinville/SC via **this virtual tour**!

**Extras, because the Dream Team deserves them:** We offer a birthday day off, happy hour allowance, referral bonuses, annual goal-based bonuses, stock option plans, and a relaxed, no-dress-code environment!

Source: Indeed