




Job Description: Tools and Technical Stack

Programming Languages
* Python (advanced: Pandas, NumPy, PySpark, Polars)
* SQL (expert-level: query optimization, execution plan analysis)
* Scala or Java (desirable)
* Bash/shell scripting

Data Platforms (expertise in at least 2)
* Databricks (notebooks, clusters, jobs, Delta Lake, Unity Catalog, Auto Loader)
* Snowflake (architecture, performance optimization, Time Travel, Iceberg)
* AWS (Redshift, S3, Glue, Athena, Lake Formation)
* Azure (Synapse Analytics, Data Lake Storage, MS Fabric)
* Google BigQuery and Dataflow

Transformation, Orchestration, and Processing
* dbt, with expertise in modularization, testing, documentation, and CI/CD
* Apache Spark, with performance optimization, partitioning, and caching
* Apache Airflow (DAGs, operators, sensors, SLAs) or Dagster/Prefect

Infrastructure, DevOps, and Version Control
* Terraform (multi-cloud infrastructure as code)
* CloudFormation (AWS)
* GitHub Actions or GitLab CI
* Docker and containerization
* Advanced Git

Essential Technical Skills

Design and Architecture
* Design of scalable, secure, and resilient end-to-end architectures
* Understanding of batch processing vs. real-time streaming
* Design of data contracts and schema governance
* Evaluation of appropriate technologies for each use case

Performance and Optimization
* Complex SQL: execution plans, indexing, partitioning
* Spark optimization: RDDs vs. DataFrames, shuffle, memory management (illustrated in the sketch after this section)
* Cloud cost optimization: spot instances, reserved capacity, partitioning
* Troubleshooting performance bottlenecks

Security and Compliance
* RBAC and granular access control
* Encryption at rest and in transit
* Compliance with LGPD, GDPR, and SOC 2
* Data masking and anonymization
* Secret management (AWS Secrets Manager, Azure Key Vault)

Agile Development and Collaboration
* Scrum methodology, sprints, and effort estimation
* Cross-functional collaboration with Product, Business, and Engineering teams
* Clear communication of complex technical requirements to non-technical audiences

Certifications

Desirable
* Databricks Certified Data Engineer Professional
* AWS Certified Data Engineer Associate or Solutions Architect Professional
* Microsoft Certified: Azure Data Engineer Associate (DP-203)
* Google Cloud Professional Data Engineer (if applicable)

Complementary
* Terraform Associate certification
* dbt Fundamentals or Advanced
* Apache Airflow Fundamentals

Behavioral Skills

Strategic and Consultative Thinking
* Strategic thinking that weighs trade-offs (cost, complexity, performance)
* Consultative mindset: questioning requirements, proposing alternatives, educating stakeholders

Problem-Solving and Resilience
* Resilience and the ability to debug in complex environments
* Handling ambiguity and performance bottlenecks

Leadership and Development
* Mentoring and coaching junior professionals
* Elevating the team's technical proficiency
* Executive communication that translates technical concepts

Continuous Learning and Ownership
* Continuous learning aligned with platform evolution
* Ownership of the quality of delivered solutions
* Effective collaboration in cross-functional environments

Competitive Differentiators
* Experience in Machine Learning Engineering and MLOps
* Contributions to open-source projects (Spark, dbt, Airflow)
* State-of-the-art data quality frameworks
* Expertise in GenAI/LLM pipelines
* Speaking at technical events and publications
* Data contracts and API-first data platforms
* Data Governance certification
* Multilingual fluency (Portuguese, English, and Spanish)
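To ground the "Complex SQL" and "Spark optimization" items above, here is a minimal PySpark sketch combining a CTE and a window function with key-based partitioning and caching. It is an illustration only: the table and column names (orders, customer_id, order_ts, amount) are hypothetical placeholders, not part of the role's actual stack.

```python
# Hypothetical table/columns: orders, customer_id, order_ts, amount.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-spark-sketch").getOrCreate()

# Key the data by the column the window below partitions on; a cache()
# here would pay off if this DataFrame fed several downstream queries.
orders = spark.table("orders").repartition("customer_id").cache()
orders.createOrReplaceTempView("orders_v")

# CTE + window function: latest order per customer.
latest = spark.sql("""
    WITH ranked AS (
        SELECT customer_id,
               order_ts,
               amount,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_ts DESC
               ) AS rn
        FROM orders_v
    )
    SELECT customer_id, order_ts, amount
    FROM ranked
    WHERE rn = 1
""")

latest.explain()   # inspect the physical plan: scans, exchanges, shuffles
```

Because the DataFrame is already hash-partitioned on customer_id, the window computation can usually reuse that distribution instead of introducing a second shuffle; explain() is how that would be confirmed against the actual plan.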
Data Solution Architecture and Implementation
* Design and maintain scalable, resilient, and optimized data pipelines in Lakehouse and Data Mesh architectures
* Implement end-to-end solutions (ingestion, transformation, quality, governance) across multiple cloud platforms (AWS, Azure, GCP, Databricks, Snowflake)
* Optimize query and storage performance with a focus on cost efficiency

GenAI and Machine Learning Projects
* Serve as technical consultant for GenAI/ML pipelines
* Prepare data for model training, fine-tuning, and large-scale inference
* Optimize architectures for machine learning workloads

Technical Leadership and Strategic Consulting
* Act as a technical expert with clients, advising on emerging technologies
* Train junior professionals through workshops and knowledge sharing
* Serve as a bridge between cross-functional teams, translating requirements into technical solutions
* Communicate complex recommendations to both technical and executive audiences

DevOps and Observability
* Implement version control, automated testing, and CI/CD practices
* Configure observability and pipeline monitoring
* Apply Agile and DataOps methodologies in delivery

Mandatory Experience
* Solid experience in data engineering, operating autonomously on complex projects from ingestion through to data availability for analytical products
* In-depth experience in cloud environments (AWS, Azure, or GCP), including architecture, pipelines, security, and observability
* Practical expertise in Databricks and/or Snowflake, including Unity Catalog, Delta Lake, Lakehouse, and best practices for governance and versioning
* Proven experience in consulting or multi-client squads, managing multiple concurrent projects with diverse stakeholders
* Active participation in GenAI/ML initiatives, including data preparation, organization, and quality assurance for AI models
* Ability to lead technical discussions, define standards, and guide teams on data engineering and governance best practices

Cloud Platform Expertise
* AWS: S3, EC2, Lambda, Glue, EMR, Athena, Lake Formation, Redshift, DataBrew, RDS, DynamoDB, SQS, SNS
* Azure: Data Factory, Data Lake Storage, Synapse Analytics, Azure Machine Learning, Cosmos DB, MS Fabric
* GCP: BigQuery, Dataflow, Cloud Composer (Airflow), Pub/Sub
* Data migration between platforms and evaluation of multi-cloud solutions

Data Architecture and Design
* Deep knowledge of Lakehouse/Delta Lake and the Medallion Architecture (Bronze, Silver, Gold) (see the sketch after this section)
* Implementation of Data Mesh and domain-driven data architecture
* OLAP and OLTP modeling
* Design of solutions that process terabytes of data in both real-time and batch modes

Essential Technical Stack
* Apache Spark and PySpark, with distributed workload optimization
* Advanced SQL (window functions, CTEs, performance tuning)
* dbt for declarative transformation
* Apache Airflow or equivalent for orchestration
* Terraform for infrastructure as code
* Git and CI/CD practices
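As an illustration of the Medallion Architecture item above, here is a minimal Bronze-to-Silver step using PySpark with Delta Lake. The paths, table names, and columns (events, event_id, event_ts) are hypothetical, and the sketch assumes a runtime where Delta Lake is available (e.g., Databricks); a production pipeline would add schema enforcement, data-quality checks, and incremental or streaming loads.

```python
# Hypothetical paths and columns; assumes a runtime with Delta Lake available.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw events as-is, preserving source fidelity.
raw = spark.read.json("s3://example-bucket/raw/events/")
(raw.write
    .format("delta")
    .mode("append")
    .save("s3://example-bucket/bronze/events/"))

# Silver: typed, deduplicated, quality-filtered records for downstream use.
bronze = spark.read.format("delta").load("s3://example-bucket/bronze/events/")
silver = (
    bronze
    .dropDuplicates(["event_id"])                         # drop replayed events
    .withColumn("event_ts", F.to_timestamp("event_ts"))   # enforce types
    .filter(F.col("event_id").isNotNull())                # basic quality gate
)
(silver.write
    .format("delta")
    .mode("overwrite")
    .save("s3://example-bucket/silver/events/"))
```

A Gold layer would then aggregate Silver into business-level tables; on Databricks the same flow is typically expressed with Auto Loader for ingestion and Unity Catalog table names rather than raw storage paths.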


