Senior AWS Platform Engineer (HPC Enablement)

Indeed
Full-time
Onsite
No experience limit
No degree limit

Description

Summary: Seeking a Senior Cloud Engineer to own and operate an AWS platform, build standardized infrastructure, automation, and observability to support HPC workloads at scale.

Highlights:

1. Own and operate a critical AWS platform for HPC workloads
2. Build standardized infrastructure, automation, and observability
3. Lead technical ownership and drive standards across teams

We are looking for a **Senior Cloud Engineer** to own and operate an AWS platform that enables an HPC team to run workloads reliably at scale. You will build standardized infrastructure, automation, observability, and scaling across multi-account AWS and Kubernetes. Apply to help deliver robust cloud foundations.

**Responsibilities**

* Own the AWS environment and platform operations that support HPC workloads at scale
* Provision and manage AWS accounts via internal self-service tooling and standardized patterns
* Build and maintain Terraform code to provision AWS resources and HPC-oriented clusters
* Design and operate centralized CI/CD pipelines to manage all accounts and clusters from a single repository
* Migrate remaining AWS accounts into the central repository and standardize infrastructure patterns
* Operate and support an in-cluster container registry (Harbor) and related platform components
* Implement and complete observability rollout across the AWS environment, including metrics, logs, dashboards, and alerting
* Support Kubernetes cluster operations and troubleshoot platform issues impacting HPC workloads
* Own and improve Cast AI as the primary mechanism for cluster scaling and optimization
* Design and support cross-cloud data transfer and networking solutions such as AWS DataSync and Interconnect between AWS and GCP
* Collaborate with the HPC team to translate requirements into implemented platform solutions
* Coordinate working hours to maintain at least 4 hours overlap with Houston time zone and occasional overlap with Australia

**Requirements**

* 3+ years of hands-on experience with Amazon Web Services in multi-account environments
* Infrastructure-as-code experience with Terraform (HCL/tofu), including modules and state
* Kubernetes operations experience, including troubleshooting clusters and workloads
* Proven ability to lead technical ownership as a staff-level individual contributor and drive standards across teams
* Strong project execution skills to take requirements, evaluate options, and deliver solutions with minimal guidance
* Advanced programming skills in Python for automation, tooling, and integrations
* Strong scripting skills in Bash for operational automation
* Solid CI/CD and GitOps workflow knowledge using tools such as GitLab CI or GitHub Actions
* Strong observability skills across metrics, logs, dashboards, and alerting using Prometheus and Grafana
* Experience with cluster scaling and cost optimization using Cast AI or similar tooling
* Ability to use AI-assisted tools for code generation, debugging, and documentation in daily work
* Upper-Intermediate English proficiency (CEFR B2)

**Nice to have**

* Google Cloud Platform experience, especially in cross-cloud integrations with AWS
* High-performance computing (HPC) experience with schedulers or data-intensive pipelines

Source: Indeed
João Silva
Indeed · HR

Company

Indeed
© 2025 Servanan International Pte. Ltd.