Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
