Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
