[Remote] Staff Machine Learning Systems Engineer (MLOps)

Worldwide Salaried Open

Note: This is a remote position open to candidates in the USA. Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. They are seeking a Staff Machine Learning Systems Engineer to design, build, and operate the production infrastructure that powers AI across the company, focusing on critical systems that support AI teams in a regulated healthcare environment.

Responsibilities

  • Own and scale the AI compute and deployment platform
  • Own and evolve our containerized application deployment platform and related systems for AI workloads, including general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production
  • Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably
  • Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release
  • Drive efficiency and cost management across compute, autoscaling, and inference infrastructure
  • Operate and scale inference infrastructure and a multi-provider LLM gateway (e.g. Bedrock, Vertex), including credentials, rate limits, and failover
  • Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level
  • Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company
  • Own the LLM/AI observability and tracing stack — provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse) — so AI behavior is auditable and debuggable in production
  • Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders
  • Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability
  • Own and improve the monorepo build system and CI/CD pipelines for AI workloads — including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution
  • Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily
  • Identify and eliminate platform bottlenecks — reducing CI/CD cycle times, build latency, and deployment friction — to improve developer velocity across the Applied AI organization
  • Build IAM, OIDC, and secrets management as first-class infrastructure — scoped, least-privilege roles, write-only secret rotation, and cross-account access audits
  • Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first
  • Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access
  • Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution
  • Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering
  • Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems

Skills

  • 8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering — with at least 3 years focused on ML/AI systems in production
  • Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem — autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration
  • Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access
  • Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
  • 2+ years of experience operating LLM-based systems in production (LLMOps) — inference routing, serving, tracing, and the reliability patterns needed to run them at scale
  • Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
  • Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams
  • A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping
  • Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
  • Strong collaboration skills across engineering, ML, product, security, and clinical teams
  • A deep appreciation for safety, privacy, and security — ideally with experience in a regulated domain such as healthcare, fintech, or life sciences
  • Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
  • Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
  • Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
  • Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
  • Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
  • Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
  • Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services as a stretch growth area
  • Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects

Benefits

  • Competitive salary & equity compensation for full-time roles
  • Unlimited PTO, company holidays, and quarterly mental health days
  • Comprehensive health benefits including medical, dental & vision, and parental leave
  • Employee Stock Purchase Program (ESPP)
  • 401k benefits with employer matching contribution
  • Offsite team retreats

Company Overview

  • Hims & Hers Health, Inc. (better known as Hims & Hers) is a multi-specialty telehealth platform building a virtual front door to the healthcare system. It was founded in 2017, and is headquartered in San Francisco, California, USA, with a workforce of 501-1000 employees. Its website is https://www.hims.com.