[Remote] Staff Machine Learning Systems Engineer (MLOps)

Worldwide Salaried Open

Note: This is a remote position open to candidates in the USA. Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. The company is seeking a Staff Machine Learning Systems Engineer to design, build, and operate the production infrastructure that powers AI across the company, with a focus on the critical systems that support AI teams in a regulated healthcare environment.

Responsibilities

Own and scale the AI compute and deployment platform
Own and evolve our containerized application deployment platform and related systems for AI workloads, including general process and job orchestration (e.g. Kubernetes): cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production
Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably
Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release
Drive efficiency and cost management across compute, autoscaling, and inference infrastructure
Operate and scale inference infrastructure and a multi-provider LLM gateway (e.g. Bedrock, Vertex, and other providers), including credentials, rate limits, and failover
Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level
Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company
Own the LLM/AI observability and tracing stack — provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse) — so AI behavior is auditable and debuggable in production
Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders
Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability
Own and improve the monorepo build system and CI/CD pipelines for AI workloads — including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution
Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily
Identify and eliminate platform bottlenecks — reducing CI/CD cycle times, build latency, and deployment friction — to improve developer velocity across the Applied AI organization
Build IAM, OIDC, and secrets management as first-class infrastructure — scoped, least-privilege roles, write-only secret rotation, and cross-account access audits
Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy-first
Partner with clinical, legal, security, and data platform teams (including Databricks/Unity Catalog access governance) to enforce compliant, auditable data access
Drive multi-quarter infrastructure initiatives, from cluster and deployment architecture to inference platform, GPU compute strategy, and observability evolution
Write and lead technical design documents and design reviews, define infrastructure standards and development-workflow conventions, and contribute to technical governance across AI engineering
Mentor engineers on reliability engineering, infrastructure-as-code, and MLOps best practices, and bridge the gap between prototypes and production-grade systems

Skills

8+ years of professional experience in infrastructure, platform, DevOps, or SRE engineering — with at least 3 years focused on ML/AI systems in production
Deep, hands-on experience with Kubernetes (ideally EKS) and the cloud-native ecosystem — autoscaling, GitOps, Helm/Kustomize, operating clusters at scale, and general process/job orchestration
Strong infrastructure-as-code skills (Terraform) and experience designing secure cloud architectures: IAM, OIDC, secrets management, and least-privilege access
Strong proficiency in Python, with experience building production infrastructure tooling, CLIs, and data/observability pipelines
2+ years of experience operating LLM-based systems in production (LLMOps) — inference routing, serving, tracing, and the reliability patterns needed to run them at scale
Hands-on experience with observability/tracing stacks (Datadog, OpenTelemetry, Langfuse, or equivalent) and metrics/log/trace pipelines
Experience designing and maintaining CI/CD pipelines, build systems, and developer tooling for fast-moving engineering teams
A systems-and-operations mindset: you think about failure modes, SLOs, observability, security, and long-term maintainability before shipping
Experience writing and leading technical design documents (TDDs/RFCs) for infrastructure-scale initiatives
Strong collaboration skills across engineering, ML, product, security, and clinical teams
A deep appreciation for safety, privacy, and security — ideally with experience in a regulated domain such as healthcare, fintech, or life sciences
Experience with AWS (EKS, Bedrock, S3, CloudFront, IAM) and multi-cloud (GCP/Vertex AI) inference routing
Experience with Databricks (MLflow, Unity Catalog, Spark, Delta) and data platform access governance
Experience provisioning LLM observability infrastructure (Langfuse, ClickHouse, OpenTelemetry/OTLP tracing, LogFire) and LLM behavior monitoring
Experience with Karpenter, cluster autoscaling, and cost optimization for ML compute
Experience with monorepo build systems (Pants, Bazel) and large-scale CI/CD
Experience building automated PR-review / convention-enforcement pipelines and developer-workflow standards
Familiarity with Vertex AI Agent Builder, Vertex AI Model Registry, or GCP managed AI/ML services as a stretch growth area
Contributions to open-source infrastructure, IaC modules, SDKs, or developer tooling projects

Benefits

Competitive salary & equity compensation for full-time roles
Unlimited PTO, company holidays, and quarterly mental health days
Comprehensive health benefits including medical, dental & vision, and parental leave
Employee Stock Purchase Program (ESPP)
401k benefits with employer matching contribution
Offsite team retreats

Company Overview

Hims & Hers Health, Inc. (better known as Hims & Hers) is a multi-specialty telehealth platform building a virtual front door to the healthcare system. It was founded in 2017 and is headquartered in San Francisco, California, USA, with a workforce of 501-1000 employees. Its website is https://www.hims.com.
