Job Description
I'm looking for a hands-on ML Infrastructure Engineer to help scale and optimize large-scale training systems for robotics and AI. This is a high-impact role working close to the GPUs, driving inference, ML Ops, and distributed training at scale.

What you’ll do:
- Build and maintain infrastructure for large-scale training (scheduling, orchestration, checkpointing, metrics).
- Scale JAX-based pipelines across GPU/TPU clusters for high-throughput experiments.
- Optimize performance across data pipelines, model loops, and distributed sync.
- Partner with researchers to turn ideas into production-ready training runs.
- Manage cloud compute resources (AWS, GCP TPU/GKE, Kubernetes, SLURM).

What we’re looking for:
- Strong software engineering skills in ML infrastructure/platforms.
- Hands-on experience with JAX (preferred), PyTorch, or TensorFlow.
- Proven expertise in distributed training and performance optimization.
- Strong communicator who thrives collaborating with researchers and engineers.
- A scrappy, ownership-driven builder who loves scaling systems fast.

This is a rare chance to work at the intersection of foundation models and robotics, helping shape the future of physical AI.