Job Description

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role

We are seeking a Senior Infrastructure Engineer to design, deploy, and manage the physical infrastructure powering Moonlite's GPU clusters and high-performance computing environments. You will be responsible for building and operating scalable, reliable compute, storage, and networking infrastructure that powers AI training, inference, and research workloads. This role focuses on the hardware and provisioning layer (servers, GPUs, networking equipment, firmware, and bare-metal provisioning systems), ensuring our infrastructure is tuned for performance and reliability. You will partner closely with network engineers, systems engineers, and SREs to deliver robust infrastructure at scale.

Job Responsibilities

Infrastructure Design & Deployment: Architect and implement GPU and compute infrastructure at the server, rack, and system level for AI workloads across co-located data center environments.
Bare-Metal Provisioning: Deploy and manage bare-metal servers using provisioning tools like Canonical MAAS, building automated workflows for server lifecycle management from installation through decommissioning.
Hardware & Firmware Management: Develop and maintain systems to monitor hardware health, manage firmware updates across compute/storage/network equipment, and automate recovery processes.
GPU Operations: Troubleshoot GPU-related performance issues at the driver, kernel, or firmware level and optimize configurations for training and inference workloads.
Infrastructure Automation: Build automation using Ansible, Terraform, and Python to eliminate manual provisioning, streamline patching processes, and enable scalable infrastructure operations.
Performance Monitoring: Monitor system performance, identify bottlenecks in compute/storage/networking layers, and proactively address reliability or capacity issues.
Cross-Team Collaboration: Work closely with network engineers, systems engineers, and SREs to ensure cohesive infrastructure operations and seamless integration with Kubernetes and platform orchestration layers.
Vendor Management: Serve as primary point of contact for hardware escalations, RMAs (Return Material Authorizations), and vendor relationships for compute/storage/networking equipment.

Requirements

Experience: 5+ years in infrastructure engineering, systems engineering, or hardware-focused roles, preferably with AI/HPC workloads.
Linux Expertise: Strong background in Linux systems administration, performance tuning, and troubleshooting at the system level.
Bare-Metal Provisioning: Hands-on experience with bare-metal provisioning tools (MAAS or similar) and automated deployment workflows.
DCIM & Documentation: Familiarity with data center infrastructure management tools (NetBox, Device42, or similar) for asset tracking, network documentation, and maintaining an infrastructure source of truth.
Hardware & GPU Systems: Familiarity with server hardware, GPU configurations, drivers, and system-level performance optimization.
Automation Skills: Proficiency with Ansible, Terraform, and scripting (Python, Bash) for infrastructure automation and operational efficiency.
Infrastructure Operations: Experience deploying and maintaining physical infrastructure in production data center environments.
Problem-Solving: Ability to troubleshoot complex hardware, firmware, and system issues under pressure.
Collaboration: Comfortable working with cross-functional teams, including network engineers, systems engineers, and platform developers, to resolve infrastructure challenges.

Preferred Qualifications

Experience with GPU workload orchestration platforms (Kubernetes, SLURM) and their infrastructure requirements.
Familiarity with high-performance networking (InfiniBand, RDMA, RoCE) and spine-leaf network architectures.
Experience with monitoring and observability tools (Prometheus, Grafana).
Understanding of Kubernetes infrastructure requirements (compute, storage, and networking layers).
Exposure to co-located data center operations or building infrastructure for regulated environments.
Background supporting research institutions, HPC facilities, or enterprise AI infrastructure.

Key Technologies

Linux, Canonical MAAS, NetBox, Terraform, Ansible, Python, NVIDIA GPU Drivers/Tools, High-Performance Networking, Enterprise Storage Systems, Prometheus, Grafana

Why Moonlite

Build the Future of AI Infrastructure: Join a pioneering team shaping scalable solutions for the enterprise. Your work will directly impact the deployment and usability of AI at scale.
Hands-On Ownership: As an early engineer, you'll have end-to-end ownership of projects and the autonomy to influence our product and technology direction.
Collaborate with Experts: Work alongside seasoned engineers and industry professionals passionate about high-performance computing, innovation, and problem-solving.
Start-Up Agility with Industry Impact: Enjoy the dynamic, fast-paced environment of a startup while making an immediate impact in an evolving and critical technology space.

We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

Senior Infrastructure Engineer

This Job Has Expired