Fluidstack is the AI Cloud Platform. We build GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.
Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.
We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.
You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.
As a Member of Technical Staff at Fluidstack, you will design, develop, and maintain software solutions that power our AI infrastructure and enable our customers to run complex ML workloads efficiently at scale.
Your responsibilities are aligned with the success of our customers and your teammates, and you'll work side-by-side with them to push forward the state of the art in AI/ML. A day's work may include:
Developing and optimizing job scheduling systems to maximize GPU utilization and throughput for ML workloads
Building and improving software interfaces for cluster management that support PyTorch, JAX, and other ML frameworks
Creating monitoring and observability tools for tracking training progress, resource usage, and system performance
Implementing data pipeline optimizations to accelerate training and inference workflows
Designing and developing APIs and services to integrate with MLflow, Kubeflow, Weights & Biases, and other ML tooling
Writing libraries and utilities to simplify the deployment and management of distributed training jobs
We are looking for candidates who are customer-centric, have a bias to action, and thrive in ambiguity. We expect strong communication skills, a low ego, and a positive attitude.
In terms of skills, if any of the bullet points below sound like you, please reach out!
You have developed software for training or serving large-scale ML models (1000+ GPU scale)
You have optimized distributed training performance across multiple nodes and accelerators
You have implemented APIs and interfaces for ML platforms that prioritize developer experience
You have experience with orchestration systems like Kubernetes or SLURM in the context of large-scale ML workloads
You have built or contributed to ML infrastructure tools (e.g., Ray, Horovod, DeepSpeed), and have experience with ML experiment tracking and workflow systems (MLflow, Kubeflow, W&B)
After submitting your application, the team will review your resume. If your application passes this stage, you will be invited to a 30-minute screen. If you clear this initial phone interview, you will enter the main process, which consists of three 45-minute interviews: a technical deep dive, a customer communications and debugging session, and a culture-fit interview.
Our goal is to finish the main process within one week. All interviews will be conducted virtually.
Competitive total compensation package (cash + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.
Fluidstack is remote-first but has offices in key locations. For all other locations, we provide access to WeWork.