Staff DevOps Engineer
About the Company
Runware is a cutting-edge technology company based in the United Kingdom, dedicated to building the API layer for the next generation of AI products. Our platform provides teams with fast, reliable access to real-time inference across thousands of models through a single flexible API. We enable customers to build and scale media generation products with better performance, lower cost, and less operational complexity.
Behind this is an infrastructure platform built for speed, reliability, and GPU scale. New models launch constantly, and customer traffic can grow quickly. Performance matters at every layer. Our mission is to empower innovators to create and deploy AI solutions that transform industries and improve lives.
At Runware, we value a culture of innovation, collaboration, and continuous learning. We believe in giving our team members the autonomy to make decisions, the freedom to experiment, and the support to grow in their careers. This role is central to our growth: you will help us build, operate, and scale the infrastructure behind our global AI inference platform, making our systems faster, more resilient, and easier to operate.
Key Responsibilities
As a Staff DevOps Engineer at Runware, you will play a critical role in designing, building, and operating the systems that power real-time AI inference across large-scale GPU fleets and a global production platform. Your work will directly shape how quickly we can launch new models, scale customer traffic, recover from failures, and deliver low-latency AI experiences to millions of users.
- Build and scale the infrastructure that powers real-time AI inference across GPU fleets, bare-metal servers, serverless and containerised production systems
- Help evolve Runware’s platform toward more elastic, on-demand infrastructure that can scale quickly with customer traffic and model demand
- Make Runware faster, more reliable and more resilient by improving the critical paths behind our request entrypoints, inference services, queues, storage, load balancers and networking layer
- Automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback
- Build the observability backbone for a high-performance AI platform, with the signals needed to spot issues early, understand capacity and fix problems before customers feel them
- Play a leading role in production operations, incident response, debugging and post-incident improvements, helping us turn operational challenges into a stronger platform
- Strengthen the security and compliance foundations of our infrastructure through patching, secrets management, access controls, hardening, auditability, documentation and repeatable operational processes
- Collaborate with cross-functional teams to ensure seamless integration of infrastructure with other components of the platform
- Develop and maintain documentation of infrastructure design, deployment, and operations
- Participate in on-call rotations to ensure 24/7 coverage of our production systems
Requirements & Qualifications
Must-Have
- Strong experience as a DevOps Engineer, SRE, Infrastructure Engineer, Platform Engineer or similar, with a track record of running production systems at scale
- Deep Linux knowledge and confidence debugging real production issues across networking, storage, performance, services and system behaviour
- Hands-on experience building automation, Infrastructure-as-Code, CI/CD pipelines and deployment workflows that make infrastructure safer and easier to operate
- Experience operating high-availability, low-latency or high-throughput platforms where reliability and performance directly affect customers
- Strong networking fundamentals across TCP/IP, DNS, load balancing, routing, firewalls, proxies, TLS and HTTP
- A calm and pragmatic approach to problem-solving, with the ability to work under pressure and make sound decisions in critical situations
- Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams
Nice-to-Have
- Experience with GPU-accelerated computing and AI workloads
- Knowledge of containerisation technologies such as Docker and Kubernetes
- Familiarity with cloud providers such as AWS, GCP or Azure
- Experience with monitoring and logging tools such as Prometheus, Grafana and ELK
- Knowledge of security best practices and compliance frameworks such as PCI-DSS, HIPAA or SOC 2
Technical Skills
Cloud & Infrastructure
We use a combination of bare-metal servers, serverless and containerised production systems to power our platform. Our infrastructure is built for speed, reliability, and GPU scale, and we are looking for someone who can help us optimise and scale it further.
- AWS
- GCP
- Azure
- Docker
- Kubernetes
Databases
We store and manage platform data in both relational and NoSQL databases. Our ideal candidate will have experience with database design, deployment, and operations.
- MySQL
- PostgreSQL
- MongoDB
- Cassandra
CI/CD & Automation
We are looking for someone who can help us automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback.
- Jenkins
- GitLab CI/CD
- Ansible
- Terraform
What We Offer
We offer a competitive annual salary of 160,000 to 200,000, depending on experience, along with a range of benefits, including:
- Remote flexibility: work from anywhere in the world
- Equity/stock options: own a part of the company
- Learning budget: continuous learning and professional development
- Health/dental/vision: comprehensive health insurance
- PTO policy: generous paid time off
- Equipment stipend: latest technology and tools
- Team culture: collaborative, innovative and dynamic team
If you are looking for a challenging and rewarding role that will help you grow as a professional, we encourage you to apply.
Frequently Asked Questions
What is the remote work setup like?
We are a fully remote company, and we believe in giving our team members the flexibility to work from anywhere in the world. We use a range of tools to facilitate communication and collaboration, including Slack, Zoom and GitHub.
What is the hiring process and timeline?
We are looking to fill this role as soon as possible and are interviewing on a rolling basis. The hiring process typically takes 2 to 3 weeks, and we will keep you updated on your status throughout.
What is the team size and tech stack?
We are a small but growing team, and we are looking for someone who can help us scale our infrastructure and operations. Our tech stack includes a range of tools and technologies, including AWS, GCP, Azure, Docker, Kubernetes, Jenkins, GitLab CI/CD, Ansible and Terraform.