About the Company

Runware is a cutting-edge technology company based in the United Kingdom, dedicated to building the API layer for the next generation of AI products. Our platform provides teams with fast, reliable access to real-time inference across thousands of models through a single flexible API. We enable customers to build and scale media generation products with better performance, lower cost, and less operational complexity.

Behind this is an infrastructure platform built for speed, reliability, and GPU scale. New models launch constantly, and customer traffic can grow quickly. Performance matters at every layer. Our mission is to empower innovators to create and deploy AI solutions that transform industries and improve lives.

At Runware, we value a culture of innovation, collaboration, and continuous learning. We believe in giving our team members the autonomy to make decisions, the freedom to experiment, and the support to grow in their careers. This role matters to our company's growth because it will help us build, operate, and scale the infrastructure behind our global AI inference platform, ensuring that our systems are faster, more resilient, and easier to operate.

Key Responsibilities

As a Staff DevOps Engineer at Runware, you will play a critical role in designing, building, and operating the systems that power real-time AI inference across large-scale GPU fleets and a global production platform. Your work will directly shape how quickly we can launch new models, scale customer traffic, recover from failures, and deliver low-latency AI experiences to millions of users.

Build and scale the infrastructure that powers real-time AI inference across GPU fleets, bare-metal servers, serverless and containerised production systems
Help evolve Runware’s platform toward more elastic, on-demand infrastructure that can scale quickly with customer traffic and model demand
Make Runware faster, more reliable and more resilient by improving the critical paths behind our request entrypoints, inference services, queues, storage, load balancers and networking layer
Automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback
Build the observability backbone for a high-performance AI platform, with the signals needed to spot issues early, understand capacity and fix problems before customers feel them
Play a leading role in production operations, incident response, debugging and post-incident improvements, helping us turn operational challenges into a stronger platform
Strengthen the security and compliance foundations of our infrastructure through patching, secrets management, access controls, hardening, auditability, documentation and repeatable operational processes
Collaborate with cross-functional teams to ensure seamless integration of infrastructure with other components of the platform
Develop and maintain documentation of infrastructure design, deployment, and operations
Participate in on-call rotations to ensure 24/7 coverage of our production systems

Requirements & Qualifications

Must-Have

Strong experience as a DevOps Engineer, SRE, Infrastructure Engineer, Platform Engineer or similar, with a track record of running production systems at scale
Deep Linux knowledge and confidence debugging real production issues across networking, storage, performance, services and system behaviour
Hands-on experience building automation, Infrastructure-as-Code, CI/CD pipelines and deployment workflows that make infrastructure safer and easier to operate
Experience operating high-availability, low-latency or high-throughput platforms where reliability and performance directly affect customers
Strong networking fundamentals across TCP/IP, DNS, load balancing, routing, firewalls, proxies, TLS and HTTP
A calm and pragmatic approach to problem-solving, with the ability to work under pressure and make sound decisions in critical situations
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams

Nice-to-Have

Experience with GPU-accelerated computing and AI workloads
Knowledge of containerisation technologies such as Docker and Kubernetes
Familiarity with cloud providers such as AWS, GCP or Azure
Experience with monitoring and logging tools such as Prometheus, Grafana and ELK
Knowledge of security best practices and compliance frameworks such as PCI-DSS, HIPAA or SOC 2

Technical Skills

Cloud & Infrastructure

We use a combination of bare-metal servers, serverless and containerised production systems to power our platform. Our infrastructure is built for speed, reliability, and GPU scale, and we are looking for someone who can help us optimise and scale it further.

AWS
GCP
Azure
Docker
Kubernetes

Databases

We use a variety of databases to store and manage data for our platform, including relational and NoSQL databases. Our ideal candidate will have experience with database design, deployment, and operations.

MySQL
PostgreSQL
MongoDB
Cassandra

CI/CD & Automation

We are looking for someone who can help us automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback.

Jenkins
GitLab CI/CD
Ansible
Terraform

What We Offer

We offer a competitive annual salary of 160,000 to 200,000, depending on experience, along with a range of benefits, including:

Remote flexibility: work from anywhere in the world
Equity/stock options: own a part of the company
Learning budget: continuous learning and professional development
Health/dental/vision: comprehensive health insurance
PTO policy: generous paid time off
Equipment stipend: latest technology and tools
Team culture: collaborative, innovative and dynamic team

If you are looking for a challenging and rewarding role that will help you grow as a professional, we encourage you to apply.

Frequently Asked Questions

What is the remote work setup like?

We are a fully remote company, and we believe in giving our team members the flexibility to work from anywhere in the world. We use a range of tools to facilitate communication and collaboration, including Slack, Zoom and GitHub.

What is the hiring process and timeline?

We are looking to fill this role as soon as possible and are conducting interviews on a rolling basis. The hiring process typically takes 2 to 3 weeks, and we will keep you updated on your status throughout.

What is the team size and tech stack?

We are a small but growing team, and we are looking for someone who can help us scale our infrastructure and operations. Our tech stack includes a range of tools and technologies, including AWS, GCP, Azure, Docker, Kubernetes, Jenkins, GitLab CI/CD, Ansible and Terraform.