Borrowing the Night: Reclaiming Idle Inference GPUs for Research
July 2, 2026
by Runway Platform Team

Production inference demand rises and falls in a daily wave. We built a capacity controller that reallocates GPUs between production and research so production tracks demand without over-provisioning. We used queueing theory to size those allocations, which gives research more GPUs overnight and users shorter queue waits all day.

In a previous post we described how we use Kueue to lend idle GPUs across research initiatives. Here we focus on reclaiming production GPUs for research during off-peak hours, then returning them before the morning peak.

Production inference demand moves in cycles. Even with a global user base, traffic concentrates around North American working hours: it climbs through the morning, peaks around 9am ET and bottoms out around 8pm ET. The trough can be less than half the peak.

That creates a familiar dilemma for every AI company. Provision for peak demand and most GPUs sit idle every night. Provision for trough demand and queues blow out every morning.

Instead, the production fleet should follow the demand cycle: grow into the morning peak, shrink into the evening trough and lend whatever it isn't using to research.

[Figure: production demand and GPUs allocated to production by hour of day (ET, 12am to 12am), compared with a static allocation provisioned for the peak. The gap between the two is yielded to research.]
Production capacity follows demand up to the morning peak and releases the difference, the shaded area, to research overnight.

We built a lightweight capacity controller to do this reallocation (internally called deckard, after the Blade Runner protagonist who relentlessly reclaims replicants). It is deliberately narrow, managing exactly two things:

  1. Workloads: the replica counts of our Kubernetes inference deployments (one bucket of GPUs per model).
  2. Compute: the size of the underlying cloud GPU node pools, so it can physically move nodes between production and research clusters.

Rather than re-deciding allocations continuously, the controller applies a small set of time windows, each with its own pre-computed schedule:

[Figure: weekly calendar (Mon to Sun, 12am to 12am) showing the windows: weekday day, peak (8:30–12:30), weekday night, weekend day, weekend night.]
Five windows, each with its own pre-computed schedule. The high-traffic peak (8:30am–12:30pm ET on weekdays) is carved out as a sub-window so it can scale up harder than the rest of the day.
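As a rough illustration of how a timestamp maps onto these windows, here is a minimal sketch. Only the 8:30am to 12:30pm weekday peak comes from the description above; the 6am/6pm day/night boundary and every window name other than peak-weekday-day are assumptions made for the example, not deckard's actual configuration.

```python
from datetime import datetime, time

# Stated in the post: the weekday peak sub-window (ET).
PEAK_START, PEAK_END = time(8, 30), time(12, 30)
# Assumed for illustration only: where "day" starts and ends.
DAY_START, DAY_END = time(6, 0), time(18, 0)


def window_for(ts: datetime) -> str:
    """Map a timestamp (already in ET) to one of the five schedule windows."""
    t = ts.time()
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    is_day = DAY_START <= t < DAY_END
    if is_weekend:
        return "weekend-day" if is_day else "weekend-night"
    if PEAK_START <= t < PEAK_END:
        return "peak-weekday-day"
    return "weekday-day" if is_day else "weekday-night"


print(window_for(datetime(2026, 7, 2, 9, 0)))   # a Thursday morning
print(window_for(datetime(2026, 7, 4, 23, 0)))  # a Saturday night
```

Because the peak check runs before the generic day check, the peak behaves as a carve-out of the weekday day window rather than a sixth, overlapping one.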

Windows are coarse on purpose, because moving GPUs between clusters is expensive: draining and tearing a node down on one side and standing it back up on the other takes 20 to 60 minutes on our cloud provider.

It's fair to ask why we make this hard for ourselves by shifting capacity between clusters. If research and production shared one multi-tenant Kubernetes cluster, transferring a GPU allocation between them would be a scheduling decision rather than a physical transfer, much as Kueue already lends idle GPUs within a single cluster, as described in the previous post.

We keep the environments in separate clusters anyway. The isolation gives us:

  • Blast-radius containment. A runaway research job, or a single overly broad ClusterRole, can't take down customer-facing inference.
  • Independent infrastructure. Separate clusters can run different Kubernetes versions, GPU drivers and networking stacks, so we can test risky infra upgrades on research without exposing production. We've needed this when a driver or networking change broke on certain versions, and when we wanted to run a bleeding-edge PyTorch build for its training performance improvements.

Because transfers are slow, the controller predicts the wave for a given window and moves capacity ahead of demand.

A Crash Course in Queueing Theory

So how many GPUs do we need to service requests? One approach is to use queueing theory. The field predates digital computing. In 1909, the Danish mathematician Agner Krarup Erlang studied telephone switchboards to figure out how many circuits were needed so callers rarely hear a busy signal.

Translating the definitions to our domain, there are four key variables:

  • arrival_rate: how fast new generations arrive (requests / second).
  • service_rate: how fast one GPU worker serves requests (requests / second). (So average runtime is 1 / service_rate.)
  • num_servers: how many GPU workers we run.
  • traffic_intensity: how "busy" the system is on average: traffic_intensity = arrival_rate / (num_servers * service_rate).

A motivating example: suppose one GPU worker completes 1 request/sec (service_rate = 1) and you run 100 GPUs for arrival_rate = 95, so traffic_intensity = 0.95. Now say you get a short burst to 110 req/sec for a couple of minutes (or a few GPUs temporarily drain/restart). Backlog accumulates at ~10 req/sec during the burst. Even after demand returns to 95, you only "catch up" at 5 req/sec (100 served − 95 arriving), so that backlog can linger and inflate tail latency far longer than the burst itself.

[Figure: arrivals (req/sec) over time for a 2-minute burst from 95 to 110 req/sec against a capacity of 100 GPUs × 1 req/sec, with the backlog (queued requests) building at 10 req/sec and draining at only 5 req/sec.]
The queue remembers a burst long after it ends. Backlog builds at 10 req/sec during the burst but drains at only 5 req/sec afterward — the recovery takes twice as long as the burst itself.
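The arithmetic in this example can be checked with a small deterministic (fluid) model that ignores randomness and just tracks arrivals minus capacity each second. This is a sketch for illustration; the function name and the 60 seconds of steady traffic before the burst are assumptions.

```python
def backlog_trace(arrivals_per_sec, capacity_per_sec):
    """Return the backlog after each second of a deterministic fluid queue."""
    backlog, trace = 0.0, []
    for rate in arrivals_per_sec:
        # The backlog grows by the excess of arrivals over capacity, never below zero.
        backlog = max(0.0, backlog + rate - capacity_per_sec)
        trace.append(backlog)
    return trace


# 60 s of steady traffic, a 120 s burst to 110 req/sec, then back to 95 req/sec.
burst_start, burst_length = 60, 120
arrivals = [95] * burst_start + [110] * burst_length + [95] * 600
trace = backlog_trace(arrivals, capacity_per_sec=100)

burst_end = burst_start + burst_length
peak_backlog = max(trace)
# First second after the burst at which the queue is empty again.
drained_at = next(i for i in range(burst_end, len(trace)) if trace[i] == 0)
recovery_seconds = drained_at - burst_end + 1

print(f"peak backlog: {peak_backlog:.0f} requests")
print(f"recovery: {recovery_seconds} s after a {burst_length} s burst")
```

The burst leaves 1,200 queued requests, which take 240 seconds to clear at a net drain of 5 req/sec.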

It gets worse the closer traffic intensity gets to 1. Write ρ for traffic_intensity. Under Erlang-C assumptions, once every GPU is busy each extra queued job is ρ times as likely as the one before it, so the expected backlog is the geometric series ρ + ρ² + ρ³ + ... = ρ / (1 − ρ). As ρ approaches 1 the denominator vanishes and the backlog diverges.

By Little's Law (jobs in system = arrival rate × time in system), waiting time carries the same factor: expected wait scales like 1 / (1 − ρ) times the service time. Headroom is 1 − ρ, so every halving of it doubles the wait: roughly 10× at 90% utilization, 20× at 95%, 50× at 98%.

[Figure: expected wait versus utilization ρ = traffic_intensity, with axis marks at 0, 0.5, 0.85 and 1.0. Annotations: "where you want to run", "0.90: wait ≈ 10× service time" and "ρ = 1: the system saturates".]
Wait time versus utilization. Past ~85% the curve goes nearly vertical, so a tiny demand spike turns into a huge queue. The headroom to the left of the cliff keeps tail latency bounded; it is not wasted.
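For readers who want to reproduce the shape of this curve, here is a minimal sketch of the textbook Erlang-C calculation, assuming Poisson arrivals and exponentially distributed service times (the M/M/c model). The function names are illustrative and are not taken from deckard.

```python
import math


def erlang_c(num_servers: int, offered_load: float) -> float:
    """Probability that an arriving request must queue in an M/M/c system."""
    if offered_load >= num_servers:
        return 1.0  # saturated: every arrival waits
    # The Erlang-B recursion avoids computing large factorials directly.
    blocking = 1.0
    for k in range(1, num_servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / num_servers
    return blocking / (1.0 - rho * (1.0 - blocking))


def mean_time_in_system(arrival_rate, service_rate, num_servers):
    """Expected queueing delay plus service time, in seconds."""
    offered_load = arrival_rate / service_rate
    if offered_load >= num_servers:
        return math.inf
    queue_wait = erlang_c(num_servers, offered_load) / (
        num_servers * service_rate - arrival_rate
    )
    return queue_wait + 1.0 / service_rate


# Single worker with a 1-second service time: the 1 / (1 - rho) blow-up.
for rho in (0.90, 0.95, 0.98):
    multiple = mean_time_in_system(rho, 1.0, 1)
    print(f"rho={rho:.2f}: time in system is about {multiple:.0f}x service time")

# The same 95% utilization spread over a pool of 100 workers.
print(f"100 workers at rho=0.95: {mean_time_in_system(95, 1.0, 100):.2f} s")
```

With a single worker this reproduces the roughly 10×, 20× and 50× multiples quoted above. The last line shows that a large pool at the same utilization waits far less, which is the pooling effect discussed below.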

So we size capacity so that a high percentile of requests (we target the p98) stays within each queue's latency target. Once you can predict arrival_rate and measure service_rate, there are several ways to choose num_servers. We looked at three.

1. Proportional scaling. Scale GPUs linearly with traffic: double the requests, double the GPUs. It's intuitive but wrong because it holds utilization constant across queues of every size. A big queue tolerates much higher utilization than a small one for the same tail latency.

2. Square Root Staffing. A rule of thumb is to calculate num_servers ≈ offered_load + beta * sqrt(offered_load), where offered_load = arrival_rate / service_rate and beta is a small safety factor.

Why the square root? Arrivals over a short window are approximately Poisson, and a Poisson distribution's standard deviation is the square root of its mean. So the random fluctuation you need to buffer scales like sqrt(offered_load).

This also explains why large pools are more efficient than small ones. The buffer as a fraction of the mean shrinks like 1/sqrt(offered_load), so small queues need proportionally more headroom to hit the same tail-latency goal.

[Figure: safe utilization for the same tail latency versus pool size (GPUs, log scale, marked at 4, 40 and 400), with utilization marks at 0%, ~78% and 100%. Proportional scaling holds one fixed utilization for every pool; under square root staffing, about 70% is safe at 4 GPUs and above 95% is safe at 400 GPUs.]
The same tail-latency target allows higher utilization as pools grow, because the burst buffer shrinks like 1/sqrt(offered_load). A flat utilization target (proportional scaling) is wrong in both directions at once.
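The rule is short enough to try directly. In the sketch below, beta = 1.0 is an arbitrary illustrative safety factor, not the value we use in production, and the function name is hypothetical.

```python
import math


def square_root_staffing(arrival_rate: float, service_rate: float, beta: float) -> int:
    """num_servers = offered_load + beta * sqrt(offered_load), rounded up."""
    offered_load = arrival_rate / service_rate
    return math.ceil(offered_load + beta * math.sqrt(offered_load))


# One GPU serves 1 req/sec, so arrival_rate equals offered_load here.
for arrival_rate in (4, 40, 400):
    servers = square_root_staffing(arrival_rate, service_rate=1.0, beta=1.0)
    utilization = arrival_rate / servers
    print(f"load {arrival_rate:>3}: {servers:>3} GPUs, utilization {utilization:.0%}")
```

The buffer grows from 2 GPUs at a load of 4 to only 20 GPUs at a load of 400, so the implied utilization rises with pool size even though beta never changes.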

3. Marginal gain on top of Erlang-C. Square Root Staffing tells you how to size one queue, but we have many models sharing one finite reservation. The question becomes: which queue benefits most from the next GPU? Marginal gain answers it greedily: score every queue by how much one more GPU would shrink its p98 wait, hand the GPU to the biggest winner, then recompute and repeat. The effect is to pour capacity into whichever queue is closest to its latency cliff.

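[Figure: p98 wait for three queues against a latency target: model A (near its cliff), model B (some headroom) and model C (comfortable). Each bar is split into the wait shaved by one more GPU and the remaining p98 wait. The next surplus GPU goes to the biggest win, then scores are recomputed and the loop repeats.]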
One step of the greedy loop. Each queue is scored by how much p98 wait the next GPU would shave off; model A, closest to its latency cliff, wins this round. Scores are recomputed and the loop repeats until every queue clears its target.
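A minimal sketch of the greedy loop follows. It scores queues with the closed-form M/M/c tail, P(wait > t) = C · exp(−(num_servers · service_rate − arrival_rate) · t), where C is the Erlang-C probability of waiting. That closed form is a stand-in chosen to keep the example short, the three workloads and their targets are made up, and as a simplification only queues still over their target are scored.

```python
import math


def erlang_c(num_servers, offered_load):
    """Probability that an arrival has to queue (M/M/c), via the Erlang-B recursion."""
    blocking = 1.0
    for k in range(1, num_servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / num_servers
    return blocking / (1.0 - rho * (1.0 - blocking))


def p98_wait(arrival_rate, service_rate, servers):
    """98th-percentile queueing delay in seconds under M/M/c assumptions."""
    offered_load = arrival_rate / service_rate
    if offered_load >= servers:
        return math.inf
    p_wait = erlang_c(servers, offered_load)
    if p_wait <= 0.02:
        return 0.0  # fewer than 2% of requests wait at all
    return math.log(p_wait / 0.02) / (servers * service_rate - arrival_rate)


def allocate(queues, budget):
    """Hand out GPUs one at a time to the queue whose p98 wait shrinks the most."""
    # Start every queue at the smallest server count that keeps it stable.
    alloc = {name: math.floor(q["arrival_rate"] / q["service_rate"]) + 1
             for name, q in queues.items()}
    spare = budget - sum(alloc.values())
    if spare < 0:
        raise ValueError("budget cannot even keep every queue stable")
    while spare > 0:
        best, best_gain = None, 0.0
        for name, q in queues.items():
            now = p98_wait(q["arrival_rate"], q["service_rate"], alloc[name])
            if now <= q["target_s"]:
                continue  # already inside its latency target
            gain = now - p98_wait(q["arrival_rate"], q["service_rate"], alloc[name] + 1)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # every queue clears its target
        alloc[best] += 1
        spare -= 1
    return alloc


# Hypothetical workloads: rates in req/sec, targets in seconds.
queues = {
    "model-a": {"arrival_rate": 95.0, "service_rate": 1.0, "target_s": 2.0},
    "model-b": {"arrival_rate": 20.0, "service_rate": 1.0, "target_s": 2.0},
    "model-c": {"arrival_rate": 3.0, "service_rate": 0.5, "target_s": 5.0},
}
print(allocate(queues, budget=140))
```

Recomputing every score after each grant matters: the gain from one more GPU falls quickly once a queue moves away from its cliff, so the winner changes from round to round.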

Inside the allocate Command

To allocate capacity we built a simple CLI where a user or CI process can run deckard allocate. Running it turns all of this into the schedule files the controller applies — and because the output is just declarative YAML in git, the entire review is a plain git diff. For each time window it:

  1. Pulls demand. It fetches the last 7 days of per-minute SQS arrival counts from AWS CloudWatch for every queue and slices out the minutes belonging to that window. This is the observed arrival_rate.
  2. Grows each workload by marginal gain. A greedy loop repeatedly finds the workload whose queue is most urgent (biggest backlog, or worst p98 relative to its target) and gives it one more GPU's worth of replicas.
  3. Checks the SLO with a simulation. It goes beyond closed-form M/M/c with a small discrete-event replay of the queue. The simulator models the two-priority discipline (priority requests jump the line), replays minute-level arrivals across several random seeds and takes the worst case. The loop stops as soon as every workload clears its p98 target.
  4. Optionally saturates. A --saturate mode keeps spending any leftover reserved (already-paid-for) headroom after SLOs are met, always feeding the workload with the thinnest safety cushion.
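To make step 3 concrete, here is a minimal sketch of a replay-style check. It is not deckard's simulator. It assumes arrivals are spread uniformly within each minute, service times are exponential, a fixed share of requests is high priority, and the p98 is taken over all requests together; all names and parameters are hypothetical.

```python
import heapq
import random
from collections import deque


def replay_p98_wait(arrivals_per_minute, servers, mean_service_s, priority_share, seed):
    """Replay per-minute arrival counts through a two-priority queue; return p98 wait (s)."""
    rng = random.Random(seed)
    # Expand minute counts into (arrival_time, is_priority) jobs, sorted by time.
    jobs = sorted(
        (minute * 60 + rng.uniform(0, 60), rng.random() < priority_share)
        for minute, count in enumerate(arrivals_per_minute)
        for _ in range(count)
    )
    free_at = [0.0] * servers  # min-heap of times at which each server frees up
    high, low = deque(), deque()
    waits, i = [], 0
    while i < len(jobs) or high or low:
        now = heapq.heappop(free_at)  # the next server to become free
        # Enqueue everything that arrived before this server freed up.
        while i < len(jobs) and jobs[i][0] <= now:
            arrived, is_priority = jobs[i]
            (high if is_priority else low).append(arrived)
            i += 1
        if high or low:
            arrived = (high or low).popleft()  # priority requests jump the line
            start = now
        else:
            arrived = jobs[i][0]  # server sits idle until the next arrival
            start = arrived
            i += 1
        waits.append(start - arrived)
        heapq.heappush(free_at, start + rng.expovariate(1.0 / mean_service_s))
    if not waits:
        return 0.0
    waits.sort()
    return waits[min(len(waits) - 1, int(0.98 * len(waits)))]


def worst_case_p98(arrivals_per_minute, servers, mean_service_s,
                   priority_share=0.2, seeds=range(5)):
    """Run the replay under several seeds and keep the worst p98."""
    return max(
        replay_p98_wait(arrivals_per_minute, servers, mean_service_s, priority_share, s)
        for s in seeds
    )


# 30 minutes at 120 requests per minute, 1-second mean service time.
arrivals = [120] * 30
for servers in (3, 4, 6):
    print(servers, "servers -> worst p98 wait (s):",
          round(worst_case_p98(arrivals, servers, mean_service_s=1.0), 2))
```

Taking the maximum across seeds is a deliberately pessimistic choice: a schedule passes only if it clears the target under the unluckiest of the sampled runs.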

The output is one declarative file per window, which CI applies on merge and hourly:

# Generate every window's schedule from 7 days of real demand
deckard allocate

# Review what would change before anything is applied
git diff live-schedules/

# live-schedules/peak-weekday-day.yaml  (generated)
workloads:
  task-worker-2x-h100-<model>:
    prod-a: 48
    prod-b: 36
compute:
  a3-megagpu-8g:
    us-region-1:
      prod-a:
        counts: [80, 64, 0]

There are knobs for the real world, too. We can size against inflated demand for a known spike (deckard allocate --scale-up 1.1 budgets for +10%), or override a single model's expected rate. That's how we pre-scaled capacity ahead of timed launches like our Gen:48 weekend competition.

[Figure: the deckard allocate pipeline: CloudWatch SQS (7-day, per-minute) → marginal-gain loop → replay sim (p98 SLO check) → live-schedules/*.yaml → CI applies (merge + hourly) → freed GPUs go to research.]
The full pipeline: observed demand in, declarative schedules out, applied automatically, and the capacity it frees flows straight to research.

Once we've determined the minimum capacity needed to meet our SLOs, the controller pulls GPUs out of production as demand falls toward the evening trough. That freed capacity is what Kueue lends to research, and what production reclaims when the morning peak returns.

That handoff doesn't impact large training jobs because research treats the swappable nodes as preemptible. Each node that can move carries a label marking it ephemeral (runwayml.com/compute-shared-with-external-clusters=true); the rest are static, a research-only floor the controller never touches.

In Kueue, jobs on the shared queue fill the ephemeral nodes first and run there as preemptible tenants, while each team's reserved quota stays on static nodes. When production reclaims capacity, the reserved queues take their quota back, the ephemeral jobs are evicted and the node is drained (force-deleting pods so a long training job's PodDisruptionBudget can't hold the transfer hostage) before it is torn down and rebuilt on the production side. A slow transfer then costs research only an interrupted, requeued job instead of a corrupted run.

The two goals have not traded off against each other: by sizing production tightly against real demand, we run fewer GPUs in production and still give users shorter queue waits. The night we first watched production drain an entire idle superblock down to zero GPUs, handing all of it to research, was cause for a small celebration on the team.

What We've Learned

Coarse and reviewable beats continuous and clever, for now. Discrete time windows applied from CI have captured most of the available savings with far fewer failure modes than a fully autonomous controller. We'll reach for continuous rebalancing when the remaining gain clearly justifies the added risk.

One top-down controller, many clusters. When you operate across multiple clusters and clouds, a single controller that does the right thing globally beats a swarm of per-cluster autoscalers that each have only a partial view.