Move Less, Move Faster: Speeding Up Citus Cluster Scaling | POSETTE: An Event for Postgres 2026

Name: Move Less, Move Faster: Speeding Up Citus Cluster Scaling | POSETTE: An Event for Postgres 2026
Uploaded: 2026-06-18T14:43:08+00:00
Description: Muhammad Usama explains how Citus distributes data across a Postgres cluster and why scaling events can be slow due to safe rebalancing constraints. He then...

Jun 18, 2026 by Muhammad Usama

Muhammad Usama explains how Citus distributes data across a Postgres cluster and why scaling events can be slow due to safe rebalancing constraints. He then walks through concrete improvements—more parallel shard moves and snapshot-based node addition—to reduce data movement and speed up scale-out/scale-in operations.

Overview

This talk focuses on making elastic scaling of distributed PostgreSQL (Citus) faster by reducing and parallelizing the work required during rebalancing.

Why scaling a distributed Postgres cluster can be slow

Scaling isn’t just about adding a VM/node; the limiting factor is often how long it takes to rebalance data safely.
Rebalancing is slow primarily because:
- Data movement is expensive.
- Concurrency is constrained by safety requirements and resource limits.

Minimal mental model of Citus distribution

The talk introduces a basic model for reasoning about Citus scaling time:

Shards: partitions of distributed tables.
Placements: where shard replicas live across workers.
Roles:
- Coordinator: plans/distributes queries and orchestrates operations.
- Workers: store shard data and execute distributed work.

Step 1: Shard rebalancing improvements (more parallelism, fewer bottlenecks)

Two concrete areas highlighted for speeding up rebalancing:

Parallelising reference table copies
- Increase parallelism when copying reference tables to reduce a single-threaded bottleneck.
Fixing locking for parallel shard moves
- Adjust locking behavior so multiple shard moves can proceed safely in parallel.
Orchestrating a parallel rebalancer
- Coordinate multiple concurrent shard moves while respecting safety and resource constraints.

Step 2: Snapshot-based scaling (“move less”)

A second approach aims to reduce how much data must be copied during scaling:

Snapshot-based node addition
- A new worker starts as a clone of an existing worker.
- This can dramatically reduce the amount of data that needs to be copied during subsequent rebalancing.
Faster rebalance by deleting instead of moving
- In some scenarios, it can be faster to remove redundant data rather than moving large volumes across nodes.

Choosing a scaling strategy

The talk closes with guidance on selecting the right approach for a given scale-out/scale-in event, balancing:
- Safety constraints
- Resource limits
- Data movement cost
- Desired scaling time