Hot Config Reload Without Restarts: PostgreSQL LISTEN/NOTIFY at 738K RPS
Every API gateway I have worked with has the same problem. You change the config, you restart the process. Connections drop. Requests fail. At low traffic this is fine. At 700K+ requests per second, "fine" stops being an option.
I built a gateway in Go that does not restart for config changes. Config changes propagate to all running instances in under 100ms, zero dropped connections, zero downtime. The trick is PostgreSQL LISTEN/NOTIFY combined with atomic pointer swaps.
The restart problem at scale
Kong reloads nginx workers when you change a route. Envoy does hot restarts with a parent-child process handoff. Nginx needs a full reload signal. These are good systems. But restarts are still restarts. There is always a window where the old config is gone and the new config is not fully loaded.
At low RPS, that window is invisible. At hundreds of thousands of requests per second, a 50ms restart window means tens of thousands of requests hitting a process that is shutting down or still booting. Some fraction of traffic will always land in the gap.
I wanted to remove that gap entirely.
PostgreSQL as the notification bus
The gateway config lives in PostgreSQL. Routes, rate limit rules, auth policies, upstream targets. When an admin saves a change, it writes to the database. Standard stuff.
PostgreSQL has a built-in pub/sub mechanism called LISTEN/NOTIFY. Any connected client can listen on a named channel. When something writes a NOTIFY, every listener gets the message. Built into the database. No extra infrastructure.
I added a trigger on the config tables. When a row changes, the trigger fires a NOTIFY on the "config_changed" channel. Every gateway instance listens on that channel via the pgx driver. When the notification arrives, the instance rebuilds its routing table.
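The trigger itself is small. A sketch of what it looks like, assuming a routes table; the function and channel names here are illustrative, only the "config_changed" channel name comes from the setup above:

```sql
-- Fire a NOTIFY on config_changed whenever a config row changes.
-- The payload carries the table name so listeners know what moved.
CREATE OR REPLACE FUNCTION notify_config_changed() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('config_changed', TG_TABLE_NAME);
  RETURN COALESCE(NEW, OLD);  -- NEW is NULL on DELETE
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER routes_config_changed
AFTER INSERT OR UPDATE OR DELETE ON routes
FOR EACH ROW EXECUTE FUNCTION notify_config_changed();
```

The same trigger function attaches to every config table, so one listener channel covers routes, rate limits, and auth policies.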
The full path: admin saves config, trigger fires NOTIFY, all instances receive it, each reads new config, builds a new handler, swaps it in. Under 100ms from save to all instances running the new config.
Why atomic pointer swap matters
Rebuilding the handler is the easy part. Swapping it in without breaking anything is the hard part. If you put a mutex around the handler reference, every request takes that lock. At 738K RPS, mutex contention destroys your throughput.
The gateway uses Go's sync/atomic package. The current handler is stored behind an atomic pointer. The hot path does one thing: load the pointer atomically and call the handler. No lock. No contention.
When a config change comes in, the reload path builds a completely new handler in the background. New routing table, new rate limit state, new auth config. All assembled into a single struct. One atomic store swaps the pointer. The old handler keeps serving in-flight requests. New requests get the new handler. There is no moment where the handler is nil or half-built.
In Go: var currentHandler atomic.Pointer[http.Handler]. The serve function calls currentHandler.Load() on every request. The reload function calls currentHandler.Store(newHandler) after building the new one. One atomic load per request. No mutex anywhere on the hot path.
The numbers
With auth middleware and rate limiting active, the gateway handles 738K requests per second on a 16-core machine. The auth check and rate limit lookup together add less than a millisecond of overhead per request. In direct proxy mode, with no middleware, it hits 1.2M RPS on the same hardware.
Config changes propagate to all 12 instances in under 100ms from the moment the database write commits. Median 47ms, p99 89ms. The variance comes from instance load at notification time and the config read duration.
During a config reload, request latency does not change. Not at p50, not at p99. The atomic swap is invisible to the hot path.
What went wrong first
The first version had race conditions. I was swapping the handler with a regular pointer assignment. Worked fine at low concurrency. Under load, the Go race detector caught it immediately. A request would read a torn pointer. The fix was obvious: use sync/atomic. It took me longer than I would like to admit to realise that pointer assignment in Go is not atomic by default.
The second problem was bursts. Config changes rarely arrive alone. An admin changes five routes in quick succession. Each change fires a NOTIFY. Five concurrent rebuilds start, each reading the database at a slightly different point. Wasted CPU and memory allocation spikes.
The fix: when a notification arrives, start a 50ms timer. If another notification arrives before it fires, reset the timer. When it finally fires, do one rebuild with the latest state. Rapid changes collapse into a single rebuild. The cost is 50ms extra propagation in the burst case. At our scale, that is nothing.
Third was connection drops. If the PostgreSQL connection carrying the LISTEN dies, the gateway stops getting notifications. It keeps serving with stale config. I added a reconnect loop with exponential backoff and a full config resync on every reconnect. Never trust that nothing changed while disconnected.
Why not etcd or Redis pub/sub
The config already lives in PostgreSQL. The management API writes to it. The audit log lives there too. Adding Redis or etcd means another thing to run, monitor, and keep in sync. LISTEN/NOTIFY is not as fast as Redis pub/sub. It does not need to be. The notification is just a trigger. The data read takes a few milliseconds. For config reload, that is fast enough.
What I would change
Config versioning from day one. A monotonic version number on every change. Each instance tracks which version it runs. I added this later. Harder to retrofit than to build in from the start.
And a dry-run mode. Build the new handler, run synthetic requests against it, swap only if they pass. Right now a bad config goes live immediately. New requests hit it until someone reverts.
The point
Hot config reload is not a feature. It is an architecture decision you make early. If your gateway restarts for config changes, you have a ceiling on how much traffic you can handle before restarts become visible to users. PostgreSQL LISTEN/NOTIFY and atomic pointer swaps are not glamorous. But together they let you change routing rules, rate limits, and auth policies across a fleet in under 100ms, zero dropped requests, at 738K RPS. Boring infrastructure work that keeps things running.