Detecting Connection Pool Saturation
This guide is part of Connection Pool Observability, and it focuses on the single hardest signal to read correctly: when a connection pool has run out of headroom. Saturation is not the same as a slow database, an undersized pool, or a connection leak — yet all three present with similar symptoms on a dashboard, and operators routinely misdiagnose one as another. This document defines saturation precisely, separates it from its lookalikes, and maps the exact metrics that prove it across HikariCP, PgBouncer, Go’s database/sql, and PostgreSQL itself.
A connection pool is saturated when demand for connections exceeds supply for a sustained interval: every connection is checked out, and new acquisition requests must queue. The pool is no longer a buffer — it is a bottleneck. The defining quantitative signature is simple: pending acquisitions are greater than zero while active connections equal the configured maximum. Everything else in saturation detection is about confirming that signature, measuring how deep it runs, and predicting how close a healthy pool is to crossing into it.
Key operational takeaways:
- Saturation is proven by two metrics together: pending > 0 AND active == max. Either one alone is ambiguous.
- Acquisition wait-time percentiles (p95/p99) are the leading indicator; timeout error counts are the lagging indicator that confirms damage already done.
- Little’s Law gives capacity headroom: required connections = arrival_rate × mean_hold_time. Compare against maximumPoolSize to quantify margin.
- Every stack exposes the same physics under different names: HikariCP PendingThreads, PgBouncer cl_waiting/maxwait, Go WaitCount/WaitDuration, PostgreSQL wait_event on a client backend.
- A pool can be 100% utilized without being saturated: the knee is where wait time begins climbing non-linearly, not where utilization hits 1.0.
Defining Saturation Precisely
Utilization alone does not define saturation. A pool that sits at 100% active connections but services every request instantly — because connections are returned faster than new requests arrive — is fully utilized and perfectly healthy. Saturation begins only when requests must wait for a connection that is not yet available. That waiting is the phenomenon; utilization is merely a precondition for it.
The three-part signature of true saturation:
| Signal | Meaning | Metric example |
|---|---|---|
| Active == max | No spare capacity remains | hikaricp_connections_active == hikaricp_connections_max |
| Pending > 0 | Requests are queued, waiting | hikaricp_connections_pending > 0 |
| Wait percentile rising | Queue depth is translating into latency | acquire_p99 trending toward connectionTimeout |
When all three hold simultaneously for more than a few seconds, the pool is saturated. The third signal matters because a momentary spike of one pending thread that clears in 5 ms is statistical noise, not saturation. Saturation is sustained queueing that materializes as user-visible acquisition latency.
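The "sustained, not momentary" part of the signature can be sketched in Go. This is a minimal illustration, not any pool's API: the PoolSample type, its fields, and the SustainedSaturation function are hypothetical names, and the sketch covers only the first two signals (active == max, pending > 0) plus the duration test. It assumes samples arrive in time order.

```go
package main

import (
	"fmt"
	"time"
)

// PoolSample is a hypothetical snapshot of pool gauges. Offset is the time
// elapsed since the first sample in the series.
type PoolSample struct {
	Offset  time.Duration
	Active  int
	Max     int
	Pending int
}

// hasSignature reports whether one sample shows a full pool with a queue.
func hasSignature(s PoolSample) bool {
	return s.Max > 0 && s.Active == s.Max && s.Pending > 0
}

// SustainedSaturation reports whether the unbroken run of signature samples
// ending at the latest sample spans at least minHold.
func SustainedSaturation(samples []PoolSample, minHold time.Duration) bool {
	n := len(samples)
	if n == 0 || !hasSignature(samples[n-1]) {
		return false
	}
	start := samples[n-1].Offset
	for i := n - 1; i >= 0 && hasSignature(samples[i]); i-- {
		start = samples[i].Offset
	}
	return samples[n-1].Offset-start >= minHold
}

func main() {
	blip := []PoolSample{{0, 20, 20, 1}}
	run := []PoolSample{
		{0, 20, 20, 3},
		{5 * time.Second, 20, 20, 7},
		{10 * time.Second, 20, 20, 9},
	}
	fmt.Println("single blip:", SustainedSaturation(blip, 5*time.Second))
	fmt.Println("ten-second run:", SustainedSaturation(run, 5*time.Second))
}
```

A single sample with one pending thread is rejected, while a run that holds the signature across several samples is accepted. The 5-second hold is an arbitrary choice for the demonstration.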
The utilization-versus-wait relationship is non-linear, and understanding the shape of that curve is the core skill of saturation detection. Below the knee, adding load barely moves wait time. Past the knee, each additional unit of load produces a disproportionate jump in wait time — this is standard queueing behavior (the M/M/c response curve), and the pool’s behavior tracks it closely.
The practical consequence: an alert keyed purely on “utilization == 100%” fires constantly on healthy bursty workloads and is ignored. An alert keyed on the knee — sustained pending acquisitions plus rising wait percentiles — fires only when there is real degradation. Designing those thresholds is the subject of the companion guide, Alerting on Connection Pool Saturation.
Distinguishing Saturation From Slow Queries and Leaks
The three conditions below all drive active connections toward the maximum and can all generate acquisition timeouts. They demand opposite remediations, so misclassification is expensive: adding connections to a leak makes the leak burn through capacity faster, and adding connections to a slow-query problem just gives the database more concurrent work to fall behind on.
| Condition | What is happening | Distinguishing signal | Correct remediation |
|---|---|---|---|
| True saturation | Real demand exceeds pool supply | High usage count (high checkout churn), short mean hold time, pending climbs and clears with traffic | Raise maximumPoolSize or add a proxy tier; capacity is the constraint |
| Slow queries | Connections held longer because queries run long | Mean checkout/hold time rising, DB-side query latency rising, checkout count flat | Fix the query/index; pool size is a symptom, not the cause |
| Connection leak | Connections checked out and never returned | Active connections rise monotonically and never fall, even when traffic drops to zero | Find and close the leak; restart restores capacity temporarily |
The decisive discriminator is hold time versus churn. Saturation has high churn (many short checkouts) and short hold times — the pool is busy, not stuck. Slow queries show rising hold time with flat churn. A leak shows active connections that ratchet upward and never recede; the giveaway is that utilization stays high during a traffic trough when it should have drained. HikariCP’s leakDetectionThreshold exists precisely to catch the third case by logging the stack trace of any connection held longer than the threshold.
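The hold-time-versus-churn test can be expressed as a small classifier. The sketch below is illustrative only: the Window type, the Classify function, the 1.5x ratios, and the 50% trough cutoff are all assumptions chosen for the example, not thresholds from HikariCP or any other pool.

```go
package main

import "fmt"

// Window summarises one observation window. All fields are hypothetical
// inputs that a metrics pipeline would have to supply.
type Window struct {
	CheckoutsPerSec float64 // churn: connections checked out per second
	MeanHoldMs      float64 // mean time a connection stays checked out
	ActiveAtTrough  int     // active connections when traffic was lowest
	Max             int     // configured pool ceiling
}

// Classify compares a baseline window with the current one and names the
// most likely condition. The ratios used here are illustrative assumptions.
func Classify(base, cur Window) string {
	switch {
	case cur.Max > 0 && cur.ActiveAtTrough*2 >= cur.Max:
		// Connections stayed checked out even when traffic was at its lowest.
		return "leak"
	case cur.MeanHoldMs >= 1.5*base.MeanHoldMs && cur.CheckoutsPerSec < 1.5*base.CheckoutsPerSec:
		// Hold time grew while churn stayed roughly flat.
		return "slow queries"
	case cur.CheckoutsPerSec >= 1.5*base.CheckoutsPerSec:
		// Churn grew while hold time stayed short.
		return "saturation"
	default:
		return "inconclusive"
	}
}

func main() {
	base := Window{CheckoutsPerSec: 800, MeanHoldMs: 12, ActiveAtTrough: 2, Max: 20}
	fmt.Println(Classify(base, Window{1600, 12, 3, 20}))
	fmt.Println(Classify(base, Window{800, 30, 4, 20}))
	fmt.Println(Classify(base, Window{100, 12, 19, 20}))
}
```

The leak check runs first because a pool that stays busy through a traffic trough makes the other two readings unreliable.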
Leading vs Lagging Indicators
Saturation detection fails most often because operators watch the wrong end of the causal chain. Timeout errors and HTTP 503s are lagging indicators — by the time they appear, requests have already been refused and users are already affected. The job of saturation detection is to read the leading indicators that move minutes earlier.
| Indicator | Type | Timing | Why |
|---|---|---|---|
| Acquisition wait p95/p99 | Leading | Minutes early | Climbs as the queue forms, before any timeout |
| Pending acquisition count | Leading | Seconds–minutes early | Non-zero the instant supply is short |
| Active/max utilization ratio | Leading | Minutes early | Approaches 1.0 before queueing begins |
| Acquisition timeout count | Lagging | Concurrent with damage | Only increments after a request waited past connectionTimeout |
| Application error rate / 503s | Lagging | After damage | Downstream effect of refused acquisitions |
The wait-time percentile is the most valuable single signal. It is continuous (it moves before any threshold is crossed), it is dimensionally comparable to connectionTimeout (you can express the alert as “p99 wait reached 70% of connectionTimeout”), and it captures tail behavior that averages hide. A mean wait of 5 ms with a p99 of 4,000 ms means a meaningful fraction of requests are nearly timing out — invisible in the average. Pairing percentile wait with pending count gives both the depth (percentile) and breadth (count) of the queue. See Connection Acquisition Timeout Strategies for how connectionTimeout itself should be set so these percentiles remain meaningful.
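A check of the form "p99 wait reached 70% of connectionTimeout" might look like the following sketch. It assumes raw wait samples are available in memory and uses a nearest-rank percentile; a production pipeline would more likely read percentiles from a histogram. The function names and the synthetic data are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Percentile returns the nearest-rank percentile (0 < p <= 100) of waits,
// or 0 when there are no samples.
func Percentile(waits []time.Duration, p float64) time.Duration {
	if len(waits) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), waits...) // leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// NearTimeout reports whether the p99 wait has reached the given fraction of
// the acquisition timeout.
func NearTimeout(waits []time.Duration, timeout time.Duration, fraction float64) bool {
	return float64(Percentile(waits, 99)) >= fraction*float64(timeout)
}

// sampleWaits builds synthetic data: 98 fast acquisitions and 2 slow ones.
func sampleWaits() []time.Duration {
	waits := make([]time.Duration, 0, 100)
	for i := 0; i < 98; i++ {
		waits = append(waits, 5*time.Millisecond)
	}
	return append(waits, 4*time.Second, 4*time.Second)
}

func main() {
	w := sampleWaits()
	fmt.Println("p50:", Percentile(w, 50), "p99:", Percentile(w, 99))
	fmt.Println("near timeout:", NearTimeout(w, 5*time.Second, 0.7))
}
```

In the synthetic data the median stays at a few milliseconds while the p99 sits near the timeout, which is the tail behavior an average would hide.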
Little’s Law and Capacity Headroom
To detect saturation before it happens, you need to know how much margin the pool has. Little’s Law gives it directly. In steady state, the average number of connections concurrently in use equals the request arrival rate multiplied by the mean time each request holds a connection:
required_connections = arrival_rate (req/s) × mean_hold_time (s)
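The mean hold time is the full checkout duration, measured from the moment a connection is handed out until it is returned: query execution, result processing, and any application work done while the connection is held, not just query time. Time spent waiting to acquire a connection is excluded, because no connection is occupied during the wait. Headroom is the gap between that requirement and the configured ceiling: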
headroom = maximumPoolSize − (arrival_rate × mean_hold_time)
A worked example: a service receives 800 requests/second, each holding a connection for a mean of 12 ms (0.012 s). Required connections = 800 × 0.012 = 9.6. With maximumPoolSize = 20, headroom is 20 − 9.6 = 10.4 connections, which is comfortable. Now suppose a downstream dependency slows and mean hold time rises to 30 ms: required becomes 800 × 0.030 = 24 connections against a ceiling of 20. The pool is now mathematically saturated even though neither its configuration nor the arrival rate changed, and pending will climb. This is why a slow query and a traffic spike produce identical pool symptoms: both inflate one of the two terms in the same product.
The operational use of Little’s Law is continuous: compute required connections from live metrics (arrival rate and mean hold time are both readily measured) and alert when it crosses, say, 85% of maximumPoolSize. That is a leading capacity signal that fires while there is still headroom to act — long before pending climbs.
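The Little’s Law arithmetic and the 85% capacity check can be sketched as three small functions. The function names are hypothetical, and the inputs are the figures from the worked example; in practice arrival rate and mean hold time would come from live metrics.

```go
package main

import "fmt"

// RequiredConnections applies Little's Law: arrival rate (req/s) multiplied by
// mean hold time (s) gives the mean number of connections in use.
func RequiredConnections(arrivalRate, meanHoldSeconds float64) float64 {
	return arrivalRate * meanHoldSeconds
}

// Headroom is the ceiling minus the requirement; a negative value means the
// pool cannot keep up in steady state.
func Headroom(maxPoolSize int, arrivalRate, meanHoldSeconds float64) float64 {
	return float64(maxPoolSize) - RequiredConnections(arrivalRate, meanHoldSeconds)
}

// OverBudget reports whether the requirement has reached the given fraction
// of the ceiling, for example 0.85.
func OverBudget(maxPoolSize int, arrivalRate, meanHoldSeconds, fraction float64) bool {
	return RequiredConnections(arrivalRate, meanHoldSeconds) >= fraction*float64(maxPoolSize)
}

func main() {
	fmt.Printf("healthy:  headroom %.1f, over budget %v\n",
		Headroom(20, 800, 0.012), OverBudget(20, 800, 0.012, 0.85))
	fmt.Printf("degraded: headroom %.1f, over budget %v\n",
		Headroom(20, 800, 0.030), OverBudget(20, 800, 0.030, 0.85))
}
```

With a 12 ms hold time the requirement of 9.6 is well under 85% of 20 (17 connections); at 30 ms the requirement of 24 is past both the budget and the ceiling.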
A second use is variance budgeting. Little’s Law gives the mean requirement, but pools fail at the tail, not the mean. Even when mean required connections sit well below the ceiling, a transient surge in arrival rate or a brief stall in hold time can push instantaneous demand past maximumPoolSize. The safe ceiling therefore must exceed the mean requirement by a margin proportional to demand variance — a bursty service needs more headroom than a steady one at the same average load. This is why a pool sized exactly to its Little’s Law mean still saturates intermittently: the mean was satisfied while the peaks were not.
Sampling Cadence and Aggregation Pitfalls
Saturation lives at sub-second timescales, but most metric pipelines sample every 15, 30, or 60 seconds. A pool can fill, queue 40 requests, and drain entirely between two scrapes, leaving no trace in a gauge that is read instantaneously. This is the single most common reason saturation goes undetected: the dashboard shows a healthy 60% utilization because the scrape happened to land in a trough.
Three defenses close the gap. First, prefer counters and timers over instantaneous gauges for the load-bearing signals. HikariCP’s acquire timer and Go’s WaitCount/WaitDuration are cumulative — they record every blocked acquisition that occurred between scrapes, so a burst that came and went still increments them. An instantaneous pending gauge can miss the same burst entirely. Second, watch maxima, not averages, when a gauge is the only option: a 1-minute average utilization of 60% is consistent with several 100% spikes, and the average hides exactly the events you care about. Where the pool exposes a high-water mark (PgBouncer maxwait is one), trust it over the point reading. Third, align alert windows to scrape intervals — an alert window shorter than a few scrape intervals cannot reliably observe a sustained condition and will produce gaps.
| Signal type | Captures between-scrape bursts? | Use for |
|---|---|---|
| Cumulative counter (WaitCount) | Yes: every event is recorded | Detecting that saturation occurred at all |
| Timer/histogram (acquire latency) | Yes: distribution accumulates | Measuring wait-time percentiles |
| High-water gauge (maxwait) | Partly: peak since last reset | Catching the worst case in a window |
| Instantaneous gauge (pending) | No: only the scrape instant | Live dashboards, not alerting alone |
The practical rule: build alerts on counters and timers, and use instantaneous gauges only to corroborate. A pool that shows zero pending at every scrape but a steadily climbing WaitCount is saturating in the gaps — and the counter is telling the truth.
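The "saturating in the gaps" case can be detected by comparing counter deltas with gauge readings. The sketch below is an illustration under assumptions: the Scrape type and both function names are hypothetical, and treating a counter that went backwards as a process restart is a common convention rather than something this guide specifies.

```go
package main

import "fmt"

// Scrape holds one reading of a cumulative blocked-acquisition counter and an
// instantaneous pending gauge. The field names are hypothetical.
type Scrape struct {
	WaitCount int64 // cumulative; assumed to reset only on process restart
	Pending   int   // instantaneous value at scrape time
}

// BlockedSince returns how many acquisitions blocked between two scrapes.
// A counter that went backwards is treated as a restart, so the current
// value is taken as the delta.
func BlockedSince(prev, cur Scrape) int64 {
	if cur.WaitCount < prev.WaitCount {
		return cur.WaitCount
	}
	return cur.WaitCount - prev.WaitCount
}

// HiddenBurst reports queueing that the gauge missed: the counter advanced
// although pending read zero at both scrapes.
func HiddenBurst(prev, cur Scrape) bool {
	return BlockedSince(prev, cur) > 0 && prev.Pending == 0 && cur.Pending == 0
}

func main() {
	prev := Scrape{WaitCount: 120, Pending: 0}
	cur := Scrape{WaitCount: 160, Pending: 0}
	fmt.Println("blocked between scrapes:", BlockedSince(prev, cur))
	fmt.Println("hidden burst:", HiddenBurst(prev, cur))
}
```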
Saturation Signals Per Stack
Every pool implements the same queueing physics, but each exposes the signature under different metric names. Knowing the exact names is what turns the theory above into an actionable dashboard.
HikariCP (Java)
HikariCP exposes the canonical four gauges via Micrometer or JMX. Saturation is hikaricp_connections_pending > 0 while hikaricp_connections_active == hikaricp_connections_max. The hikaricp_connections_acquire timer carries the wait-time percentiles. In a thread dump, threads parked in com.zaxxer.hikari.pool.HikariPool.getConnection confirm it. Wiring these up is covered in Exposing HikariCP Metrics with Micrometer and Prometheus.
PgBouncer
PgBouncer’s SHOW POOLS is the saturation oracle. The cl_waiting column counts client connections waiting for a server connection; this is pending, directly. The maxwait column reports how long the oldest queued client has been waiting, in whole seconds, and maxwait_us holds the microsecond part of that same wait; any non-zero maxwait that persists is saturation at the proxy tier. Because PgBouncer multiplexes many clients onto few server connections, its saturation point is governed by default_pool_size, not the application’s pool. Reading and exporting these is detailed in PgBouncer Metrics Monitoring.
| PgBouncer column | Saturation meaning |
|---|---|
| cl_active | Clients currently using a server connection |
| cl_waiting | Clients queued for a server connection (pending) |
| sv_active | Server connections in use |
| sv_idle | Server connections available; zero plus cl_waiting > 0 is saturation |
| maxwait | Longest current client wait (seconds), the wait-time signal |
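Once SHOW POOLS rows have been read into a struct, the proxy-tier checks reduce to a few comparisons. The sketch below assumes the rows were already fetched by some other means (no admin-console query is shown); the PoolRow type and both function names are hypothetical, with fields named after the columns in the table above.

```go
package main

import "fmt"

// PoolRow mirrors a subset of the SHOW POOLS columns. How the row is fetched
// is outside this sketch.
type PoolRow struct {
	ClActive  int
	ClWaiting int
	SvActive  int
	SvIdle    int
	MaxWait   int // seconds
}

// ProxySaturated applies the rule from the table: no idle server connection
// while at least one client is queued.
func ProxySaturated(r PoolRow) bool {
	return r.SvIdle == 0 && r.ClWaiting > 0
}

// PersistentWait reports whether maxwait was non-zero in every one of at
// least two consecutive readings.
func PersistentWait(rows []PoolRow) bool {
	if len(rows) < 2 {
		return false
	}
	for _, r := range rows {
		if r.MaxWait == 0 {
			return false
		}
	}
	return true
}

func main() {
	readings := []PoolRow{
		{ClActive: 20, ClWaiting: 6, SvActive: 20, SvIdle: 0, MaxWait: 1},
		{ClActive: 20, ClWaiting: 9, SvActive: 20, SvIdle: 0, MaxWait: 3},
	}
	fmt.Println("saturated now:", ProxySaturated(readings[1]))
	fmt.Println("wait persists:", PersistentWait(readings))
}
```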
Go database/sql
Go’s sql.DB.Stats() exposes pool state as a struct. WaitCount is the cumulative number of acquisitions that had to block, and WaitDuration is the cumulative time spent blocked. The rate of change of WaitCount is the pending-events signal; dividing the delta of WaitDuration by the delta of WaitCount yields mean wait time per blocked acquisition. InUse == MaxOpenConnections with rising WaitCount is the Go saturation signature. Sizing the two limits that govern it is covered in Configuring SetMaxOpenConns and SetMaxIdleConns.
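s := db.Stats() // lastWaitCount and lastWaitDuration hold the previous sample's values
blocked := s.WaitCount - lastWaitCount
saturated := s.InUse == s.MaxOpenConnections && blocked > 0
// mean wait per blocked acquire since last sample (zero when nothing blocked):
var meanWait time.Duration
if blocked > 0 {
	meanWait = (s.WaitDuration - lastWaitDuration) / time.Duration(blocked)
}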
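A self-contained variant of the fragment above keeps the previous counters in a small type so each observation works on deltas. This is a sketch: WaitSampler, Observe, and the stats helper are hypothetical names, while sql.DBStats and its fields come from the standard library. The helper builds stats by hand so the example runs without a database; real code would pass db.Stats().

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// WaitSampler remembers the previous cumulative counters so each call to
// Observe can work on deltas.
type WaitSampler struct {
	lastCount    int64
	lastDuration time.Duration
}

// Observe reports whether the pool was saturated since the previous call and
// the mean wait per blocked acquisition in that interval.
func (w *WaitSampler) Observe(s sql.DBStats) (saturated bool, meanWait time.Duration) {
	blocked := s.WaitCount - w.lastCount
	waited := s.WaitDuration - w.lastDuration
	w.lastCount, w.lastDuration = s.WaitCount, s.WaitDuration
	if blocked > 0 {
		meanWait = waited / time.Duration(blocked)
	}
	// MaxOpenConnections == 0 means "unlimited", which cannot saturate.
	saturated = s.MaxOpenConnections > 0 && s.InUse == s.MaxOpenConnections && blocked > 0
	return saturated, meanWait
}

// stats builds a DBStats value by hand, standing in for db.Stats().
func stats(maxOpen, inUse int, waitCount int64, waitDuration time.Duration) sql.DBStats {
	return sql.DBStats{
		MaxOpenConnections: maxOpen,
		InUse:              inUse,
		WaitCount:          waitCount,
		WaitDuration:       waitDuration,
	}
}

func main() {
	var w WaitSampler
	w.Observe(stats(10, 4, 0, 0)) // first sample establishes the baseline
	sat, mean := w.Observe(stats(10, 10, 5, 250*time.Millisecond))
	fmt.Println("saturated:", sat, "mean wait:", mean)
}
```

Guarding the division on blocked > 0 avoids both a divide-by-zero and the skew that a "+1" in the denominator would introduce.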
PostgreSQL (pg_stat_activity)
The database itself reveals saturation from below. An application thread blocked acquiring a pooled connection does not appear here (its request never reached the server), but pg_stat_activity shows whether saturation is caused by connections that are stuck rather than absent. Backends in state = 'idle in transaction' with a long state_change age are the database-side fingerprint of a leak or an over-long transaction starving the pool:
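-- Count backends per state and wait event; longest is the oldest state change in each group.
SELECT state, wait_event_type, wait_event, count(*) AS backends,
       max(now() - state_change) AS longest
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY 1, 2, 3
ORDER BY backends DESC;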
A pile-up of idle in transaction backends means connections are alive but unusable — distinguish this from genuine demand-driven saturation, which shows active backends churning quickly.
Saturation Propagation and Cascade Patterns
Saturation rarely stays contained in one pool. Once acquisition wait time rises, request-handling threads block waiting for connections, the application’s thread pool fills, and upstream load balancers begin timing out and retrying — which adds more load to an already-starved pool. This is the timeout-storm cascade, and detecting it early at the connection-pool layer is far cheaper than diagnosing it after it has spread to the edge.
Tiered architectures saturate in a specific order, and reading the order tells you where the true bottleneck is:
| Tier | First saturation signal | What it implies |
|---|---|---|
| Application pool (HikariCP / Go) | pending > 0, acquire p99 rising | Pool ceiling or downstream slowness |
| Proxy (PgBouncer / RDS Proxy) | cl_waiting > 0, maxwait rising | default_pool_size or backend max_connections limit |
| Database backend | idle in transaction pile-up, lock waits | Long transactions or contention, not capacity |
If the proxy saturates before the application pool, the constraint is the proxy’s server-side pool, not the app’s maximumPoolSize — adding application connections then makes things worse by deepening the proxy queue. If the database shows lock waits while neither pool is queueing, the problem is contention, and pool metrics are a red herring. Detecting saturation correctly means reading these tiers together, not in isolation; a single-tier dashboard will routinely point at the wrong layer.
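Reading the tiers together can be reduced to a small decision function. The sketch below is a deliberate simplification: it looks at concurrent readings rather than the order in which each tier began queueing, and the Tiers type, its fields, and the Bottleneck function are hypothetical names.

```go
package main

import "fmt"

// Tiers holds one concurrent reading from each layer. The fields are
// hypothetical and would be fed from the metrics described above.
type Tiers struct {
	AppPending int // application pool pending acquisitions
	ClWaiting  int // PgBouncer cl_waiting
	LockWaits  int // backends waiting on locks in pg_stat_activity
}

// Bottleneck names the layer most likely to be the constraint. A queue at the
// proxy takes precedence, because adding application connections would only
// deepen it.
func Bottleneck(t Tiers) string {
	switch {
	case t.ClWaiting > 0:
		return "proxy"
	case t.AppPending > 0:
		return "application pool"
	case t.LockWaits > 0:
		return "database contention"
	default:
		return "none"
	}
}

func main() {
	fmt.Println(Bottleneck(Tiers{AppPending: 4, ClWaiting: 12}))
	fmt.Println(Bottleneck(Tiers{AppPending: 4}))
	fmt.Println(Bottleneck(Tiers{LockWaits: 3}))
}
```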
The cascade also has a recovery asymmetry worth monitoring: pools saturate fast and drain slow. When load subsides, pending should fall to zero within seconds. If it lingers — connections staying checked out after demand drops — that is the leak fingerprint reasserting itself, and the saturation was never purely demand-driven. Watching the drain curve after a burst is one of the most reliable cheap discriminators available.
From Detection to Action
Detecting saturation is only useful if it triggers the right response, and the right response depends on which lookalike you confirmed. Genuine saturation with healthy hold times means raising maximumPoolSize (within database max_connections limits) or introducing a multiplexing proxy tier. Saturation driven by slow queries means fixing the queries — see how pool tuning interacts with query behavior in HikariCP Configuration Deep Dive. Saturation that is really proxy starvation means tuning default_pool_size and pool_mode, covered in PgBouncer Transaction vs Statement Pooling. And once you can detect it reliably, the next step is to fire a page at the right moment — neither too early nor too late — which is exactly what Alerting on Connection Pool Saturation addresses.
Related
- Connection Pool Observability: the parent overview of pool metrics, dashboards, and alerting.
- Alerting on Connection Pool Saturation: turning these signals into thresholds and PromQL alert rules.
- PgBouncer Metrics Monitoring: reading cl_waiting, maxwait, and exporting proxy stats.
- Prometheus and Grafana Pool Metrics: collecting and visualizing the gauges that prove saturation.
- Connection Acquisition Timeout Strategies: setting connectionTimeout so wait percentiles stay meaningful.