Connection Pool Observability
Connection pool observability is the practice of instrumenting, scraping, and alerting on the runtime state of database connection pools so that saturation, leaks, and acquisition stalls become visible before they degrade application latency. A pool that lacks telemetry fails silently: requests queue behind an exhausted pool, p99 latency climbs, and the only signal the application emits is a generic timeout exception. This overview defines the metric taxonomy every pool exposes, the extraction surfaces per stack, the pull and push pipelines that move those metrics into a time-series store, and the alert formulas that convert raw counters into actionable saturation signals.
The operational scope here is the pool boundary, not the database engine. These metrics describe how the application borrows, holds, and returns connections — active, idle, and pending counts; acquisition wait latency; connection creation time; and timeout and leak counters. Server-side metrics such as pg_stat_activity state distributions, buffer cache hit ratios, and replication lag are complementary but live on the database side of the line. The two telemetry planes correlate during incidents, but they are instrumented, scraped, and alerted on independently.
Key operational takeaways:
- Instrument four core gauges per pool (active, idle, pending, total) plus an acquisition-wait timer; everything else derives from these.
- The single most important saturation signal is sustained pending > 0: a non-empty acquisition queue means demand has exceeded maximumPoolSize.
- Alert on acquisition wait p99 relative to connectionTimeout: a p99 approaching the timeout means borrow failures are imminent.
- Utilization = active / total; alert when this exceeds 0.90 for 5 minutes, well before the pool throws "Connection is not available".
- Use pull scraping for long-lived servers and push (OTLP or Pushgateway) for short-lived jobs and serverless functions that exit before a scrape.
- Track creation time and leak-detection counters as leading indicators: rising creation latency signals backend strain; a nonzero leak counter signals a code-path that never returns connections.
Metric taxonomy & instrumentation surfaces
Every connection pool, regardless of language, exposes the same conceptual state. A pool holds a fixed-ceiling set of physical connections; at any instant each connection is either checked out (active), parked (idle), or being created. Borrow requests that find no available connection queue (pending). The four gauges below form the irreducible core of pool telemetry.
| Metric | Type | Definition | Saturation meaning |
|---|---|---|---|
| active (in-use) | Gauge | Connections currently checked out to application code | Approaches total under load |
| idle | Gauge | Connections open but not checked out | Drops to 0 before pending rises |
| pending (waiting) | Gauge | Threads blocked waiting to acquire | Any sustained value > 0 is saturation |
| total | Gauge | active + idle, capped by maximumPoolSize | Plateau at max under demand |
| acquisition wait | Histogram | Time from borrow request to grant | p99 vs connectionTimeout |
| creation time | Histogram | Time to open + validate a new connection | Rising p95 signals backend strain |
| timeout counter | Counter | Borrow attempts that exceeded connectionTimeout | Any increase is a hard failure |
| leak counter | Counter | Connections held past leakDetectionThreshold | Nonzero indicates a return bug |
The acquisition wait histogram is the highest-value instrument because it captures user-perceived impact directly. A pool can be fully utilized with zero waiters and zero harm; it is the wait time, not the utilization, that translates into request latency. Always record acquisition as a histogram or summary so percentiles are available — a mean wait time hides the tail where SLA breaches live. The relationship between wait percentiles and pool sizing is governed by the same queueing dynamics covered in Pool Architecture & Algorithm Fundamentals.
Extraction surfaces differ sharply by stack. In the JVM, HikariCP registers gauges and timers through Micrometer (hikaricp_connections_active, hikaricp_connections_pending, hikaricp_connections_acquire_seconds) and, when registerMbeans is enabled, exposes the same data over JMX MBeans (com.zaxxer.hikari:type=Pool (HikariPool-1)) for agents that scrape MBeans rather than the /actuator/prometheus endpoint. PgBouncer has no native Prometheus output; its admin console answers SHOW POOLS (cl_active, cl_waiting, sv_active, sv_idle per database/user pair) and SHOW STATS (cumulative total_xact_count and total_query_time, plus the averaged avg_wait_time), which an exporter polls and translates. Go’s database/sql exposes (*DB).Stats() returning a DBStats struct (InUse, Idle, WaitCount, WaitDuration, MaxIdleClosed) that a collector samples on each scrape. node-postgres emits no metrics by design; you listen to the pool.on('acquire'), pool.on('release'), and pool.on('error') events and read pool.totalCount, pool.idleCount, and pool.waitingCount synchronously.
| Stack | Source surface | Active / In-use | Pending / Waiting | Wait latency |
|---|---|---|---|---|
| HikariCP (JVM) | Micrometer + JMX | hikaricp_connections_active | hikaricp_connections_pending | hikaricp_connections_acquire_seconds |
| PgBouncer | SHOW POOLS / SHOW STATS | cl_active | cl_waiting | avg_wait_time |
| Go database/sql | DB.Stats() | InUse | derived from WaitCount | WaitDuration (cumulative) |
| node-postgres | pool events + counts | totalCount - idleCount | waitingCount | timed in acquire handler |
A practical hazard: not every surface gives you a ready-made wait percentile. HikariCP’s Micrometer timer is a real histogram. Go’s WaitDuration is a monotonic cumulative counter — you must rate() it against WaitCount to derive an average wait, and you cannot recover percentiles from it at all. node-postgres gives you nothing unless you wrap the acquire path with your own timer. Knowing which metrics are histograms versus counters determines whether p99 alerting is even possible for a given stack.
The instrumentation effort scales inversely with the maturity of the pool’s telemetry. HikariCP and Spring Boot give you a near-zero-code path: enabling the Micrometer registry and exposing the Actuator endpoint surfaces the full metric set automatically.
# application.yml: HikariCP exposes the full metric set with no per-metric code
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      connection-timeout: 30000        # 30s; the reference value for wait p99 alerts
      leak-detection-threshold: 60000  # 60s; a connection held longer is reported as a suspected leak
management:
  endpoints:
    web:
      exposure:
        include: prometheus            # /actuator/prometheus scrape target
  metrics:
    distribution:
      percentiles-histogram:
        hikaricp.connections.acquire: true  # emit histogram buckets for p99
Go requires an explicit collector because database/sql exposes a struct, not metrics. Register Prometheus gauges and sample DB.Stats() on each scrape via a custom Collector, or use a wrapper that does so. The cumulative WaitDuration must be exposed as a counter and rated downstream:
// Sample DB.Stats() into Prometheus gauges and counters on every scrape.
// A complete prometheus.Collector also implements Describe (not shown).
func (c *poolCollector) Collect(ch chan<- prometheus.Metric) {
	s := c.db.Stats()
	ch <- prometheus.MustNewConstMetric(c.inUse, prometheus.GaugeValue, float64(s.InUse))
	ch <- prometheus.MustNewConstMetric(c.idle, prometheus.GaugeValue, float64(s.Idle))
	ch <- prometheus.MustNewConstMetric(c.waitCount, prometheus.CounterValue, float64(s.WaitCount))
	// WaitDuration is a cumulative time.Duration; divide its rate() by the rate() of waitCount for avg wait.
	ch <- prometheus.MustNewConstMetric(c.waitMs, prometheus.CounterValue, float64(s.WaitDuration.Milliseconds()))
}
node-postgres is the most manual surface: there is no struct and no histogram, and no event marks the start of a wait, so you time the acquire path yourself by wrapping pool.connect() and feed a prom-client histogram.
// Wrap the acquire path to produce a real wait histogram for node-postgres.
const client = require('prom-client');
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from the PG* environment variables

const acquireSeconds = new client.Histogram({
  name: 'pgpool_acquire_seconds',
  help: 'pool acquire wait',
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5],
});

async function withClient(fn) {
  const end = acquireSeconds.startTimer();
  const conn = await pool.connect(); // resolves when a connection is granted
  end();                             // records wait time at grant
  try {
    return await fn(conn);
  } finally {
    conn.release();
  }
}
// Gauges read synchronously: pool.totalCount, pool.idleCount, pool.waitingCount
Operational Boundary: This section defines the pool-side metric vocabulary and where each value originates per stack. It does not cover database-engine counters (pg_stat_activity, SHOW STATUS, wait events) — those are correlated during diagnosis but instrumented through the database, not the pool, and belong to the database observability plane.
Concurrency model & metric mapping
How a metric is read depends on the runtime’s concurrency model, because the model dictates what “active” even means. In a blocking thread-per-request server, each active connection maps to a parked or running worker thread, so active is bounded by both maximumPoolSize and the servlet thread pool. In an event-loop runtime, hundreds of in-flight promises contend for a small pool, so waitingCount can vastly exceed the thread count. In Go, goroutines block on a channel inside database/sql, so WaitCount increments per blocked goroutine even though the OS thread count stays flat.
| Concurrency model | What active reflects | Where pending accrues | Primary read surface |
|---|---|---|---|
| Blocking (JVM servlet) | Worker threads holding a connection | HikariCP handoff queue | Micrometer / JMX gauge |
| Event loop (Node) | Promises mid-query | waitingCount of deferred acquires | Pool event counters |
| Goroutines (Go) | Goroutines in Query/Exec | Channel-blocked goroutines | DBStats.WaitCount delta |
| Proxy-fronted (PgBouncer) | Server-side sv_active | Client-side cl_waiting | Admin SHOW POOLS |
The proxy case is special: a PgBouncer-fronted topology has two pools in series — the application’s local pool and PgBouncer’s server pool — and saturation can occur in either. A healthy application pool that still sees high latency points to cl_waiting rising on the proxy, which is governed by default_pool_size and pooling mode rather than by the app’s maximumPoolSize. The interaction between application pools and transaction-mode multiplexing is detailed in PgBouncer Transaction vs Statement Pooling. Instrumenting only one of the two pools produces a dangerous blind spot during incidents.
The model also dictates the correct scrape cadence. Blocking JVM pools change state on the order of request latency — tens of milliseconds — so a 10–15s scrape captures sustained saturation but will miss sub-second spikes; for spike-sensitive services, rely on the acquisition histogram, which accumulates every borrow between scrapes, rather than on the instantaneous pending gauge. Event-loop and goroutine runtimes can swing waitingCount and WaitCount violently within a single scrape interval, which is precisely why the cumulative counters (WaitCount, WaitDuration) are the trustworthy signal in those stacks: they integrate every wait event regardless of when the scrape lands, whereas a point-in-time gauge can sample the quiet moment between two bursts and report false health.
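The difference between a point-in-time gauge and a cumulative counter can be shown with a small deterministic simulation. This is only a sketch: the tick timeline, burst sizes, and scrape interval are invented, and the type and function names are hypothetical.

```go
package main

import "fmt"

// tick is one slice of simulated time: the instantaneous number of waiters
// and how many new waits started during the tick.
type tick struct {
	waiting  int
	newWaits int64
}

// scrape samples the timeline every `interval` ticks. It returns the gauge
// value seen at each scrape and the counter delta since the previous scrape.
func scrape(timeline []tick, interval int) (gauges []int, deltas []int64) {
	var cumulative, last int64
	for i, t := range timeline {
		cumulative += t.newWaits
		if (i+1)%interval == 0 {
			gauges = append(gauges, t.waiting)
			deltas = append(deltas, cumulative-last)
			last = cumulative
		}
	}
	return gauges, deltas
}

// burstyTimeline has two short bursts that both end before a scrape lands.
func burstyTimeline() []tick {
	timeline := make([]tick, 10)
	timeline[2] = tick{waiting: 40, newWaits: 40}
	timeline[7] = tick{waiting: 25, newWaits: 25}
	return timeline
}

func main() {
	gauges, deltas := scrape(burstyTimeline(), 5)
	fmt.Println("gauge samples: ", gauges) // [0 0]: the pool looks idle
	fmt.Println("counter deltas:", deltas) // [40 25]: the bursts are still counted
}
```

Both scrapes read a waiting gauge of zero, yet the counter deltas account for every wait that happened in between.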
Operational Boundary: This section maps metric semantics onto concurrency runtimes so a gauge is read correctly. Tuning the underlying pool size to a given concurrency model is a sizing decision covered in the architecture fundamentals, not here.
Pull vs push telemetry pipelines
Pool metrics reach a time-series store through one of two pipeline shapes. The pull model has the metrics backend periodically scrape an HTTP endpoint exposed by the application — /actuator/prometheus for Spring Boot, a /metrics handler wired to a Go collector, or the PgBouncer exporter’s port. The push model has the application emit metrics outward over OTLP to a collector, or to a Pushgateway, on its own schedule. The choice is dictated by process lifetime, not preference.
| Dimension | Pull (Prometheus scrape) | Push (OTLP / Pushgateway) |
|---|---|---|
| Best for | Long-lived servers | Short-lived jobs, serverless, batch |
| Discovery | Service discovery targets | No target needed; client initiates |
| Failure visibility | up == 0 flags dead target | Absence is silent unless modeled |
| Cardinality control | Central, at scrape config | At SDK / collector level |
| Staleness | Bounded by scrape interval | Bounded by push interval |
Pull is the default and the correct choice for any process that outlives a scrape interval. Prometheus scraping a HikariCP /actuator/prometheus endpoint every 15 seconds yields a clean, self-describing series with a built-in liveness signal: when the target disappears, up goes to 0 and you alert on the scrape failure itself. The deep mechanics of this path — Micrometer registry wiring, scrape config, and recording rules — are the subject of Prometheus and Grafana Pool Metrics.
Push becomes mandatory when the process lifetime is shorter than the scrape interval. A serverless function or a worker process that handles one task and exits may never be alive when Prometheus comes to scrape; its pool exhaustion event would be invisible. Such workloads emit over OTLP to an always-on collector that aggregates before forwarding, or to a Pushgateway for batch jobs. The trade-off is that absence of data is no longer self-alerting: you must explicitly model the expected push frequency and alert when pushes stop arriving. PgBouncer occupies a middle ground: the exporter is itself a long-lived pull target that internally polls the admin console, converting a poll-based source into a scrapeable one.
Regardless of pipeline shape, two operational disciplines keep the data usable. First, precompute the derived saturation series as recording rules so dashboards and alerts evaluate a single cheap series instead of recomputing a quantile over raw buckets on every refresh:
# Prometheus recording rules: evaluate saturation once, reuse everywhere.
groups:
  - name: pool_saturation
    interval: 15s
    rules:
      - record: pool:utilization:ratio
        expr: hikaricp_connections_active / hikaricp_connections_max
      - record: pool:acquire_wait:p99
        expr: histogram_quantile(0.99, sum(rate(hikaricp_connections_acquire_seconds_bucket[5m])) by (le, pool))
Second, control label cardinality at the source. Pool metrics are naturally low-cardinality — one series per pool per instance — and they must stay that way. Never attach per-query, per-user, or per-request labels to pool gauges; doing so multiplies series count without diagnostic benefit and can overwhelm the time-series store. The legitimate labels are pool (the named DataSource or DB handle), instance, and job. Anything finer belongs in tracing, not pool metrics.
Operational Boundary: This section governs how metrics travel from pool to store. The internal storage format, retention, and downsampling of the time-series database are infrastructure concerns outside pool observability.
Dashboard design & saturation signals
A pool dashboard exists to answer one question fast: is the pool the bottleneck right now? Effective layouts lead with the saturation triad — utilization, pending count, and acquisition wait p99 — stacked so an on-call engineer reads cause and effect top to bottom. Utilization (active / total) shows headroom; pending shows whether the queue has formed; wait p99 shows the latency the queue is inflicting. A panel that plots only connection counts without the wait histogram answers “how full” but not “how much it hurts,” which is the question that matters.
Saturation has a characteristic signature that good dashboards make obvious. As demand rises, idle drains to zero first while active climbs toward total. Once idle hits zero and active plateaus at maximumPoolSize, any further demand spills into pending, and acquisition wait p99 begins climbing immediately. This ordering — idle exhaustion, then active plateau, then pending growth, then wait-time climb — is the canonical exhaustion sequence, and dashboards should render the four series on a shared time axis so the cascade is visible at a glance.
| Panel | Expression (Prometheus-style) | Healthy | Investigate |
|---|---|---|---|
| Utilization | active / total | < 0.75 | > 0.90 sustained |
| Idle headroom | idle | > 1 | = 0 sustained |
| Pending queue | pending | 0 | > 0 for > 1m |
| Wait p99 | histogram_quantile(0.99, acquire_bucket) | far below connectionTimeout | > 50% of connectionTimeout |
| Creation p95 | histogram_quantile(0.95, create_bucket) | stable | rising trend |
| Timeout rate | rate(timeout_total[5m]) | 0 | > 0 |
Per-stack panel expressions vary in form but not in intent. For HikariCP the utilization panel is hikaricp_connections_active / hikaricp_connections_max; for Go it depends on the names your collector registers, for example go_sql_in_use_connections / go_sql_max_open_connections with client_golang's built-in DBStatsCollector; for PgBouncer it is pgbouncer_pools_client_active / (pgbouncer_pools_client_active + pgbouncer_pools_client_waiting), with exact metric names varying by exporter, alongside a separate cl_waiting panel. When tuning the JVM-side numbers behind these panels, HikariCP Configuration Deep Dive maps each gauge back to the parameter that bounds it. The diagnostic discipline of reading these panels under load is expanded in Detecting Connection Pool Saturation.
Operational Boundary: This section covers what to visualize and how saturation presents on a pool dashboard. Building the underlying queries and provisioning the dashboard JSON is implementation detail handled in the related Prometheus and Grafana guide.
Alerting thresholds & formulas
Alerts convert continuous metrics into discrete pages, and the hard part is choosing thresholds that fire on real saturation without flapping on momentary spikes. Pool alerting falls into three tiers: a leading warning that fires while headroom remains, a saturation alert that fires when the queue forms, and a hard-failure alert that fires when borrows are already failing. Each tier needs a sustained for duration to suppress transient noise.
The foundational saturation alert is sustained pending > 0. A pool with pending == 0 is, by definition, serving every borrow without queueing — there is no acquisition pain regardless of how full it is. The instant pending stays above zero for a meaningful window, demand has structurally exceeded maximumPoolSize. Pair this with a wait-percentile alert tied to the timeout: when acquisition wait p99 exceeds roughly half of connectionTimeout, borrow failures are one traffic step away, because the p99 is the leading edge of the distribution whose tail will cross the timeout next.
| Tier | Condition | Threshold formula | for |
|---|---|---|---|
| Warning | High utilization | active / total > 0.90 | 5m |
| Warning | Idle exhausted | idle == 0 | 5m |
| Saturation | Queue forming | pending > 0 | 2m |
| Saturation | Wait tail rising | wait_p99 > 0.5 * connectionTimeout | 2m |
| Critical | Borrows failing | rate(timeout_total[5m]) > 0 | 0m |
| Critical | Leak detected | increase(leak_total[10m]) > 0 | 0m |
| Critical | Exporter down | up{job="pool"} == 0 | 1m |
Two threshold formulas deserve precision. First, utilization: alert at active / total > 0.90 rather than >= 1.0, because a pool pinned at 100% with zero pending is momentarily fine but has no headroom for the next burst; 0.90 sustained over 5 minutes is the leading indicator that buys remediation time. Second, the wait-vs-timeout relationship: if connectionTimeout is 30s and wait p99 passes 15s, the slowest 1% of borrows are already more than halfway to outright failure, so the saturation alert should fire there rather than waiting for the timeout counter to increment. The timeout counter itself is a critical, zero-tolerance alert because any increment is a request that failed to get a connection, which is a user-visible error and not a warning.
Leak alerting is distinct from saturation: a nonzero leak counter means a connection was held longer than leakDetectionThreshold, which points to a code path (often an unclosed handle in a background task) that never returned its connection. Leaks manifest as a slow, monotonic decline in available connections that eventually presents as exhaustion, so alerting on the leak counter directly catches the root cause before the symptom. Exporter-down (up == 0) is the meta-alert that protects all the others; without it, a crashed exporter silences every pool alert simultaneously, and the absence of alerts is misread as health. Production alert-rule construction and routing for these tiers is detailed in Alerting on Connection Pool Saturation.
Operational Boundary: This section defines threshold formulas and alert tiers for pool metrics. Incident response runbooks, escalation policies, and remediation actions (resizing, restarting, failover) are operational procedures that consume these alerts rather than being defined here.
Failure modes & degradation patterns
Pool failures present through telemetry in recognizable shapes, and the value of observability is that each shape has a distinct metric fingerprint. Misreading one for another sends remediation in the wrong direction — adding connections to a leak, for example, only delays the same exhaustion.
| Failure mode | Metric fingerprint | Distinguishing signal | First check |
|---|---|---|---|
| Pool exhaustion | active = total, pending climbing | Wait p99 spikes, idle = 0 | Is demand legitimate or a leak? |
| Leak cascade | active rises monotonically, never recedes | Leak counter > 0, no traffic correlation | Stack traces from leak detector |
| Timeout storm | Timeout counter spikes | Coincides with creation-time spike | Backend reachability / latency |
| Proxy mismatch | App pool healthy, latency high | PgBouncer cl_waiting rising | SHOW POOLS on the proxy |
| Slow backend | Creation p95 rising, active held longer | DB-side latency correlates | Database query latency |
Pool exhaustion and leak cascade look identical at the moment of failure — both end with active == total and a growing pending queue — but their fingerprints diverge upstream. Genuine exhaustion correlates with a traffic increase and recovers when traffic recedes; a leak rises monotonically regardless of traffic and never recovers without a restart, with the leak counter as the smoking gun. This distinction is exactly why the leak counter is instrumented separately rather than inferred from the gauges.
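That upstream divergence can be sketched as a triage heuristic over recent history. It assumes the pool has already reached active == total, and that you have aligned series of the active gauge and request rate plus the leak counter's increase; the function name, labels, and sample data are hypothetical, and a real diagnosis would look at more than the endpoints of the traffic series.

```go
package main

import "fmt"

// classify distinguishes a leak from demand-driven exhaustion for a pool that
// is already exhausted. active and requestsPerSec are aligned samples over
// the same window; leakIncrease is the leak counter's growth in that window.
func classify(active []int, requestsPerSec []float64, leakIncrease int64) string {
	if leakIncrease > 0 {
		return "leak" // the leak counter is direct evidence
	}
	if len(active) < 2 || len(active) != len(requestsPerSec) {
		return "insufficient data"
	}
	neverRecedes := true
	for i := 1; i < len(active); i++ {
		if active[i] < active[i-1] {
			neverRecedes = false
			break
		}
	}
	trafficFell := requestsPerSec[len(requestsPerSec)-1] < requestsPerSec[0]
	if neverRecedes && trafficFell {
		return "suspected leak" // active kept climbing while traffic receded
	}
	return "demand-driven exhaustion"
}

func main() {
	// Active keeps rising although traffic has dropped: looks like a leak.
	fmt.Println(classify([]int{12, 15, 18, 20}, []float64{400, 380, 300, 250}, 0))
	// Active follows a traffic surge: ordinary exhaustion.
	fmt.Println(classify([]int{12, 16, 20, 20}, []float64{400, 650, 900, 950}, 0))
}
```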
The proxy-mismatch mode is the most commonly misdiagnosed because the application’s own pool metrics look perfectly healthy. When a PgBouncer-fronted pool shows low active and zero pending on the app side yet requests are slow, the queue has formed on the proxy’s cl_waiting, invisible to any app-side instrument. This is the structural reason both pools in a series topology must be scraped, and the proxy-specific extraction and thresholds are covered in PgBouncer Metrics Monitoring. A timeout storm, by contrast, is recognized by a timeout-counter spike that coincides with a creation-time spike: the pool is trying to open new connections, the backend is slow or unreachable to grant them, and borrows time out waiting on connections that never finish establishing.
Creation-time telemetry is the most underused leading indicator. Under steady state a warm pool opens almost no new connections, so the creation histogram is sparse and flat. A sustained rise in creation p95 — even while utilization looks comfortable — means new connections are taking longer to establish, which precedes a timeout storm by minutes. The usual causes are backend connection saturation (the database is at max_connections and queuing handshakes), TLS negotiation overhead, or an authentication path under load. Plotting creation p95 alongside the timeout counter lets an operator intervene during the slow-handshake phase rather than after borrows have already begun failing. This same creation overhead is the reason aggressive minimumIdle warm-up and bounded maxLifetime recycling matter; the sizing rationale connects back to Pool Architecture & Algorithm Fundamentals, where connection establishment cost feeds directly into steady-state pool sizing.
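A "sustained rise" in creation p95 needs some operational definition. One simple option, sketched here, compares the average of the most recent samples against an earlier baseline window; the window size, the 1.5x factor, and the sample values are illustrative assumptions, not recommended settings.

```go
package main

import (
	"fmt"
	"time"
)

// avg returns the mean of the samples, or 0 for an empty slice.
func avg(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	var sum time.Duration
	for _, s := range samples {
		sum += s
	}
	return sum / time.Duration(len(samples))
}

// rising reports whether the last `window` samples average more than
// factor times the first `window` samples of the series.
func rising(p95 []time.Duration, window int, factor float64) bool {
	if window <= 0 || len(p95) < 2*window {
		return false
	}
	baseline := avg(p95[:window])
	recent := avg(p95[len(p95)-window:])
	return float64(recent) > float64(baseline)*factor
}

// ms converts whole milliseconds into durations for readable sample data.
func ms(values ...int) []time.Duration {
	out := make([]time.Duration, len(values))
	for i, v := range values {
		out[i] = time.Duration(v) * time.Millisecond
	}
	return out
}

func main() {
	flat := ms(20, 21, 19, 20, 22, 20, 21, 20)
	climbing := ms(20, 20, 21, 19, 40, 60, 80, 90)
	fmt.Println(rising(flat, 3, 1.5), rising(climbing, 3, 1.5))
}
```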
Operational Boundary: This section classifies failure signatures as they appear in pool telemetry. The corrective actions — emergency resizing, leak code fixes, proxy reconfiguration, backend failover — are remediation procedures owned by the relevant stack and incident-response guides.
Related implementation guides
This overview defines the metrics and pipelines; the following guides implement each surface in production.
- Prometheus and Grafana Pool Metrics: wiring Micrometer and Go collectors to Prometheus and building the saturation dashboard, including the deep dive on Exposing HikariCP Metrics with Micrometer and Prometheus.
- PgBouncer Metrics Monitoring: extracting SHOW POOLS and SHOW STATS into scrapeable series and reading proxy-side saturation.
- Detecting Connection Pool Saturation: the diagnostic playbook for distinguishing exhaustion, starvation, and proxy queueing under load.
Related
- Pool Architecture & Algorithm Fundamentals — the queueing and sizing theory behind the metrics observed here.
- Prometheus and Grafana Pool Metrics — implement the pull pipeline and dashboard panels.
- PgBouncer Metrics Monitoring — instrument the proxy-side pool in a series topology.
- Detecting Connection Pool Saturation — read saturation signals and triage exhaustion versus starvation.
- HikariCP Configuration Deep Dive — map each JVM pool gauge back to the parameter that bounds it.