Prometheus and Grafana Pool Metrics
This guide is part of Connection Pool Observability, and it focuses on the concrete pipeline that turns raw connection-pool counters into time-series data, dashboards, and alerts. Prometheus scrapes a numeric snapshot of pool state from each application instance or proxy exporter, stores it as time series, and Grafana renders saturation, acquisition-wait, and leak signals from that store. Getting the metric names, scrape cadence, and PromQL right is the difference between a dashboard that predicts exhaustion minutes ahead and one that only confirms an outage after the timeouts have already started.
Key operational takeaways:
- The four signals that matter are utilization (active/max), pending demand (pending > 0), acquisition-wait tail latency (acquire_seconds p99), and connection churn (creation/timeout counters).
- Scrape every 15s for production pools; longer intervals miss short saturation spikes that still cause timeouts.
- Compute utilization and wait percentiles with recording rules so dashboards and alerts read pre-aggregated series instead of evaluating heavy histogram_quantile() queries on every render.
- hikaricp_connections_pending is the earliest leading indicator of exhaustion; it rises before timeout exceptions appear in logs.
- Tag every series with pool and application (or job/instance) so a single dashboard distinguishes between isolated pools and per-instance behavior.
Foundational mechanics
Prometheus is a pull-based system. Each target exposes a plaintext endpoint — /actuator/prometheus for Spring Boot, /metrics for Go and Node services or standalone exporters — and Prometheus issues an HTTP GET on a fixed interval. The exposition format is one metric sample per line: a name, a label set in braces, and a numeric value. Pool metrics arrive as gauges (instantaneous counts like active connections) and histograms (acquisition latency, exposed as _bucket, _sum, and _count series).
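To make the one-sample-per-line shape concrete, here is a minimal sketch of a parser for a single exposition line. The function name parseSample and the pool name orders-db are illustrative assumptions, and the parser deliberately ignores escaped quotes, trailing timestamps, and special values such as +Inf.

```javascript
// Parse one sample line of the Prometheus text exposition format.
// Simplifications (assumptions): label values contain no escaped quotes,
// the optional trailing timestamp is ignored, and special values such as
// "+Inf" are not converted.
function parseSample(line) {
  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(line);
  if (!match) return null; // "# HELP" / "# TYPE" comments and blank lines
  const [, name, rawLabels = "", rawValue] = match;
  const labels = {};
  for (const [, key, value] of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
    labels[key] = value;
  }
  return { name, labels, value: Number(rawValue) };
}

// Hypothetical line as a Spring Boot target might expose it.
const sample = parseSample('hikaricp_connections_active{pool="orders-db"} 7.0');
console.log(sample.name, sample.labels.pool, sample.value);
```

In practice Prometheus does this parsing itself; the sketch is only meant to show which part of the line becomes the metric name, which part becomes labels, and which part becomes the sample value.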
The distinction between gauge and histogram dictates how you query each metric, and getting it wrong is the most common reason a dashboard shows the wrong number. A gauge such as hikaricp_connections_active is a level — read it directly, aggregate it with max, avg, or sum, but never apply rate() to it. A histogram such as hikaricp_connections_acquire_seconds is exposed as a family of cumulative buckets (_bucket{le="0.01"}, le="0.05", and so on) plus a running _sum and _count. You never read a histogram bucket directly; you apply rate() to the buckets over a window and feed the result to histogram_quantile() to recover a percentile. Mixing these up — applying rate() to a gauge or reading a raw bucket as if it were a count — produces nonsense that looks plausible on a graph, which is why the PromQL table later in this guide pins the correct shape for each.
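The reason a raw bucket cannot be read directly becomes clearer with a simplified sketch of the interpolation histogram_quantile() performs. This is an approximation of the documented behaviour, not the Prometheus source; the bucket rates below are made-up numbers, and q is assumed to lie in (0, 1].

```javascript
// Simplified sketch of quantile estimation from cumulative buckets.
// Input: [{ le, rate }] where `rate` is the per-second rate of each
// cumulative bucket and one entry has le = Infinity (the +Inf bucket).
function histogramQuantile(q, buckets) {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].rate; // +Inf bucket counts everything
  if (total === 0) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.rate >= rank);
  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].rate;
  const inBucket = sorted[i].rate - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical per-second bucket rates for acquire_seconds.
const acquireBuckets = [
  { le: 0.01, rate: 90 },
  { le: 0.05, rate: 98 },
  { le: 0.1, rate: 100 },
  { le: Infinity, rate: 100 },
];
console.log(histogramQuantile(0.99, acquireBuckets).toFixed(3)); // "0.075"
```

The result is an estimate bounded by the bucket edges, which is why bucket boundaries near connectionTimeout matter more than a large number of buckets at the fast end.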
Micrometer can publish the timing metrics as either a Prometheus histogram or a client-side summary, controlled by whether you enable percentile histograms (distribution.percentiles-histogram). Histograms expose le buckets that Prometheus aggregates across instances, so a fleet-wide p99 is mathematically sound. Summaries expose pre-computed quantile labels per instance that cannot be averaged across instances — a p99 of p99s is meaningless. For any acquisition-wait percentile you intend to aggregate across replicas, enable histogram buckets; otherwise the cross-instance tail latency on your dashboard is statistically invalid.
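A small numeric illustration of why averaging per-instance quantiles misleads. The two replicas and their latencies are invented for the example, and the percentile helper works on raw observations, which Prometheus never sees; it is only here to show the arithmetic.

```javascript
// Nearest-rank percentile over raw observations (illustration only).
function percentile(values, q) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil(q * sorted.length) - 1);
  return sorted[index];
}

// Hypothetical fleet: one healthy replica and one struggling replica.
const healthy = Array(900).fill(0.002); // 900 acquires at 2 ms
const struggling = Array(100).fill(0.8); // 100 acquires at 800 ms

// "p99 of p99s": average the per-instance quantiles.
const averagedP99 = (percentile(healthy, 0.99) + percentile(struggling, 0.99)) / 2;
// Fleet-wide p99: merge the observations first, then take the quantile.
const fleetP99 = percentile([...healthy, ...struggling], 0.99);

console.log({ averagedP99, fleetP99 });
```

Here the averaged figure comes out near 0.4 s while the true fleet-wide p99 is 0.8 s. Summing le buckets across instances before calling histogram_quantile() corresponds to the merge-first path.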
The standard HikariCP metric set, produced by Micrometer, is the de facto reference shape. Each is tagged with a pool label matching the configured pool name:
| Metric | Type | Meaning |
|---|---|---|
| hikaricp_connections_active | gauge | Connections currently checked out and executing |
| hikaricp_connections_idle | gauge | Connections sitting idle in the pool, ready to lend |
| hikaricp_connections_pending | gauge | Threads blocked waiting for a connection to free up |
| hikaricp_connections | gauge | Total connections (active + idle) |
| hikaricp_connections_max | gauge | Configured maximumPoolSize |
| hikaricp_connections_min | gauge | Configured minimumIdle |
| hikaricp_connections_acquire_seconds | summary/histogram | Time spent waiting to borrow a connection |
| hikaricp_connections_usage_seconds | summary/histogram | Time a connection stayed checked out |
| hikaricp_connections_creation_seconds | summary/histogram | Time to physically open a new connection |
| hikaricp_connections_timeout_total | counter | Cumulative connectionTimeout failures |
Go’s database/sql package exposes pool state through DB.Stats(). The prometheus/client_golang collector and community wrappers translate that struct into series such as go_sql_open_connections, go_sql_in_use_connections, go_sql_idle_connections, go_sql_wait_count_total, and go_sql_wait_duration_seconds_total. The wait_count and wait_duration pair is Go’s equivalent of HikariCP’s pending and acquire metrics — a rising wait_duration rate means goroutines are queuing for a slot governed by SetMaxOpenConns.
Two differences from the Java model trip up first-time Go observability work. First, database/sql has no instantaneous “pending” gauge; it reports only cumulative wait_count and wait_duration counters, so you must take their rate() to see current queuing rather than reading a level. Second, the MaxIdleClosed and MaxLifetimeClosed counters expose connection churn directly. With the client_golang collector’s naming, a high rate(go_sql_max_idle_closed_total[5m]) means SetMaxIdleConns is set too low relative to load and the pool is repeatedly opening and discarding connections, a churn pattern invisible from the open-connection gauge alone.
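Because Go exposes only cumulative counters, every queuing signal is a difference between two scrapes. The sketch below shows that arithmetic with invented sample values; counterRate is a hypothetical helper and its reset handling is a simplification of what PromQL's rate() does.

```javascript
// Per-second increase between two scrapes of a cumulative counter.
// A drop in value is treated as a counter reset (process restart).
function counterRate(prev, curr) {
  const elapsedSeconds = (curr.timestampMs - prev.timestampMs) / 1000;
  const increase = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return increase / elapsedSeconds;
}

// Hypothetical scrapes 15 s apart.
const waitCount = [
  { timestampMs: 0, value: 1200 },
  { timestampMs: 15000, value: 1260 },
];
const waitDuration = [
  { timestampMs: 0, value: 30.0 },
  { timestampMs: 15000, value: 36.0 },
];

const waitsPerSecond = counterRate(waitCount[0], waitCount[1]); // blocked acquires per second
const waitSecondsPerSecond = counterRate(waitDuration[0], waitDuration[1]); // time spent queuing
const meanWaitSeconds = waitSecondsPerSecond / waitsPerSecond; // average wait per blocked acquire

console.log({ waitsPerSecond, waitSecondsPerSecond, meanWaitSeconds });
```

The same division can be written in PromQL as rate(go_sql_wait_duration_seconds_total[5m]) / rate(go_sql_wait_count_total[5m]), which yields a mean wait per blocked acquisition rather than a percentile.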
Node.js services typically use prom-client to register custom gauges around a node-postgres Pool. The pool object exposes pool.totalCount, pool.idleCount, and pool.waitingCount; a Gauge with a collect() callback samples these at scrape time. The resulting series — nodejs_pg_pool_total, nodejs_pg_pool_idle, nodejs_pg_pool_waiting — mirror the HikariCP triplet of total, idle, and pending. Because Node pools are per-process, always carry an instance or pod label so per-worker exhaustion is visible.
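const client = require("prom-client");
const { Pool } = require("pg");

// Connection settings come from the standard PG* environment variables.
const pgPool = new Pool();

// Each gauge reads the live pool inside collect(), which runs at scrape time.
new client.Gauge({
  name: "nodejs_pg_pool_total",
  help: "Total connections in the node-postgres pool",
  labelNames: ["pool"],
  collect() { this.set({ pool: "primary" }, pgPool.totalCount); },
});
new client.Gauge({
  name: "nodejs_pg_pool_idle",
  help: "Idle connections in the node-postgres pool",
  labelNames: ["pool"],
  collect() { this.set({ pool: "primary" }, pgPool.idleCount); },
});
new client.Gauge({
  name: "nodejs_pg_pool_waiting",
  help: "Queued connection requests",
  labelNames: ["pool"],
  collect() { this.set({ pool: "primary" }, pgPool.waitingCount); },
});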
The collect() callback is the important detail: it samples the pool object at the moment Prometheus scrapes, so the gauge is always current rather than reflecting whatever value was last manually set. A frequent Node mistake is calling gauge.set() on an interval inside the app — this double-counts under clustering and drifts from real pool state between intervals. Sampling inside collect() ties the metric directly to the scrape and keeps it honest. Under the Node cluster module or PM2, each worker has an independent pool and its own /metrics; either scrape every worker on a distinct port or use the AggregatorRegistry to merge them, and never collapse the instance label, or a single saturated worker disappears into the fleet average.
Precision sizing & timeout orchestration
The scrape configuration controls resolution. The interval must be short enough that a transient saturation burst produces at least two or three samples; with a 60-second interval, for example, a 30-second exhaustion event can fall entirely between scrapes and never appear on a graph.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: spring-services
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: ["orders:8080", "billing:8080"]
        labels:
          application: orders-platform
  - job_name: go-services
    metrics_path: /metrics
    static_configs:
      - targets: ["ledger:2112"]
  - job_name: node-services
    metrics_path: /metrics
    static_configs:
      - targets: ["gateway:9091"]
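The relationship between burst length and scrape interval can be checked with simple arithmetic. The helper below is a hypothetical sketch that assumes perfectly regular scrapes and returns the worst-case number of samples that land inside a burst, whatever its alignment with the scrape schedule.

```javascript
// Minimum number of scrapes guaranteed to land inside a burst, regardless of
// how the burst lines up with the scrape schedule (assumes regular scrapes).
function guaranteedSamples(burstSeconds, scrapeIntervalSeconds) {
  return Math.floor(burstSeconds / scrapeIntervalSeconds);
}

for (const interval of [15, 30, 60]) {
  console.log(`interval=${interval}s -> at least ${guaranteedSamples(30, interval)} sample(s) of a 30s burst`);
}
```

With a 15s interval a 30-second burst is always sampled at least twice; at 60s the guaranteed count drops to zero, so the burst may never be recorded.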
The label matrix below maps each metric family to the PromQL aggregation that produces an actionable signal. Keep the by (pool) or by (application, pool) grouping consistent across panels and rules so series line up.
| Signal | Source metric | PromQL | What it tells you |
|---|---|---|---|
| Utilization | hikaricp_connections_active, _max | hikaricp_connections_active / hikaricp_connections_max | Fraction of the pool in use; ≥ 0.9 sustained = undersized |
| Pending demand | hikaricp_connections_pending | max_over_time(hikaricp_connections_pending[1m]) | Any value > 0 means threads are queuing |
| Acquire wait p99 | hikaricp_connections_acquire_seconds_bucket | histogram_quantile(0.99, sum by (le, pool) (rate(hikaricp_connections_acquire_seconds_bucket[5m]))) | Tail borrow latency; compare to connectionTimeout |
| Timeout rate | hikaricp_connections_timeout_total | rate(hikaricp_connections_timeout_total[5m]) | Confirmed exhaustion failures per second |
| Go wait latency | go_sql_wait_duration_seconds_total | rate(go_sql_wait_duration_seconds_total[5m]) | Seconds spent waiting per second of wall clock |
| Node waiting | nodejs_pg_pool_waiting | max by (pod) (nodejs_pg_pool_waiting) | Per-process queue depth |
For utilization to be meaningful, hikaricp_connections_max must reflect the real configured ceiling. If you tune maximumPoolSize per the formulas in HikariCP Configuration Deep Dive, the max series moves with it automatically, so the utilization ratio stays correct across config changes without editing dashboards.
The window length in each rate() and _over_time() call is a deliberate trade-off. A 5-minute window over the acquire-seconds buckets smooths sampling noise and yields a stable p99, but it also lags real events by up to five minutes — acceptable for trend panels, too slow for a tight alert. For the pending gauge, use a short max_over_time(...[1m]) so a single 20-second burst still registers; averaging pending over five minutes hides exactly the transient spikes you most want to catch. Match the alert’s for: duration to the window so the rule does not fire on a one-scrape blip yet still pages within a couple of minutes of sustained pressure. The fleet-wide convention should be: short windows and max_over_time for leading-indicator alerts, longer rate() windows for capacity-planning dashboards.
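The difference between taking the maximum and taking the average over a window is easy to see on a made-up series. The sketch below assumes 20 samples of the pending gauge at a 15s scrape interval (five minutes), with one short burst in the middle.

```javascript
// Hypothetical pending-gauge samples: 20 scrapes at 15 s, one 30 s burst.
const pendingSamples = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 4, 0, 0, 0, 0, 0, 0, 0, 0];

// Equivalent in spirit to max_over_time() and avg_over_time() on one series.
const maxOverTime = (samples) => Math.max(...samples);
const avgOverTime = (samples) => samples.reduce((sum, v) => sum + v, 0) / samples.length;

console.log("max:", maxOverTime(pendingSamples)); // 6
console.log("avg:", avgOverTime(pendingSamples)); // 0.5
```

The maximum preserves the burst of six queued threads, while the five-minute average reduces it to 0.5, a value that an alert thresholded on whole queued threads would never see.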
Cardinality discipline matters as much as window choice. Every distinct combination of application, pool, and instance is a separate time series, and the histogram buckets multiply that by the number of le boundaries. Keep labels to the few that you actually slice on. Do not add per-request, per-query, or per-user labels to pool metrics — they explode series count, slow rule evaluation, and bloat the TSDB without improving any saturation signal. The four labels worth carrying are application, pool, instance, and le (on histograms only).
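A rough estimate of series count shows how quickly an extra label multiplies storage. The function and the fleet numbers below are illustrative assumptions; the only fixed fact used is that each labelled histogram exposes one _bucket series per le boundary (including +Inf) plus a _sum and a _count series.

```javascript
// Estimate the number of time series produced by one histogram metric.
// `bucketBoundaries` counts every le value, including +Inf.
function histogramSeries({ instances, poolsPerInstance, bucketBoundaries, extraLabelValues = 1 }) {
  const seriesPerLabelSet = bucketBoundaries + 2; // buckets + _sum + _count
  return instances * poolsPerInstance * extraLabelValues * seriesPerLabelSet;
}

// Hypothetical fleet: 20 instances, 2 pools each, 12 bucket boundaries.
const baseline = histogramSeries({ instances: 20, poolsPerInstance: 2, bucketBoundaries: 12 });
// Same fleet after adding a per-query label with 50 distinct values.
const withQueryLabel = histogramSeries({
  instances: 20,
  poolsPerInstance: 2,
  bucketBoundaries: 12,
  extraLabelValues: 50,
});

console.log({ baseline, withQueryLabel }); // { baseline: 560, withQueryLabel: 28000 }
```

A single additional label with 50 values turns 560 series into 28,000 for one metric family, without adding any information about pool saturation.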
Production configuration examples
Recording rules pre-compute the expensive percentile and ratio queries once per evaluation interval and store them as new series. Dashboards and alerts then read a cheap gauge instead of re-running histogram_quantile() for every panel and every rule, which keeps both Grafana and the Prometheus rule evaluator fast under fan-out.
groups:
  - name: connection-pool.rules
    interval: 15s
    rules:
      - record: pool:utilization:ratio
        expr: hikaricp_connections_active / hikaricp_connections_max
      - record: pool:acquire_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, application, pool) (
              rate(hikaricp_connections_acquire_seconds_bucket[5m])
            )
          )
      - record: pool:timeout:rate5m
        expr: rate(hikaricp_connections_timeout_total[5m])
Alerting rules consume the recorded series. Alert on leading indicators — sustained pending threads and high utilization — not only on confirmed timeouts, so the page fires before customer-facing errors begin.
groups:
  - name: connection-pool.alerts
    rules:
      - alert: PoolSaturationImminent
        expr: pool:utilization:ratio > 0.9 and max_over_time(hikaricp_connections_pending[1m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Pool {{ $labels.pool }} near saturation"
          description: "Utilization above 90% with threads pending for 2m."
      - alert: PoolAcquireLatencyHigh
        expr: pool:acquire_seconds:p99 > 0.5
        for: 5m
        labels:
          severity: warning
      - alert: PoolConnectionTimeouts
        expr: pool:timeout:rate5m > 0
        for: 1m
        labels:
          severity: critical
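How the for: clause delays a page can be sketched as a small state machine. This is a simplified model of the documented pending and firing behaviour, not the Prometheus implementation; alertStates is a hypothetical helper and the evaluation results are invented.

```javascript
// Simplified model of an alert rule with a `for:` duration.
// conditionPerEval[i] is whether the expression returned a result at
// evaluation i; evaluations are assumed to be evenly spaced.
function alertStates(conditionPerEval, evalIntervalSeconds, forSeconds) {
  let activeSince = null;
  return conditionPerEval.map((isTrue, i) => {
    const now = i * evalIntervalSeconds;
    if (!isTrue) {
      activeSince = null; // any false evaluation resets the timer
      return "inactive";
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSeconds ? "firing" : "pending";
  });
}

// Sustained pressure: true for nine consecutive 15 s evaluations, for: 2m.
console.log(alertStates(Array(9).fill(true), 15, 120));
// One-scrape blip: the timer restarts after the false evaluation.
console.log(alertStates([true, true, false, true], 15, 120));
```

In the first run the alert stays pending for eight evaluations and fires on the ninth, two minutes after the condition first held. In the second, the single false evaluation resets it, which is the protection against one-scrape blips.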
Grafana panel design follows the four-signal model. Build one row per concern rather than one giant graph:
- A saturation time series stacking hikaricp_connections_active and hikaricp_connections_idle with a dashed threshold line at hikaricp_connections_max, so the headroom between in-use and ceiling is visible at a glance. Add pool:utilization:ratio as a second axis or a gauge panel with thresholds at 0.75 (amber) and 0.9 (red).
- A wait p99 panel plotting pool:acquire_seconds:p99 per pool, with a horizontal marker at the connectionTimeout value. When the p99 line approaches that marker, timeouts are imminent.
- A leak / churn panel showing rate(hikaricp_connections_creation_seconds_count[5m]) (new connections per second; high churn implies maxLifetime cycling or leaks) alongside pool:timeout:rate5m and any leakDetectionThreshold log-derived counter.
- A pending stat panel using max_over_time(hikaricp_connections_pending[5m]) colored red on any nonzero value, because pending demand is the cleanest early warning.
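Use template variables $application and $pool so one dashboard serves every service: bind $application to label_values(hikaricp_connections_max, application) and $pool to label_values(hikaricp_connections_max{application="$application"}, pool).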
Panel ergonomics decide whether an on-call engineer reads the dashboard correctly at 3 a.m. Set fixed thresholds with color steps rather than relative coloring, so amber and red mean the same fraction on every pool. Pin the y-axis of the saturation panel to start at zero and the utilization gauge to a 0–1 range; auto-scaling makes a half-full pool look alarming and a saturated one look calm. Annotate deploys onto the time series so a sudden churn or utilization shift can be tied to a release at a glance. Keep the panels that answer “are we saturated right now” — pending stat, utilization gauge — at the top-left where the eye lands first, and push capacity-planning trend graphs lower. A dashboard that requires interpretation under pressure is a dashboard that gets ignored during the incident it was built for.
Diagnostics & telemetry
When a panel shows trouble, drill from the aggregate to the instance. Utilization at 1.0 with pending > 0 across all instances means the pool is genuinely undersized for offered load; utilization at 1.0 on a single instance while others idle points to skewed routing or a stuck instance leaking connections. Split any saturation query by (instance) to tell these apart before you change maximumPoolSize.
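The drill-down logic above can be written as a small classifier over per-instance readings. The function name, the 0.9 threshold, the result labels, and the instance data are all illustrative assumptions; the point is only the distinction between fleet-wide and single-instance saturation.

```javascript
// Classify a set of per-instance pool readings.
// An instance counts as saturated when utilization is at or above the
// threshold and at least one thread is pending.
function classifySaturation(instances, threshold = 0.9) {
  const saturated = instances.filter(
    (i) => i.active / i.max >= threshold && i.pending > 0
  );
  if (saturated.length === 0) return "healthy";
  if (saturated.length === instances.length) return "undersized-pool";
  return "instance-skew"; // some instances saturated, others not
}

// Hypothetical readings from one scrape.
const fleet = [
  { instance: "orders-1", active: 20, max: 20, pending: 5 },
  { instance: "orders-2", active: 3, max: 20, pending: 0 },
  { instance: "orders-3", active: 4, max: 20, pending: 0 },
];
console.log(classifySaturation(fleet)); // "instance-skew"
```

A result of instance-skew suggests investigating routing or a leak on the saturated instance before touching maximumPoolSize, whereas undersized-pool points at capacity.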
Rising rate(hikaricp_connections_creation_seconds_count[5m]) with stable load indicates connections are being recycled faster than expected — usually maxLifetime set too low, the database closing idle sessions, or a proxy resetting connections. Correlate with the database-side session count; a churn spike that matches maxLifetime is benign rotation, while churn with no config explanation is a sign of mid-flight connection drops.
For Go services, rate(go_sql_wait_count_total[5m]) climbing while go_sql_open_connections sits pinned at the SetMaxOpenConns ceiling is the unambiguous signature of pool exhaustion. There is no separate pending gauge — the wait counter rate is your saturation signal. These metric-driven workflows are the data layer beneath Detecting Connection Pool Saturation, which covers the broader diagnostic decision tree.
The single most valuable add to any Java stack is the Micrometer-to-Prometheus binding, because it produces the entire hikaricp_* family without custom instrumentation. The exact wiring, the dependency, and the most common reason the metrics fail to appear are covered in Exposing HikariCP Metrics with Micrometer and Prometheus.
Integration & proxy compatibility
A connection proxy adds a second pool with its own saturation surface, and the application-side metrics alone will mislead you when one sits in front of the other. If HikariCP runs against PgBouncer (see PgBouncer Transaction vs Statement Pooling), the HikariCP pool can look perfectly healthy, with low utilization and zero pending, while PgBouncer’s own client-waiting queue is saturated against the backend. Scrape both layers. PgBouncer exposes SHOW POOLS counters through a dedicated exporter; collect cl_waiting and sv_active alongside the HikariCP gauges on the same dashboard so the bottleneck is unambiguous.
The two pools must be sized as a system. A HikariCP maximumPoolSize larger than PgBouncer’s default_pool_size * pool count simply pushes the queue from the application into the proxy, where it shows up as cl_waiting rather than hikaricp_connections_pending. Aligning the metric dashboards across both tiers makes that handoff visible. The same principle applies to cloud proxies: RDS Proxy borrow timeouts surface as application-side acquire_seconds spikes with no corresponding rise in backend database sessions, which is precisely the cross-tier discrepancy a combined dashboard reveals.
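A back-of-the-envelope check for the two-tier sizing rule might look like the sketch below. It assumes, beyond what the text states, that demand is summed across application instances and that every PgBouncer pool has the same default_pool_size; all numbers and the function name proxyQueueRisk are hypothetical.

```javascript
// Compare worst-case client demand against the backend slots PgBouncer offers.
// Assumption: every application instance can open up to maximumPoolSize
// connections to the proxy at the same time.
function proxyQueueRisk({ appInstances, maximumPoolSize, defaultPoolSize, pgbouncerPools }) {
  const clientDemand = appInstances * maximumPoolSize;
  const backendSlots = defaultPoolSize * pgbouncerPools;
  return {
    clientDemand,
    backendSlots,
    // Connections that would queue in the proxy (as cl_waiting) at full load.
    overflow: Math.max(0, clientDemand - backendSlots),
  };
}

console.log(
  proxyQueueRisk({ appInstances: 6, maximumPoolSize: 20, defaultPoolSize: 25, pgbouncerPools: 2 })
); // { clientDemand: 120, backendSlots: 50, overflow: 70 }
```

A nonzero overflow does not mean the configuration is wrong, since transaction pooling multiplexes clients over fewer server connections. It does indicate where queuing will appear under full load, and therefore which tier's waiting metric to alert on.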
For multi-datasource Spring applications, every pool emits its own pool-labeled series, so a single Prometheus job captures all of them. Ensure each HikariDataSource has a distinct pool-name; identical names collapse series and make per-datasource utilization impossible to read.
Service meshes add one more wrinkle. When sidecars proxy database traffic, connection-creation latency picks up the mesh hop, so hikaricp_connections_creation_seconds rises without any pool misconfiguration. Read creation latency as a relative trend per environment rather than against an absolute threshold, and correlate spikes with mesh-level metrics before blaming the pool. The acquire-wait metric, by contrast, measures only the in-process queue and is unaffected by the mesh — which makes it the cleaner saturation signal in meshed deployments.
Common failure patterns & remediation
| Symptom | Root cause | Exact fix | Validation |
|---|---|---|---|
| No hikaricp_* series at all | Micrometer Prometheus binding missing or pool not registered with the registry | Add micrometer-registry-prometheus; bind the pool’s MeterRegistry (see child guide) | curl -s host:8080/actuator/prometheus \| grep hikaricp_connections |
| Series exist but _max is 0 | Pool created before metrics registry attached | Let Spring Boot autoconfigure the DataSource, or set the registry before pool init | hikaricp_connections_max shows configured maximumPoolSize |
| Saturation spikes invisible on graph | scrape_interval too long | Set scrape_interval: 15s for pool jobs | count_over_time(...[1m]) returns about 4 samples at a 15s interval |
| Dashboard slow, rule lag | histogram_quantile() recomputed per panel | Move percentiles into recording rules | pool:acquire_seconds:p99 series present |
| Utilization > 1.0 | Stale _max series from removed pool | Add staleness handling; filter on live application label | Ratio stays within [0,1] |
| Alert never fires during outage | Alerting only on timeout_total, a lagging signal | Add pending + utilization leading-indicator alert | PoolSaturationImminent fires in test |
After any sizing change, confirm the dashboard reflects it: pool:utilization:ratio should drop and hikaricp_connections_max should jump to the new ceiling within one scrape interval. If the ratio does not move, the config did not take effect on the running instance — check that the pool was rebuilt, not just the property file.
Related
- Connection Pool Observability: the parent overview covering metrics, tracing, and saturation diagnostics across all pool types.
- Exposing HikariCP Metrics with Micrometer and Prometheus: fix missing hikaricp_* series at /actuator/prometheus.
- Detecting Connection Pool Saturation: the decision tree for reading these metrics during an incident.
- HikariCP Configuration Deep Dive: the sizing and timeout parameters these dashboards measure.
- PgBouncer Transaction vs Statement Pooling: the proxy tier to scrape alongside the application pool.