Alerting on Connection Pool Saturation

This guide is part of Detecting Connection Pool Saturation, and it tackles the failure that follows once detection works: the alert itself misbehaves. A pool saturation alert that fires after the timeouts have already cascaded is useless — the incident channel lights up at the same moment customers see 503s. The opposite failure is just as common: an alert keyed on raw utilization flaps every time a healthy burst pushes the pool to 100% for three seconds, and within a week the on-call team has muted it. The objective here is an alert that fires early enough to act, stays quiet through normal bursts, and does not flap. The symptom you are trying to pre-empt looks like this in the logs:

org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection;
  HikariPool-1 - Connection is not available, request timed out after 30000ms
  (total=20, active=20, idle=0, waiting=37)

By the time waiting=37 appears in a stack trace, the alert has already failed at its job. A good alert fires while waiting is still a handful and the wait percentile is climbing — minutes earlier.

Rapid Incident Diagnosis

When a saturation alert does fire, the first task is confirming it is real and not a slow-query or leak lookalike. Check three things in order. First, the pending/waiting gauge: hikaricp_connections_pending, PgBouncer cl_waiting, or the rate of Go WaitCount. Sustained non-zero is the confirmation. Second, the acquisition wait percentile against connectionTimeout — a p99 approaching the timeout means requests are nearly being refused. Third, mean hold time: flat hold time with high churn is genuine saturation; rising hold time points at slow queries, and a monotonic active-connection climb that never drains points at a leak. The full discrimination procedure lives in the parent guide; the alert’s job is only to get a human looking at these three within seconds, not minutes.
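The three checks can be precomputed as recording rules so that a dashboard already shows them when the alert fires. The following is a minimal sketch, assuming the Micrometer HikariCP metric names used elsewhere in this guide plus the hikaricp_connections_usage_seconds timer; the group and recorded series names are hypothetical, and the acquire buckets exist only if histogram publishing is enabled for that timer.

```yaml
groups:
  - name: connection-pool-diagnosis   # hypothetical group name
    rules:
      # Check 1: a value above 0 means the queue never drained in the last 2 minutes.
      - record: pool:hikaricp_connections_pending:min2m
        expr: min_over_time(hikaricp_connections_pending[2m])

      # Check 2: p99 acquisition wait in seconds; compare against connectionTimeout.
      - record: pool:hikaricp_connections_acquire_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(hikaricp_connections_acquire_seconds_bucket[5m])) by (le, pool))

      # Check 3: mean hold time in seconds = seconds held / connections returned.
      - record: pool:hikaricp_connections_usage_seconds:mean5m
        expr: |
          sum(rate(hikaricp_connections_usage_seconds_sum[5m])) by (pool)
          /
          sum(rate(hikaricp_connections_usage_seconds_count[5m])) by (pool)
```

Read together, a non-zero first series with a flat third series suggests genuine saturation, while a rising third series suggests slow queries.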

Threshold Formulas

Three primary conditions, each with a deliberate evaluation window. The window is what separates a real alert from a flapping one — short bursts must not trip a page.

| Condition | Threshold | Window | Rationale |
|---|---|---|---|
| Sustained high utilization | active / max > 0.85 | 5 minutes | Headroom is gone but timeouts have not started; time to act |
| Persistent queueing | pending > 0 | 2 minutes | Any sustained queue is abnormal; shorter window than utilization because queueing is more acute |
| Wait approaching timeout | acquire_p99 > 0.7 × connectionTimeout | 3 minutes | Tail requests are within 30% of being refused |

The 0.85 utilization threshold is not arbitrary: it sits just below the saturation knee where wait time turns non-linear (see the queueing curve in the parent guide), giving roughly a five-minute lead before queueing becomes severe. The pending > 0 condition uses a shorter two-minute window because a persistent queue is a more direct, more urgent signal than utilization alone; utilization can be high without queueing, but a non-zero pending count means callers are already waiting. The wait-percentile condition expresses the threshold as a fraction of connectionTimeout, so retuning the timeout changes one constant in the rule rather than the reasoning behind it. For the reasoning behind the timeout value itself, see Connection Acquisition Timeout Strategies.

Capacity Headroom Alert

Beyond the reactive thresholds, a Little’s Law headroom alert fires while the pool is still healthy. Required connections equal arrival rate times mean hold time; alert when that crosses 85% of maximumPoolSize. This is the earliest possible signal — it can fire before pending is ever non-zero, because it predicts saturation from the load itself rather than waiting for the queue to form. It is the leading edge of the alerting strategy described in Detecting Connection Pool Saturation.
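One way to express the headroom rule in PromQL is sketched below. It assumes Micrometer's hikaricp_connections_usage_seconds timer is being scraped (that metric does not appear elsewhere in this guide), and the group name, alert name, and 10-minute window are illustrative choices. Because arrival rate is rate(count) and mean hold time is rate(sum) / rate(count), their product reduces to rate(sum): the connection-seconds consumed per second.

```yaml
groups:
  - name: connection-pool-headroom   # hypothetical group name
    rules:
      - alert: PoolHeadroomLow       # hypothetical alert name
        # Little's Law: required = arrival rate x mean hold time
        #             = rate(count) x (rate(sum) / rate(count))
        #             = rate(sum)
        # Illustrative numbers: 400 acquisitions/s x 0.045 s hold
        # = 18 connections, which is 90% of a 20-connection pool.
        expr: |
          rate(hikaricp_connections_usage_seconds_sum[5m])
            / hikaricp_connections_max > 0.85
        for: 10m                     # assumed window; tune to your traffic
        labels:
          severity: warning
        annotations:
          summary: "Pool {{ $labels.pool }} load needs more than 85% of maximumPoolSize"
```

The division relies on both series carrying identical labels. The usage timer records a sample only when a connection is returned, so very long holds show up late; treat this as a trend signal rather than an instantaneous one.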

Exact PromQL and Alertmanager Rules

The following Prometheus rules implement the three conditions. The for: clause enforces the evaluation window so transient spikes are ignored.

groups:
  - name: connection-pool-saturation
    rules:
      - alert: PoolUtilizationHigh
        expr: |
          hikaricp_connections_active / hikaricp_connections_max > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pool {{ $labels.pool }} above 85% utilization for 5m"
          description: "Headroom exhausted; queueing imminent. Utilization={{ $value | humanizePercentage }}."

      - alert: PoolPendingAcquisitions
        expr: hikaricp_connections_pending > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pool {{ $labels.pool }} has waiting acquisitions for 2m"

      - alert: PoolAcquireWaitNearTimeout
        expr: |
          histogram_quantile(0.99,
            sum(rate(hikaricp_connections_acquire_seconds_bucket[5m])) by (le, pool))
            > 0.7 * 30
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Pool {{ $labels.pool }} p99 acquire wait within 30% of connectionTimeout"

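The 0.7 * 30 in the last rule encodes 70% of a 30-second connectionTimeout; adjust the constant to match your configured timeout in seconds. The rule also assumes the acquire timer is exported with histogram buckets, which Micrometer publishes only when percentile histograms are enabled for that meter; without them the _bucket series does not exist and the rule never fires. For Go, swap in the database/sql collector metrics: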

      - alert: GoPoolSaturated
        expr: |
          (go_sql_in_use_connections / go_sql_max_open_connections > 0.85)
          and
          (rate(go_sql_wait_count_total[2m]) > 0)
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Go pool {{ $labels.db_name }} saturated: utilization high and waits rising"

For PgBouncer, alert on the exporter’s cl_waiting and maxwait:

      - alert: PgBouncerPoolWaiting
        expr: |
          pgbouncer_pools_client_waiting_connections > 0
          and
          pgbouncer_pools_client_maxwait_seconds > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PgBouncer pool {{ $labels.database }} clients waiting (maxwait>1s)"

Setting up that exporter is covered in Setting Up the PgBouncer Prometheus Exporter, and the broader collection pipeline in Prometheus and Grafana Pool Metrics.

Multi-Window Burn-Rate Alerting

Single-window alerts force a trade-off: a short window is fast but flaps; a long window is stable but slow. The multi-window burn-rate pattern, borrowed from SLO alerting, resolves it by requiring the condition to hold in both a fast and a slow window before paging. This catches sharp saturation quickly while suppressing brief spikes that clear within the slow window.

      - alert: PoolSaturationBurnRate
        expr: |
          (
            avg_over_time((hikaricp_connections_pending > bool 0)[5m:]) > 0.5
            and
            avg_over_time((hikaricp_connections_pending > bool 0)[1h:]) > 0.1
          )
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Pool {{ $labels.pool }} sustained saturation (multi-window burn rate)"

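The fast window (5m) catches an acute event; the slow window (1h) confirms it is not an isolated blip. The > bool 0 converts the pending gauge into a 1/0 series, and avg_over_time measures the fraction of time the pool was queueing. Requiring more than 50% of the last 5 minutes AND more than 10% of the last hour means a single 2-minute spike will not page: it fills only 40% of the fast window and about 3% of the slow one. The slow window also sets the floor on detection time. After a clean hour, continuous queueing crosses the fast threshold after 2.5 minutes but needs 6 minutes to reach 10% of the hour, so this rule confirms sustained saturation rather than racing the two-minute pending rule.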
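If evaluating two subqueries on every rule cycle proves expensive, the same idea can be built on a recording rule. This is a sketch of that variant, not a replacement for the rule above: the recorded series name and the alert name are hypothetical, and it assumes the recording rule has been running for at least an hour, since avg_over_time only averages the samples that exist.

```yaml
groups:
  - name: connection-pool-saturation-recorded   # hypothetical group name
    rules:
      # 1 while any acquisition is waiting, 0 otherwise.
      - record: pool:hikaricp_queueing:bool
        expr: hikaricp_connections_pending > bool 0

      - alert: PoolSaturationBurnRateRecorded   # hypothetical alert name
        expr: |
          avg_over_time(pool:hikaricp_queueing:bool[5m]) > 0.5
          and
          avg_over_time(pool:hikaricp_queueing:bool[1h]) > 0.1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Pool {{ $labels.pool }} sustained saturation (multi-window, recorded series)"
```

Rules inside one group are evaluated in order, so the recorded series is written before the alert expression reads it.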

Avoiding False Positives From Short Bursts

Bursty workloads legitimately drive a pool to 100% utilization for seconds at a time without any degradation — those are not incidents. Three techniques keep them out of the incident channel. First, the for: clause: never alert on an instantaneous breach; require the condition to persist across the evaluation window. Second, prefer pending/wait signals over raw utilization, because utilization can hit 1.0 harmlessly while sustained pending cannot. Third, use the multi-window pattern for the page-worthy critical alert and reserve the simple single-window utilization rule for a low-severity warning. The result is a two-tier policy: warnings inform, criticals page.
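The two-tier policy can be enforced in Alertmanager routing. The fragment below is a minimal sketch: both receiver names are hypothetical, their integrations (chat webhook, paging service) are deliberately left out, and the inhibit rule assumes warnings and criticals for the same pool share a pool label.

```yaml
route:
  receiver: pool-warnings-chat        # hypothetical default receiver: informs, never pages
  group_by: [alertname, pool]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pool-oncall-pager     # hypothetical receiver: pages the on-call engineer

receivers:
  # Notification integrations are omitted; add the ones your team uses.
  - name: pool-warnings-chat
  - name: pool-oncall-pager

inhibit_rules:
  # While a critical alert fires for a pool, suppress that pool's warnings.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: [pool]
```

Inhibiting the warning while a critical is active keeps the utilization rule from adding noise to an incident that is already being handled.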

Validation and Verification

Before trusting an alert, prove it fires under real saturation and stays quiet under bursts. Load-test the pool to deliberate saturation and confirm the alert transitions to firing within its window. In Prometheus, run the expression in the expression browser against a historical saturation window, and check that the breach outlasted the for: window by counting the matching evaluation steps with count_over_time over a subquery of the expression. Confirm the live signals match what the rule reads:

-- PgBouncer admin console: confirm the waiting/maxwait the exporter reports
SHOW POOLS;

# confirm the alert expression evaluates true during a known saturation window
hikaricp_connections_active / hikaricp_connections_max > 0.85

A useful regression test: replay a known healthy burst through the rule and assert it does NOT fire. An alert that cannot stay silent during normal load will be muted within days, which is the same as having no alert at all.
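The burst regression test can be written as a promtool unit test. The sketch below assumes the PoolPendingAcquisitions rule from this guide is saved in a file named pool_alerts.yml (a hypothetical filename) and uses synthetic one-minute samples; a real burst lasting seconds is only approximated at that resolution.

```yaml
rule_files:
  - pool_alerts.yml          # hypothetical file holding the rules from this guide

evaluation_interval: 1m

tests:
  # Healthy burst: pending is non-zero for a single sample, then drains.
  - interval: 1m
    input_series:
      - series: 'hikaricp_connections_pending{pool="HikariPool-1"}'
        values: '0 0 5 0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: PoolPendingAcquisitions
        exp_alerts: []       # must stay silent
      - eval_time: 5m
        alertname: PoolPendingAcquisitions
        exp_alerts: []

  # Sustained queueing: pending stays at 5 from minute 1 onward.
  - interval: 1m
    input_series:
      - series: 'hikaricp_connections_pending{pool="HikariPool-1"}'
        values: '0 5x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: PoolPendingAcquisitions
        exp_alerts:
          - exp_labels:
              severity: critical
              pool: HikariPool-1
            exp_annotations:
              summary: "Pool HikariPool-1 has waiting acquisitions for 2m"
```

Run it with promtool test rules followed by the test file's name. Keeping the silent case alongside the firing case means a later threshold change that reintroduces flapping fails the test before it reaches on-call.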

Frequently Asked Questions

Why alert on utilization at 0.85 instead of 1.0?
Because 1.0 is too late. The saturation knee — where wait time turns non-linear — sits near 0.85 for typical workloads. Alerting there gives roughly a five-minute lead to add capacity or shed load before requests start timing out. An alert at 1.0 fires concurrently with the damage.
Should pending-acquisition alerts use a longer window to avoid noise?
No — keep it short (around two minutes) but rely on the for: clause and the multi-window pattern rather than raising the threshold. Pending greater than zero is inherently more meaningful than high utilization, so it warrants a faster window; the windowing, not a higher threshold, is what suppresses the noise.
How do I keep bursty traffic from constantly tripping the alert?
Alert on sustained pending or rising wait percentiles rather than raw utilization, and always require an evaluation window with for:. The multi-window burn-rate rule is the strongest defense: it demands the condition hold across both a fast and a slow window, so brief bursts that clear within the hour window never page.
What value should I use for the wait-percentile threshold?
Express it as a fraction of connectionTimeout, commonly 70%. If connectionTimeout is 30 seconds, alert when p99 acquire wait exceeds 21 seconds. Tying the threshold to the timeout keeps the two in step: when you retune the timeout, update the constant in the rule to 0.7 times the new value so the alert does not go stale.
Can I alert before pending is ever non-zero?
Yes, with a Little’s Law headroom rule. Compute required connections as arrival rate times mean hold time and alert when it crosses 85% of maximumPoolSize. This predicts saturation from load and fires before the queue forms, making it the earliest signal in the strategy.