Setting Up the PgBouncer Prometheus Exporter

This guide is part of PgBouncer Metrics Monitoring. The problem it solves is concrete: PgBouncer’s admin console gives you a live snapshot through SHOW POOLS and SHOW STATS, but Prometheus and Grafana see nothing. You cannot graph last night’s cl_waiting spike, you cannot alert on maxwait, and the only way to know the pool saturated is to be logged into psql at the exact moment it happened. The fix is a sidecar exporter that authenticates to the console, runs the SHOW commands on every scrape, and republishes the rows as pgbouncer_* metrics on an HTTP endpoint Prometheus can poll.

The symptom that drives most teams here is a vague one: intermittent application-side acquisition timeouts with no PostgreSQL slow-query evidence. Without the exporter, the PgBouncer wait queue is a blind spot in that investigation. With it, the queue becomes a graphable, alertable series and the timeout becomes obvious.

Rapid incident diagnosis

If you are reading this mid-incident and the exporter is not yet deployed, fall back to the console directly and confirm where the wait is before spending time on tooling:

psql -At -h 127.0.0.1 -p 6432 -U pgbstats pgbouncer -c "SHOW POOLS;" \
  | awk -F'|' '$6>0 {print "pool",$1,$2,"cl_waiting="$6,"maxwait="$11}'

Any output means clients are queued right now. If that prints rows, the exporter would have alerted you minutes earlier — which is the reason to finish this setup once the fire is out. The column positions in SHOW POOLS are stable across recent PgBouncer versions, but verify against a plain SHOW POOLS; if your build differs.

Step 1 — create a stats user and grant console access

The exporter must authenticate to the admin console as a role listed in stats_users. Use a dedicated read-only identity so a compromised scraper can never issue PAUSE, RELOAD, or KILL. Add the role to pgbouncer.ini and to the auth file:

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
stats_users = pgbstats
; longer averaging window steadies the avg_* columns
stats_period = 60

The stats_users grant lets the role run every SHOW command and read the console, but not mutate state. Reload PgBouncer after editing so the new role takes effect:

psql -h 127.0.0.1 -p 6432 -U pgbadmin pgbouncer -c "RELOAD;"

Step 2 — deploy the exporter

Deploy the community prometheus-community/pgbouncer_exporter. It takes a libpq DSN pointing at the console database and exposes metrics on :9127/metrics. Pass the DSN as an environment variable so the password never lands in the process list:

# docker-compose.yml fragment
services:
  pgbouncer-exporter:
    image: prometheuscommunity/pgbouncer-exporter:latest
    command:
      - "--web.listen-address=:9127"
    environment:
      PGBOUNCER_EXPORTER_HOST: "127.0.0.1"
      PGBOUNCER_EXPORTER_PORT: "6432"
      DATA_SOURCE_NAME: "postgres://pgbstats:${PGBSTATS_PASSWORD}@127.0.0.1:6432/pgbouncer?sslmode=disable"
    network_mode: host

The DSN must name pgbouncer as the database — that is the virtual admin database, not one of your real backends. sslmode=disable is appropriate only when the exporter and PgBouncer share a host or a trusted network segment; use require otherwise. On each scrape the exporter issues SHOW POOLS, SHOW STATS, SHOW LISTS, and SHOW DATABASES and maps every row to a labelled metric.

If you run PgBouncer under systemd rather than containers, the same binary runs as a unit. Pin the DSN into an environment file readable only by the exporter user so the password stays off the command line and out of ps:

# /etc/default/pgbouncer-exporter (chmod 600)
DATA_SOURCE_NAME=postgres://pgbstats:CHANGEME@127.0.0.1:6432/pgbouncer?sslmode=disable
# /etc/systemd/system/pgbouncer-exporter.service
[Service]
EnvironmentFile=/etc/default/pgbouncer-exporter
ExecStart=/usr/local/bin/pgbouncer_exporter --web.listen-address=:9127
User=pgbexport
Restart=on-failure

[Install]
WantedBy=multi-user.target

Run one exporter per PgBouncer process. The exporter holds a single console connection per scrape, so it adds one entry to SHOW CLIENTS and a negligible load to the event loop.

Step 3 — add the Prometheus scrape job

Point Prometheus at the exporter endpoint. A 15-second interval matches a 60-second stats_period well; scraping faster than stats_period just resamples the same averaged values.

scrape_configs:
  - job_name: pgbouncer
    scrape_interval: 15s
    static_configs:
      - targets: ["10.0.1.20:9127"]
        labels:
          service: payments-db
          environment: production

After a reload, confirm ingestion by querying pgbouncer_up in the Prometheus expression browser; a value of 1 means the exporter reached the console.

The per-pool labels are what make the metrics usable across a fleet. Every series carries database and user, so a single payments service running both a writer pool (app_rw) and a read-replica pool (app_ro) produces distinct series you can graph and alert on independently. Avoid collapsing them with a blanket sum() in dashboards — a saturated writer pool hidden inside an aggregate that also counts a calm reader pool is exactly the blind spot the exporter exists to remove. Keep the service and environment labels on the scrape job so alerts route to the right team without a relabel rule.

Key exported series

Each console column becomes a metric labelled by database and user, so you can sum or filter per pool.

Metric Source column What it tells you
pgbouncer_pools_client_waiting_connections cl_waiting Clients queued for a server connection — the saturation signal.
pgbouncer_pools_client_active_connections cl_active Clients currently bound to a server connection.
pgbouncer_pools_server_active_connections sv_active Server connections lent out; compare to default_pool_size.
pgbouncer_pools_server_idle_connections sv_idle Spare backend capacity.
pgbouncer_pools_client_maxwait_seconds maxwait (+maxwait_us) Longest current client wait, in seconds. Alert on this.
pgbouncer_stats_queries_duration_seconds_total derived from avg_query_time Backend query time; rising means the database is the bottleneck.
pgbouncer_databases_current_connections current_connections Per-database backend usage vs max_connections.

Mathematical sizing / utilization formula

To turn sv_active into a saturation ratio, divide it by the configured pool size. Define pool utilization as:

utilization = pgbouncer_pools_server_active_connections / default_pool_size

A pool sustaining utilization near 1.0 with any cl_waiting > 0 is undersized. Worked example: a payments pool with default_pool_size = 50 averaging sv_active = 49 and cl_waiting = 7 at maxwait = 2s is fully lent and queuing. If avg_query_time is flat at, say, 4 ms, the backend is fast and the remedy is more server connections — raise default_pool_size to roughly ceil(50 * (50 + 7) / 50) = 57, then round to 60 and confirm cl_waiting returns to 0. If instead avg_query_time had jumped to 40 ms, the same cl_waiting reflects slow queries holding connections ten times longer, and resizing only pushes more concurrency onto a strained database.

Exact remediation & alert rules

Express alerts on the waiting and maxwait series so transient micro-bursts do not page. The for clause is what separates a real saturation event from a one-scrape blip.

groups:
  - name: pgbouncer
    rules:
      - alert: PgBouncerClientsWaiting
        expr: sum by (database) (pgbouncer_pools_client_waiting_connections) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PgBouncer pool {{ $labels.database }} has waiting clients"
          description: "Clients have been queued for >2m; pool may be saturated."

      - alert: PgBouncerMaxwaitHigh
        expr: max by (database) (pgbouncer_pools_client_maxwait_seconds) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PgBouncer {{ $labels.database }} maxwait above 5s"

      - alert: PgBouncerExporterDown
        expr: pgbouncer_up == 0
        for: 5m
        labels:
          severity: warning

Apply with a zero-downtime reload — Prometheus rule changes never interrupt PgBouncer or the database:

curl -X POST http://localhost:9090/-/reload

For routing these alerts and the broader saturation strategy, see Alerting on Connection Pool Saturation.

Validation & verification

Confirm the pipeline end to end. First check the exporter is scraping the console:

curl -s http://10.0.1.20:9127/metrics | grep pgbouncer_pools_client_waiting

Then induce a controlled wait to prove the series moves: drop default_pool_size to a small value, run a short load test, and watch pgbouncer_pools_client_waiting_connections climb in Prometheus while SHOW POOLS; shows the same cl_waiting in the console. The two must agree. Reading the raw columns those metrics come from is covered in PgBouncer Metrics Monitoring.

A PromQL sanity check confirms the saturation expression you will alert on actually fires under that synthetic load:

# pools where backend connections are fully lent AND clients are queued
(pgbouncer_pools_server_idle_connections == 0)
  and (pgbouncer_pools_client_waiting_connections > 0)

During the load test this should return the test pool; once you restore default_pool_size and the queue drains, it should return nothing. If the expression returns a pool when the console shows sv_idle > 0, your scrape interval is too coarse and is aliasing across queue cycles — shorten scrape_interval or lengthen stats_period so samples and averaging line up. Cross-check the application side too: if the framework pool such as the one in HikariCP Configuration Deep Dive reports pending threads while PgBouncer shows no waiting clients, the bottleneck is upstream of the proxy.

Frequently Asked Questions

Why does the exporter connect to a database called pgbouncer?
That is PgBouncer’s virtual admin console, not one of your application databases. The SHOW commands only exist there. The DSN must end in /pgbouncer, and the connecting role must be in stats_users or admin_users.
Do I need admin_users or is stats_users enough?
stats_users is sufficient and safer. It permits every SHOW command the exporter runs but cannot issue PAUSE, RESUME, RELOAD, or KILL. Reserve admin_users for human operators.
The avg_query_time metric looks jumpy. Why?
The avg_* columns are averaged over stats_period (60 s in these examples) and reset each period. If you scrape much faster than stats_period you resample the same window; if much slower you miss detail. Align scrape_interval to roughly a quarter of stats_period.
Can one exporter cover multiple PgBouncer instances?
Run one exporter per PgBouncer process and give each its own scrape target with distinct labels. A single exporter takes a single DSN, so multiplexing instances through one exporter mixes their pools and loses per-instance labels.
Should I alert on cl_waiting or maxwait?
Alert on maxwait for paging because it encodes how long clients have actually been blocked, and treat cl_waiting > 0 for several minutes as a warning. A brief cl_waiting spike with sub-second maxwait is normal burst behavior, not an incident.