Detecting ORM Connection Leaks in Production
This guide is part of ORM Connection Lifecycle Hooks. A connection leak is a checkout that never returns: the ORM borrows a connection from the pool, the request finishes, but the connection is never closed or its transaction never committed, so the pool’s usable capacity shrinks by one. Leaks rarely fail loudly at first. The pool absorbs them until it runs dry, and the first visible symptom is a flood of acquisition timeouts hours into a deploy: HikariPool-1 - Connection is not available, request timed out after 30000ms or, on the Python side, QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00. The pool is not undersized — it is being drained one orphaned connection at a time.
This page covers the symptoms that distinguish a leak from genuine saturation, leak detection across HikariCP, SQLAlchemy, and Go’s database/sql, correlation with pg_stat_activity idle-in-transaction rows, and the code-level fixes for missing close or commit.
Key operational takeaways:
- A leak shows as monotonically rising in-use connections that never return to baseline during idle periods; saturation rises and falls with traffic.
leakDetectionThreshold(HikariCP),echo_poolplus pool events (SQLAlchemy), andDBStats.InUseover time (Go) each expose the leaking call site.idle in transactionrows inpg_stat_activitythat persist for minutes are the server-side fingerprint of a missing commit or rollback.- The fix is almost always a missing
close()/commit()on an error path, not a larger pool — enlarging the pool only delays exhaustion.
Rapid incident diagnosis
The decisive question is whether in-use connections fall back to baseline when traffic stops. Plot active connections over a low-traffic window. Genuine saturation tracks load: it climbs during peaks and recedes overnight. A leak does not recede — the in-use count ratchets upward and stays there, because the leaked connections are checked out by threads that have long since finished. For the broader saturation-versus-leak signal taxonomy, see Detecting Connection Pool Saturation.
Check the timeout error first. Both stacks log a clear exhaustion line:
HikariPool-1 - Connection is not available, request timed out after 30000ms
(total=10, active=10, idle=0, waiting=8)
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30.00
When active equals total and idle is 0 even though application throughput is near zero, you have a leak, not a load spike. Confirm from the database. Long-lived idle in transaction sessions are the canonical fingerprint — a connection was handed a transaction, the work finished, and neither commit nor rollback was issued:
SELECT pid, state, application_name,
now() - state_change AS idle_for,
left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
ORDER BY idle_for DESC;
Rows with an idle_for of minutes or hours point straight at the leaking transaction. The last_query column usually names the statement that ran just before the orphaned connection was abandoned, which narrows the search to a specific code path.
Locating the leak by stack
Each runtime has a built-in mechanism that turns “a connection leaked somewhere” into a stack trace or a call site.
HikariCP. Set leakDetectionThreshold to a value just above your slowest legitimate query (for example 20000 ms). When a connection stays checked out longer than the threshold, HikariCP logs the full stack trace of the thread that borrowed it:
Connection leak detection triggered for org.postgresql.jdbc.PgConnection@4a2e...,
stack trace follows
at com.example.ReportDao.runExport(ReportDao.java:88)
...
That stack trace is the leaking call site. Leave detection on only long enough to capture the trace, then disable it — it adds tracking overhead on every checkout. (The same leakDetectionThreshold mechanism is detailed in the Spring Boot tuning guides.)
SQLAlchemy. Enable echo_pool="debug" on the engine to log every checkout and checkin, or attach pool event listeners that record the stack at checkout and warn if a connection is garbage-collected while still checked out:
from sqlalchemy import event
import traceback, logging
log = logging.getLogger("pool.leak")
@event.listens_for(engine, "checkout")
def _on_checkout(dbapi_conn, conn_record, conn_proxy):
conn_record.info["checkout_stack"] = traceback.format_stack()
@event.listens_for(engine, "checkin")
def _on_checkin(dbapi_conn, conn_record):
conn_record.info.pop("checkout_stack", None)
Connections whose checkout_stack is never cleared identify sessions opened and never returned. The reset_on_return behavior also matters: if a session leaves an open transaction, SQLAlchemy rolls it back on checkin — but only if the connection is actually returned. A leaked session never reaches checkin, so the rollback never happens and the row shows as idle in transaction on the server.
Go database/sql. Go has no per-checkout stack hook, so detect leaks by sampling DB.Stats() over time. InUse that grows without bound while Idle stays at zero is the signature of rows.Close() or tx.Commit()/tx.Rollback() never being called:
s := db.Stats()
log.Printf("open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
A steadily climbing WaitCount and WaitDuration alongside InUse pinned at MaxOpenConns confirms exhaustion driven by leaks. In Go the usual culprit is iterating sql.Rows and returning early on error without defer rows.Close(), which keeps the underlying connection checked out until the garbage collector eventually finalizes it — unreliably and far too late.
Sizing the detection threshold
leakDetectionThreshold and any “warn if checked out longer than” timer must sit above your slowest legitimate connection hold time, or you will drown in false positives from long-but-valid work. Compute the floor from your real query latency distribution.
Let p999_hold be the 99.9th-percentile time a connection is legitimately held — including transaction span, not just one query. Set the threshold to roughly 2 * p999_hold, rounded to a clean value.
Worked example. A service’s slowest legitimate transaction is a batch export holding a connection for 9 seconds at p999. Doubling gives 18 seconds; round to leakDetectionThreshold = 20000 ms. Anything held longer than 20 s is almost certainly leaked, because no honest path exceeds it. If the batch export were itself the leak suspect, you would instead move it to its own datasource with a separate pool so the threshold can be tightened on the interactive pool without flagging the batch work.
| Workload | p999 hold time | Suggested threshold |
|---|---|---|
| Interactive API (single query per request) | ~200 ms | 2000 ms |
| Read-heavy with short transactions | ~1 s | 5000 ms |
| Mixed with occasional batch | ~9 s | 20000 ms |
| Long ETL / report jobs | minutes | isolate in a separate pool |
Exact remediation and configuration
Every leak fix reduces to guaranteeing the connection is released on every path, including exceptions. Use the language’s scope-bound cleanup construct rather than a manual close in a single finally.
In Python, never hold a session past the request; bind it to a context manager so an exception still triggers checkin and rollback:
# Leak: session never closed if process_rows raises
session = Session()
rows = session.execute(select(Order)).all()
process_rows(rows) # raises -> session leaks, idle in transaction
# Fixed: context manager guarantees rollback + checkin on any exit
with Session() as session:
rows = session.execute(select(Order)).all()
process_rows(rows)
In Go, defer the close immediately after a successful open, before any error-prone work:
// Leak: early return skips rows.Close(), connection stays checked out
rows, err := db.Query("SELECT id FROM orders WHERE status = $1", "open")
if err != nil { return err }
for rows.Next() {
if err := scan(rows); err != nil { return err } // leaks the connection
}
// Fixed: defer runs on every return path
rows, err := db.Query("SELECT id FROM orders WHERE status = $1", "open")
if err != nil { return err }
defer rows.Close()
for rows.Next() {
if err := scan(rows); err != nil { return err }
}
For transactions, the same discipline applies to commit/rollback. A transaction that is begun but never resolved is the exact cause of idle in transaction rows. In Go, defer tx.Rollback() immediately after db.Begin() is safe — Rollback on an already-committed transaction is a no-op. In ORMs that own a per-request session, ensure the framework’s request-scoped teardown actually fires; a common production leak is a background or Celery task that escapes the request lifecycle and so never triggers the web framework’s automatic session close.
Apply the fix and deploy; do not raise the pool size as the remediation. A larger pool only changes how many hours pass before exhaustion returns. If you must buy time during an active incident, set a short idle_in_transaction_session_timeout on the server so PostgreSQL itself reaps orphaned transactions:
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';
This is a containment measure, not a fix — it converts a silent leak into a visible FATAL: terminating connection due to idle-in-transaction timeout, which is preferable to a drained pool but still means a code path is broken.
Validation and verification
After deploying the fix, prove the in-use count returns to baseline during an idle window. Capture pool stats before and after a quiet period; the in-use figure must fall back to minimum-idle (or Idle count in Go), not stay elevated.
Re-run the server-side leak query and confirm no long-lived idle in transaction rows remain:
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '30 seconds';
A result of 0 after the fix, sustained across a full traffic cycle, confirms the leak is closed. Keep leakDetectionThreshold or the SQLAlchemy checkout listener active in staging during a soak test that drives the previously-leaking path under error injection — force the exception that used to skip cleanup and confirm no stack trace fires and no idle in transaction row appears.
Frequently Asked Questions
How do I tell a connection leak apart from normal pool saturation?
What does idle in transaction mean and why does it matter for leaks?
Is leakDetectionThreshold safe to leave on in production?
Why does my Go service leak connections even though I close the rows?
Will raising the pool size fix a leak?
Related
- ORM Connection Lifecycle Hooks — the parent topic on checkout, checkin, and reset events across ORMs.
- Preventing Django Connection Leaks During Celery Tasks — the background-task variant of this leak and its request-lifecycle fix.
- Detecting Connection Pool Saturation — distinguishing leak-driven exhaustion from load-driven saturation in metrics.