SUMMARY:
A comprehensive health check of a MongoDB replica set revealed that severe performance degradation was caused by a stalled TTL (Time-To-Live) Monitor on an undersized cluster, which was successfully resolved by scaling compute instances and manually purging expired data.
- Analysis of system logs identified the core issue as a stalled TTLMonitor process that had failed to delete expired documents for months, causing the collection size to balloon out of control.
- The root cause was an undersized cluster, with EC2 instances equipped with only 2 CPUs and 15GB of RAM, overwhelmed by a massive backlog of 900 million documents.
- To restore system stability, the team vertically scaled the compute instances to provide the necessary CPU and RAM resources to handle the heavy write workload.
- Engineers executed a manual cleanup operation using a JavaScript script to delete the backlog of expired documents in controlled batches, allowing the TTL monitor to eventually resume normal operation.
This case highlights the necessity of correctly sizing database infrastructure to match data volume and the importance of regular health checks to detect silent failures, such as stalled background processes.
Recently, I was asked to help a client who was experiencing slow performance on their 3-node MongoDB 4.4.10 replica set. The instance, which consisted of a primary and two secondary replicas, had been running smoothly until this year, when users began noticing a significant slowdown that worsened each month. My approach was to perform a comprehensive health check, starting with the operating system and working down to the MongoDB instance.
The Health Check: A Holistic Approach
My first step was to examine the operating system. I checked basic Linux resource utilization using commands such as top and vmstat, and reviewed OS-level limits to ensure there were no bottlenecks. I then moved on to the MongoDB configuration, checked the serverStatus and currentOp outputs, and analyzed the logs for any slow queries.
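In the mongo shell, long-running operations can be spotted in the `db.currentOp()` output by filtering on `secs_running`. The helper below sketches that check against a mocked `currentOp`-shaped document so it runs standalone; the field names (`inprog`, `secs_running`, `op`, `ns`) match MongoDB's real output, but the sample data and 30-second threshold are illustrative.

```javascript
// Sketch: flag long-running operations from a currentOp-style document.
// Against a live instance you would pass db.currentOp() from the mongo
// shell instead of the mocked sample below.
function findSlowOps(currentOp, thresholdSecs) {
  return (currentOp.inprog || []).filter(
    (op) => (op.secs_running || 0) >= thresholdSecs
  );
}

// Illustrative sample mimicking currentOp output.
const sample = {
  inprog: [
    { op: "insert",  ns: "app.events", secs_running: 0 },
    { op: "command", ns: "app.events", secs_running: 847 }, // aggregation
  ],
};

const slow = findSlowOps(sample, 30);
console.log(slow.map((o) => `${o.op} on ${o.ns}: ${o.secs_running}s`));
```

Anything that has been running for minutes on an insert-heavy workload deserves a closer look, which is exactly what pointed me toward the deeper problem here.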
I found that while most of the workload consisted of inserts and aggregation queries—which are typically very efficient—there were also many long-running operations. This pointed toward a deeper, underlying issue. My investigation led me to the dmesg logs, where I found a series of “page allocation” errors, a strong indicator of memory pressure on the server.
The Culprit: A Stalled TTL Monitor
As I continued to drill down, the real culprit became clear: the TTLMonitor process was stalled. The TTLMonitor is a background thread in MongoDB that automatically deletes expired documents from collections with a TTL (Time-To-Live) index. All of my client’s collections used TTL indexes that expired documents after 21 days.
However, the TTL monitor's pass counter (`metrics.ttl.passes` in the `serverStatus` output) had not incremented in months, indicating the process had completely stopped. This meant that expired documents were not being deleted, causing collections to grow far beyond their intended size. The stalled TTL monitor was also consuming high CPU, likely in a futile attempt to scan and delete documents from an ever-growing dataset.
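The stall can be confirmed by sampling `db.serverStatus().metrics.ttl` twice, a few minutes apart: a healthy TTL monitor runs a pass roughly every 60 seconds, so the pass counter should grow between samples. The snippet below sketches that comparison with mocked sample values; the field names (`passes`, `deletedDocuments`) are real `serverStatus` metrics, but the numbers are illustrative.

```javascript
// Sketch: detect a stalled TTL monitor by comparing two samples of
// db.serverStatus().metrics.ttl taken several minutes apart.
function ttlMonitorStalled(earlier, later) {
  // If the pass counter has not advanced between samples, the TTL
  // background thread is not completing deletion passes.
  return later.passes <= earlier.passes;
}

// Illustrative samples; on a live system these would come from
// db.serverStatus().metrics.ttl in the mongo shell.
const sampleA = { passes: 152340, deletedDocuments: 901422850 };
const sampleB = { passes: 152340, deletedDocuments: 901422850 }; // unchanged
console.log(ttlMonitorStalled(sampleA, sampleB)); // true -> stalled
```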
The Root Cause: An Undersized Cluster
The page allocation errors in the logs and the high CPU usage, averaging 180%, pointed to a fundamental problem: the EC2 instances were undersized for the client’s workload. The servers were configured with only 2 CPUs and 15GB of RAM, which was completely insufficient for a database with 900,000,000 documents (in a single collection alone) and a heavy write load. The sheer volume of data and the ongoing workload were overwhelming the available server resources.
The TTL monitor was the first domino to fall. When it failed, the collection size ballooned, which in turn made other queries and operations even slower due to increased data size, creating a vicious cycle of performance degradation.
The Solution: Right-Sizing and Manual Cleanup
Based on this diagnosis, I had two key recommendations:
- Increase Compute Instance Size: The primary step was to scale up the EC2 instance type to provide sufficient CPU and RAM for the current workload. This would prevent the system from getting into a resource-starved state again and allow the TTL monitor to catch up and function correctly.
- Manual Cleanup of Expired Documents: Because the TTL monitor had been stalled for months, a massive backlog of expired documents had built up. Attempting to restart the TTL monitor would likely cause it to stall again due to the sheer number of documents to delete. My approach was to run a JavaScript script in the MongoDB shell to manually delete these expired documents in small, controlled batches. This method allowed us to safely clear the backlog without overwhelming the primary server’s resources.
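The batched-cleanup idea can be sketched as follows. This version runs against an in-memory stand-in for a collection so it is self-contained; in the mongo shell you would instead select a batch of `_id`s with `find({ createdAt: { $lt: cutoff } }, { _id: 1 }).limit(batchSize)`, pass them to `deleteMany({ _id: { $in: ids } })`, and sleep between batches. The collection shape, field names, and batch size here are illustrative, not the client's actual script.

```javascript
// Sketch: purge expired documents in small, controlled batches so no
// single delete monopolizes the primary's resources.
function purgeExpired(docs, cutoff, batchSize) {
  let deleted = 0;
  for (;;) {
    // Select one batch of expired document ids (stand-in for
    // find(...).limit(batchSize) in the mongo shell).
    const batch = docs
      .filter((d) => d.createdAt < cutoff)
      .slice(0, batchSize)
      .map((d) => d._id);
    if (batch.length === 0) break;

    // Delete that batch (stand-in for deleteMany({_id: {$in: batch}})).
    const ids = new Set(batch);
    for (let i = docs.length - 1; i >= 0; i--) {
      if (ids.has(docs[i]._id)) docs.splice(i, 1);
    }
    deleted += batch.length;
    // In a real run, sleep here between batches so replication and the
    // normal workload can keep up.
  }
  return deleted;
}

// Illustrative data: 17 expired documents and 8 fresh ones.
const now = 1000;
const docs = [];
for (let i = 0; i < 25; i++) {
  docs.push({ _id: i, createdAt: i < 17 ? 0 : now });
}
console.log(purgeExpired(docs, now, 5)); // 17 -> removed in batches of 5
```

Keeping the batch size small and pausing between batches is what lets the cleanup run on an already resource-starved primary without making the situation worse.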
By following this two-part strategy, we addressed both the immediate problem and the underlying cause, putting the client’s MongoDB instance back on a healthy path.
If your MongoDB system isn’t performing as well as it once did, it is time for a comprehensive health check. Please contact the XTIVIA Virtual DBA team to discuss how we can help.