Memory Full
On Sunday morning, I got a lot of messages that production was down. Users were not able to log in. Naturally, I went to Sentry to check if any alerts had been raised, and there seemed to be a connection issue with the Postgres database. I went to the DB dashboard, and it flagged that the RDS memory had been exhausted. It had 1TB provisioned, and all of it was used up. To resolve the problem immediately, I provisioned more memory.
After a few weeks, I finally looked into the issue yesterday. Where was 1TB of memory going? We have only around 10k users, and that shouldn't take so much. I started debugging the DB stats.
from django.db import connection

def check_db_size():
    """Print the on-disk size of every user schema, in GB."""
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                schemaname AS schema_name,
                ROUND(SUM(pg_total_relation_size(schemaname||'.'||tablename)) / 1024 / 1024 / 1024::numeric, 2) AS size_gb
            FROM pg_tables
            WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
            GROUP BY schemaname
            ORDER BY size_gb DESC
        """)
        results = cursor.fetchall()
        for row in results:
            print(f"Schema: {row[0]}, Size: {row[1]} GB")
The DB size seemed quite normal; it came to under 1GB. That was interesting. Claude suggested checking the WAL size.
def check_wal_size():
    """Print how much disk the WAL directory is currently using."""
    with connection.cursor() as cursor:
        # Sum the sizes of the files sitting in pg_wal right now.
        cursor.execute("""
            SELECT
                pg_size_pretty(SUM(size)) AS wal_size,
                COUNT(*) AS wal_files
            FROM pg_ls_waldir()
        """)
        result = cursor.fetchone()
        print(f"WAL Size: {result[0]}, WAL files: {result[1]}")
To my surprise, it showed that around 258GB was being hogged by a single replication slot that was inactive. That pulled us in further, and we dug deeper into the WAL stats.
def deep_wal_investigation_rds():
    with connection.cursor() as cursor:
        # Check ALL replication slots
        cursor.execute("""
            SELECT
                slot_name,
                slot_type,
                active,
                xmin,
                restart_lsn,
                pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained
            FROM pg_replication_slots
            ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST
        """)
        print("ALL REPLICATION SLOTS:")
        for row in cursor.fetchall():
            print(f"  {row[0][:30]}... Active={row[2]}, Retained={row[5]}")

        # Check WAL retention settings
        cursor.execute("""
            SELECT
                name,
                setting,
                unit
            FROM pg_settings
            WHERE name IN (
                'wal_keep_segments',
                'wal_keep_size',
                'max_wal_size',
                'min_wal_size',
                'checkpoint_timeout',
                'checkpoint_completion_target',
                'archive_mode',
                'archive_timeout'
            )
            ORDER BY name
        """)
        print("\nWAL CONFIGURATION:")
        for row in cursor.fetchall():
            print(f"  {row[0]}: {row[1]} {row[2] if row[2] else ''}")

        # Check current WAL position
        cursor.execute("""
            SELECT
                pg_current_wal_lsn() as current_lsn,
                pg_walfile_name(pg_current_wal_lsn()) as current_wal_file
        """)
        wal_pos = cursor.fetchone()
        print("\nCURRENT WAL POSITION:")
        print(f"  LSN: {wal_pos[0]}")
        print(f"  File: {wal_pos[1]}")

        # Check oldest WAL required
        cursor.execute("""
            SELECT
                MIN(restart_lsn) as oldest_required_lsn,
                pg_walfile_name(MIN(restart_lsn)) as oldest_required_file
            FROM pg_replication_slots
            WHERE restart_lsn IS NOT NULL
        """)
        oldest = cursor.fetchone()
        if oldest[0]:
            print("\nOLDEST WAL REQUIRED BY SLOTS:")
            print(f"  LSN: {oldest[0]}")
            print(f"  File: {oldest[1]}")
From the investigation, the current WAL position was 38E/A801DF30, while the oldest position still required by a replication slot was 34E/8008F00. The upper half of an LSN counts 4GB chunks, so the slots were holding the cleanup point back by 0x38E - 0x34E = 0x40 = 64 of those chunks, i.e. roughly 64 × 4GB ≈ 256GB of WAL (plus a couple of GB from the lower halves), which lines up with the ~258GB of retained WAL we had already seen.
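To sanity-check that arithmetic, here is a small standalone helper (not part of the original debugging session, just a back-of-the-envelope check) that converts the two LSNs into absolute byte positions and takes the difference:

def lsn_to_bytes(lsn: str) -> int:
    # An LSN prints as two 32-bit hex halves, "high/low".
    # The absolute byte position is high * 2^32 + low.
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

# Values copied from the investigation output above.
current = lsn_to_bytes("38E/A801DF30")
oldest = lsn_to_bytes("34E/8008F00")
print(f"Retained WAL: {(current - oldest) / 1024 ** 3:.1f} GiB")

Running it prints roughly 258 GiB, the same number the retained column reported.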
For context, we have a CDC pipeline set up on this RDS instance using AWS DMS. At some point an old replication task had gone down, so the maintainer created another replication slot without cleaning up the older one. The inactive slot kept hogging memory by pinning WAL that could never be recycled. So while we actually used about a GB, we were paying for a TB due to this cleanup issue.
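The actual fix was to drop the stale slot so Postgres could recycle the retained WAL. Below is a rough sketch of how that could be done through the same Django connection; the slot name is a placeholder, and before dropping anything you should confirm on the DMS side that no task still consumes the slot, since the retained WAL cannot be replayed afterwards:

from django.db import connection

def drop_inactive_slot(slot_name):
    """Drop a replication slot that is no longer in use."""
    with connection.cursor() as cursor:
        # Refuse to touch a slot that is currently active.
        cursor.execute(
            "SELECT active FROM pg_replication_slots WHERE slot_name = %s",
            [slot_name],
        )
        row = cursor.fetchone()
        if row is None:
            print(f"Slot {slot_name} not found")
            return
        if row[0]:
            print(f"Slot {slot_name} is active; not dropping it")
            return
        cursor.execute("SELECT pg_drop_replication_slot(%s)", [slot_name])
        print(f"Dropped slot {slot_name}; retained WAL can now be recycled")

# Example (hypothetical slot name):
# drop_inactive_slot("old_dms_slot")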
I was able to debug the issue with Claude in about an hour. I keep thinking about how many days it would have taken me just to find the WAL stats on my own…