
Download of basebackup always stalls

Open · ilicmilan opened this issue on Jul 10, 2019 · 4 comments

Hello,

I'm facing an issue where I'm unable to download a basebackup using the pghoard_restore command; the download always stalls.

Restore command:

sudo -u postgres pghoard_restore get-basebackup --config pghoard.json --restore-to-master --overwrite --target-dir /var/lib/pgsql/9.5/data-new/

The appropriate backup is selected, but then nothing happens: ps auxf shows that pghoard_restore spawns 9 additional processes, yet download progress stays at 0%, and after three 2-minute stall timeouts the restore fails.

Command output:

Found 1 applicable basebackup 

Basebackup                                Backup size    Orig size  Start time          
----------------------------------------  -----------  -----------  --------------------
server-f-postgres-01/basebackup/2019-07-10_12-27_0.00000000.pghoard     13245 MB     35432 MB  2019-07-10T12:27:32Z
    metadata: {'compression-algorithm': 'snappy', 'format': 'pghoard-bb-v2', 'original-file-size': '81920', 'host': 'server-f-postgres-01', 'end-time': '2019-07-10 14:33:12.657815+02:00', 'end-wal-segment': '000000010000001A0000004A', 'pg-version': '90518', 'start-wal-segment': '000000010000001A00000048', 'total-size-plain': '37153730560', 'total-size-enc': '13888641735'}

Selecting 'server-f-postgres-01/basebackup/2019-07-10_12-27_0.00000000.pghoard' for restore
2019-07-10 15:20:34,941 BasebackupFetcher       MainThread      ERROR   Download stalled for 120.43377648199385 seconds, aborting downloaders
2019-07-10 15:22:35,674 BasebackupFetcher       MainThread      ERROR   Download stalled for 120.44614975301374 seconds, aborting downloaders
2019-07-10 15:24:36,392 BasebackupFetcher       MainThread      ERROR   Download stalled for 120.47685114300111 seconds, aborting downloaders
2019-07-10 15:24:36,612 BasebackupFetcher       MainThread      ERROR   Download stalled despite retries, aborting
FATAL: RestoreError: Backup download/extraction failed with 1 errors
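
For context, the messages above come from a stall watchdog: download progress is monitored, the downloaders are aborted after roughly 120 seconds without progress, and after three failed attempts the restore gives up with a fatal error. Below is a simplified sketch of that pattern; it is an illustration only, not pghoard's actual implementation, and start_download / get_progress are placeholder callables.

import time

STALL_TIMEOUT = 120   # seconds without progress before aborting downloaders
MAX_ATTEMPTS = 3      # attempts before giving up, matching the three ERROR lines above

def fetch_with_stall_detection(start_download, get_progress):
    for _attempt in range(MAX_ATTEMPTS):
        downloaders = start_download()
        last_progress, last_change = get_progress(), time.monotonic()
        while not downloaders.finished():
            time.sleep(1)
            progress = get_progress()
            if progress != last_progress:
                last_progress, last_change = progress, time.monotonic()
            elif time.monotonic() - last_change > STALL_TIMEOUT:
                downloaders.abort()  # "Download stalled for ... seconds, aborting downloaders"
                break
        else:
            return  # download finished normally
    raise RuntimeError("Download stalled despite retries, aborting")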

pghoard.json:

{
    "backup_location": "./metadata",
    "backup_sites": {
        "server-f-postgres-01": {
            "active_backup_mode": "pg_receivexlog",
            "basebackup_mode": "local-tar",
            "basebackup_chunks_in_progress": 5,
            "basebackup_chunk_size": 2147483648,
            "basebackup_hour": 5,
            "basebackup_interval_hours": 24,
            "basebackup_minute": 40,
            "pg_data_directory": "/var/lib/pgsql/9.5/data",
            "nodes": [
                {
                    "host": "127.0.0.1",
                    "user": "postgres",
                    "password": "secret",
                    "port": 5432
                }
            ],
            "object_storage": {
                "storage_type": "google",
                "project_id": "postgres-dev",
                "bucket_name": "test-pghoard"
            }
        }
    }
}
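
For context, here is a rough estimate of how many chunks this restore has to download, based on the configured basebackup_chunk_size and the total-size-plain reported in the backup metadata above. This is a back-of-the-envelope sketch that assumes chunking is applied to the uncompressed data.

import math

chunk_size = 2147483648          # "basebackup_chunk_size" from the config above (2 GiB)
total_size_plain = 37153730560   # "total-size-plain" from the basebackup metadata

print(math.ceil(total_size_plain / chunk_size))  # roughly 18 chunks to download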

ilicmilan commented on Jul 10, 2019

Hi, I have the same problem with pghoard 2.1.0. Any tips on how to solve it?

2020-10-16 11:00:09,131 BasebackupFetcher MainThread ERROR Download stalled for 120.13373475382105 seconds, aborting downloader

Thanks.

eriveltonvichroski commented on Oct 16, 2020

There shouldn't be any generic issue with this, as we've done a very large number of restores across all major cloud providers and haven't seen this. If it's reproducible, you should check what's happening at the network level.

rikonen commented on Oct 19, 2020

Hi,

On the line https://github.com/aiven/pghoard/blob/master/pghoard/rohmu/object_storage/google.py#L60

# googleapiclient download performs some 3-4 times better with 50 MB chunk size than 5 MB chunk size;
# but decrypting/decompressing big chunks needs a lot of memory so use smaller chunks on systems with less
# than 2 GB RAM
DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5 if get_total_memory() < 2048 else 1024 * 1024 * 50
UPLOAD_CHUNK_SIZE = 1024 * 1024 * 5

Debugging, including on a machine/network inside GCP itself, I realized that the problem occurs when the machine has > 2 GB of RAM, because execution takes the else branch of "if get_total_memory() < 2048 else 1024 * 1024 * 50":

DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5 if get_total_memory() < 2048 else 1024 * 1024 * 50

That is, the problem occurs when DOWNLOAD_CHUNK_SIZE is 50 MB.

First I tested with DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5 and the download was successful!

The maximum value at which the download still works is DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 25, i.e. 25 MB.
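
One possible temporary workaround is to force a smaller chunk size before running the restore. The following is an untested sketch: it assumes DOWNLOAD_CHUNK_SIZE is looked up as a module-level global at download time (as the snippet above suggests) and that the pghoard_restore entry point maps to pghoard.restore.main; verify both against your pghoard version.

#!/usr/bin/env python3
# Hypothetical workaround: override the GCS download chunk size, then run the
# restore. Both the module-global override and the restore.main entry point
# are assumptions; check them against the installed pghoard version.
import sys

from pghoard import restore
from pghoard.rohmu.object_storage import google as google_storage

google_storage.DOWNLOAD_CHUNK_SIZE = 1024 * 1024 * 5  # 5 MB instead of 50 MB

sys.argv = [
    "pghoard_restore", "get-basebackup",
    "--config", "pghoard.json",
    "--restore-to-master", "--overwrite",
    "--target-dir", "/var/lib/pgsql/9.5/data-new/",
]
sys.exit(restore.main())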

Is there an automated test that runs on a machine with > 2 GB of RAM?

Cheers

eriveltonvichroski commented on Oct 20, 2020

Is there an automated test that runs on a machine with > 2 GB of RAM?

Yes.

It would probably make sense to add an optional configuration parameter for setting the chunk size. 50 MiB performs better than 5 MiB, so it is preferable when download performance is important, and as mentioned we haven't seen any issues with it. Still, 50 MiB is a fairly large chunk size, and being able to set a smaller one via config would be reasonable, especially if the machine is otherwise memory constrained.
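
A minimal sketch of what such a parameter could look like; the name download_chunk_size and its placement under object_storage are hypothetical, not an existing pghoard option:

# Hypothetical sketch only: "download_chunk_size" is not an existing pghoard
# option; this just shows how an optional per-site override could fall back to
# the current memory-based default from google.py.
def resolve_download_chunk_size(object_storage_config, total_memory_mib):
    default = 1024 * 1024 * 5 if total_memory_mib < 2048 else 1024 * 1024 * 50
    return int(object_storage_config.get("download_chunk_size", default))

# Example: adding "download_chunk_size": 26214400 to the "object_storage"
# section would cap GCS download chunks at 25 MB regardless of available memory.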

rikonen commented on Oct 20, 2020