
Cache folder size keeps increasing even after max-cache-size is exceeded

Open nguyenminhdungpg opened this issue 11 months ago • 8 comments

Mountpoint for Amazon S3 version

mount-s3 1.16.2

AWS Region

No response

Describe the running environment

N/A

Mountpoint options

N/A

What happened?

I have max-cache-size set to around 400000 MiB. My file-read traffic is very high, so once the cache folder size exceeds max-cache-size, it keeps increasing. I suspect the cause is in this logic: "By default, Mountpoint will limit the maximum size of the local cache such that the free space on the file system does not fall below max-cache-size, and will automatically evict the least recently used content from the local cache when caching new content." If the evicted item is smaller than the newly cached content, the cache folder continues to grow.

I wonder what will happen if the cache folder size exceeds the available disk space? Will mount-s3 crash or malfunction in that case because of cache failures?

Relevant log output


nguyenminhdungpg avatar Apr 28 '25 02:04 nguyenminhdungpg

@dannycjones this is urgent, could you give me a quick response? Thank you.

nguyenminhdungpg avatar Apr 28 '25 03:04 nguyenminhdungpg

Hi @nguyenminhdungpg, thanks for reporting this. We don't expect Mountpoint to crash when the disk is full, but it may result in degraded performance due to errors when writing to or reading from the cache. I can see how Mountpoint using more space than configured can be an issue, and I'm happy to investigate that. A couple of questions:

  1. How much beyond the limit of 400GiB do you observe Mountpoint to use for cache?
  2. Does disk usage keep increasing, or does it saturate at some value above the configured limit?
  3. Do you observe any errors in logs, in particular in regards to cache block eviction (unable to evict block)?

Please also share Mountpoint debug logs capturing the moment when Mountpoint uses more space than configured, so we can investigate the issue further (debug logging may be enabled with the --debug CLI flag).


In regards to *if the evicted item is smaller than the new cache content, the cache folder continues to increase its size*: that shouldn't be an issue, since Mountpoint evicts multiple blocks until the limit is met again: https://github.com/awslabs/mountpoint-s3/blob/main/mountpoint-s3-fs/src/data_cache/disk_data_cache.rs#L330
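For readers following along, the linked eviction logic can be sketched roughly like this (illustrative only; `LruTracker` and its fields are simplified stand-ins I made up, not Mountpoint's actual types):

```rust
use std::collections::VecDeque;

/// Simplified stand-in for Mountpoint's usage tracking (illustrative only;
/// the real implementation lives in disk_data_cache.rs).
struct LruTracker {
    /// Front = least recently used. Each entry is a block's size in bytes.
    entries: VecDeque<u64>,
    /// Total tracked size of all cached blocks, in bytes.
    size: u64,
}

impl LruTracker {
    /// Evict LRU entries in a loop until `incoming` more bytes fit under
    /// `limit`. Returns how many entries were evicted. Note this is a loop,
    /// not a single eviction, so one large insert can evict many small blocks.
    fn evict_if_needed(&mut self, incoming: u64, limit: u64) -> usize {
        let mut evicted = 0;
        while self.size + incoming > limit {
            match self.entries.pop_front() {
                Some(block_size) => {
                    self.size -= block_size;
                    evicted += 1;
                }
                None => break, // cache already empty; nothing left to evict
            }
        }
        evicted
    }
}
```

So a single small eviction being outpaced by a larger incoming block is not expected: the loop keeps evicting until the tracked size (plus the incoming block) is back under the limit.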

vladem avatar Apr 28 '25 15:04 vladem

Hi @vladem , thank you very much.

Let me give you more detail. My system has very high read traffic to S3 files, millions of requests per 24 hours. In front of mount-s3 I have several layers of cache, so the read requests that actually reach mount-s3 are reduced to around 1-2 million per day. (Uploaded files are saved into the S3 bucket by another sub-system, about 100 thousand files per day.) Earlier, I routed only a small percentage of all traffic to mount-s3, and mount-s3 managed to saturate at some value above the configured limit (default 5% free space, or 300000 MiB of a 400GB EBS volume). As that seemed to work fine, I decided to route 100% of traffic to mount-s3 with a larger EBS volume (500GB) and max-cache-size of 450000. Look at this:

```
~# while true; do echo $(date && df -k /dev/nvme1n1p1 |tail -1) && sleep 2; done
Tue Apr 29 09:23:25 +07 2025 /dev/nvme1n1p1 514935040 491816784 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:27 +07 2025 /dev/nvme1n1p1 514935040 491816172 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:30 +07 2025 /dev/nvme1n1p1 514935040 491817008 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:32 +07 2025 /dev/nvme1n1p1 514935040 491816936 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:34 +07 2025 /dev/nvme1n1p1 514935040 491816876 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:36 +07 2025 /dev/nvme1n1p1 514935040 491816848 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:38 +07 2025 /dev/nvme1n1p1 514935040 491816580 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:40 +07 2025 /dev/nvme1n1p1 514935040 491817176 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:42 +07 2025 /dev/nvme1n1p1 514935040 491817128 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:44 +07 2025 /dev/nvme1n1p1 514935040 491817264 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:46 +07 2025 /dev/nvme1n1p1 514935040 491816876 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:48 +07 2025 /dev/nvme1n1p1 514935040 491817100 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:50 +07 2025 /dev/nvme1n1p1 514935040 491816928 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:52 +07 2025 /dev/nvme1n1p1 514935040 491817176 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:54 +07 2025 /dev/nvme1n1p1 514935040 491816708 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:56 +07 2025 /dev/nvme1n1p1 514935040 491817432 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:23:58 +07 2025 /dev/nvme1n1p1 514935040 491817060 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:00 +07 2025 /dev/nvme1n1p1 514935040 491817312 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:02 +07 2025 /dev/nvme1n1p1 514935040 491816984 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:04 +07 2025 /dev/nvme1n1p1 514935040 491817460 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:06 +07 2025 /dev/nvme1n1p1 514935040 491817216 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:08 +07 2025 /dev/nvme1n1p1 514935040 491817368 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:10 +07 2025 /dev/nvme1n1p1 514935040 491817584 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:12 +07 2025 /dev/nvme1n1p1 514935040 491817228 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:14 +07 2025 /dev/nvme1n1p1 514935040 491817708 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:16 +07 2025 /dev/nvme1n1p1 514935040 491817388 0 100% /mnt/s3-mountpoint-cache
Tue Apr 29 09:24:18 +07 2025 /dev/nvme1n1p1 514935040 491817684 0 100% /mnt/s3-mountpoint-cache
```

You can see mount-s3 trying to saturate at the limit, but with extremely high traffic the cache folder size keeps increasing. 450000 MiB equals 460800000 KB, so the cache folder size already exceeds the limit by far and will soon reach the volume size.

I can't switch to debug logging for my mount-s3: with this high traffic, mount-s3 produces GBs of logs in less than a day, and the logs are synced to CloudWatch, which costs me a lot. Besides that, with a cache folder of around 500GB, it takes a long time to remove it before re-mounting mount-s3, which causes long downtime. I have set the log level to WARN, and the log contains only cache-read errors like this:

```
2025-04-29T02:30:55.160226Z WARN mountpoint_s3_fs::data_cache::disk_data_cache: block could not be deserialized: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
2025-04-29T02:30:55.160424Z WARN mountpoint_s3_fs::prefetch::caching_stream: error reading block from cache, falling back to S3 cache_key=ObjectId { key: "-----", etag: ETag(""-----"") } block_index=0 range=151552..633797 out of 633797 error=InvalidBlockContent
2025-04-29T02:31:04.316372Z WARN mountpoint_s3_fs::prefetch::caching_stream: error reading block from cache, falling back to S3 cache_key=ObjectId { key: "-----", etag: ETag(""-----"") } block_index=0 range=65536..614590 out of 614590 error=IoFailure(failed to fill whole buffer)
2025-04-29T02:40:30.700485Z WARN mountpoint_s3_fs::prefetch::caching_stream: error reading block from cache, falling back to S3 cache_key=ObjectId { key: "-----", etag: ETag(""-----"") } block_index=0 range=65536..149332 out of 149332 error=IoFailure(failed to fill whole buffer)
2025-04-29T02:42:34.794958Z WARN mountpoint_s3_fs::data_cache::disk_data_cache: block could not be deserialized: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
2025-04-29T02:42:34.795626Z WARN mountpoint_s3_fs::prefetch::caching_stream: error reading block from cache, falling back to S3 cache_key=ObjectId { key: "-----", etag: ETag(""-----"") } block_index=0 range=8192..124104 out of 124104 error=InvalidBlockContent
2025-04-29T02:43:12.350523Z WARN mountpoint_s3_fs::data_cache::disk_data_cache: block could not be deserialized: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
2025-04-29T02:43:12.350773Z WARN mountpoint_s3_fs::prefetch::caching_stream: error reading block from cache, falling back to S3 cache_key=ObjectId { key: "-----", etag: ETag(""-----"") } block_index=0 range=65536..777697 out of 777697 error=InvalidBlockContent
```

I masked some values because of my organization's rules.

nguyenminhdungpg avatar Apr 29 '25 02:04 nguyenminhdungpg

Thanks for providing additional information. I've extracted the following from your message:

  - Disk type: EBS volume
  - Total disk capacity: 500GB (~491 GiB available for data)
  - Configured max cache size: 450,000 MiB (~439 GiB)
  - Reported disk usage: 480,290 MiB (~469 GiB) (491817684 KiB)

  1. Am I correct in assuming that Mountpoint is using about 30 GiB more than the configured cache size?
  2. Is the EBS volume used exclusively by Mountpoint for cache, or may some other process take up space? (If not exclusive, consider using `du -sh /path/to/cache-dir` to estimate the cache's actual disk usage.)
  3. Could you provide the full command that you use to launch Mountpoint please?
  4. Do you observe any No space left on device errors in Mountpoint logs?

vladem avatar Apr 29 '25 12:04 vladem

Hi @vladem, here is the information:

  1. Yes, and it keeps increasing, slowly but steadily. I guess the OS reserves about 10% of the volume, so the `df -k` command shows the total volume size as only 514935040 KB.
  2. Yes, the EBS volume is used only for the mount-s3 cache. The EC2 root volume is for the OS and the other apps. The cache folder is so big that the `du` command takes too long to finish and I can't wait, but I'm sure the EBS volume is only for the mount-s3 cache: `mount -t ext4 /dev/nvme1n1p1 /mnt/s3-mountpoint-cache`
  3. My command is: `/usr/bin/mount-s3 the_bucket /mnt/s3-mountpoint-data --region=ap-southeast-1 --allow-other --allow-delete --dir-mode=0777 --file-mode=0644 --log-directory /var/log/mountpoint-s3 --cache /mnt/s3-mountpoint-cache --metadata-ttl 15552000 --negative-metadata-ttl 300 --max-cache-size 450000`
  4. No sir:

```
~# cat /var/log/mountpoint-s3/mountpoint-s3-2025-04-17T03-41-59Z.xduBcm.log-20250419 | grep 'space'
~#
```

nguyenminhdungpg avatar Apr 29 '25 13:04 nguyenminhdungpg

Hi @vladem, @passaro, I think there is something wrong with fn evict_if_needed(): UsageInfo is used to keep track of cache item metadata and to evict the LRU item, while files are put/deleted in other, separate functions. I think that with high traffic and highly concurrent accesses, this will produce a lot of orphan files in the cache folder, so UsageInfo's size always stays below the limit while the physical cache folder keeps growing. I think you should add a way to clean up untracked or orphan files in the cache folder, e.g. add a field named removed_entries to UsageInfo, plus a task that periodically scans it and deletes the "removed" files in case they still exist in the cache folder.
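A periodic cleanup like the one suggested could look roughly like this (a minimal sketch under my own naming; `sweep_orphans` and the tracked set are hypothetical illustrations, not part of Mountpoint's API):

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical cleanup task (names are mine, not Mountpoint's): delete any
/// file in the cache directory that the usage tracker does not know about.
/// Returns the number of orphan files removed.
fn sweep_orphans(cache_dir: &Path, tracked: &HashSet<PathBuf>) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(cache_dir)? {
        let path = entry?.path();
        // A real implementation would recurse into block subdirectories;
        // this sketch only handles a flat directory.
        if path.is_file() && !tracked.contains(&path) {
            fs::remove_file(&path)?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

Running such a sweep on a timer (or when disk pressure is detected) would bound how far on-disk usage can drift from the tracked usage, at the cost of an occasional directory scan.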

nguyenminhdungpg avatar May 06 '25 06:05 nguyenminhdungpg

Hi, I agree it's possible for files to become orphaned in the evict_if_needed function if we fail to remove them for some reason other than that they didn't exist, which we should likely fix. However, we're also trying to understand whether this is what happened in your case: can you please confirm whether you see the "unable to evict block" warning in your logs?

I'm wondering if there is also some different problem here, where we add a block but don't track it. If you don't see the "unable to evict" message, it might rule out problems on removal as the cause.


Here's a sample test case that triggers the file-removal failure by changing directory permissions, which I think demonstrates the problem you're talking about:

    #[tokio::test]
    async fn test_eviction_with_file_removal_failure() {
        // GIVEN: a full cache where object 1 block 1 is the LRU:
        // Create a cache with a small size limit
        const BLOCK_SIZE: usize = 100 * 1024; // 100 KB
        const CACHE_LIMIT: usize = 3 * BLOCK_SIZE;

        let cache_directory = tempfile::tempdir().unwrap();
        let cache_path = cache_directory.path().to_path_buf();

        let cache = DiskDataCache::new(DiskDataCacheConfig {
            cache_directory: cache_path.clone(),
            block_size: BLOCK_SIZE as u64,
            limit: CacheLimit::TotalSize { max_size: CACHE_LIMIT },
        });

        // Put 3 blocks:
        let object = ObjectId::new("object1".into(), ETag::for_tests());
        let data = ChecksummedBytes::new(vec![1u8; BLOCK_SIZE].into());
        for block in 0..3 {
            cache.put_block(object.clone(), block, block * BLOCK_SIZE as u64, data.clone(), BLOCK_SIZE * 3).await.unwrap();
        }

        // Test validation: Verify all blocks are in the cache
        for block in 0..3 {
            assert!(cache.get_block(&object, block, block * BLOCK_SIZE as u64, 3 * BLOCK_SIZE).await.unwrap().is_some());
        }

        // WHEN: We add a fourth block and fail to evict a prior block's data:
        // Make the directory containing object1's block read-only to simulate file removal failure
        let block_key1 = DiskBlockKey::new(&object, 0);
        let path1 = cache.get_path_for_block_key(&block_key1);
        let parent_dir = path1.parent().unwrap();
        let mut perms = fs::metadata(parent_dir).unwrap().permissions();
        perms.set_readonly(true);
        fs::set_permissions(parent_dir, perms).unwrap();

        // Now try to add a 4th block, which should trigger eviction
        // The eviction of object1's block will fail
        let object2 = ObjectId::new("object2".into(), ETag::for_tests());
        cache.put_block(object2.clone(), 0, 0, data.clone(), BLOCK_SIZE).await.unwrap();

        // Make the directory writable again so we can check if files exist
        let mut perms = fs::metadata(parent_dir).unwrap().permissions();
        perms.set_readonly(false);
        fs::set_permissions(parent_dir, perms).unwrap();

        // Test validity: Check if object1's block is still on disk despite being evicted from tracking
        for block in 0..3 {
            let block_path = cache.get_path_for_block_key(&DiskBlockKey::new(&object, block));
            assert!(block_path.exists(), "Block file should still exist on disk");
        }

        // THEN: The cache size on disk and in tracking should match.
        // Calculate actual disk usage by checking all files in the cache directory
        let actual_disk_usage: usize = walkdir::WalkDir::new(&cache_path)
            .into_iter()
            .filter_map(Result::ok)
            .filter(|e| e.file_type().is_file())
            .map(|e| (e.metadata().unwrap().len() as usize))
            .sum();

        // Get the tracked usage from the cache
        let tracked_usage = cache.usage.expect("cache was configured with size").lock().unwrap().size;

        // The actual disk usage should be equal the tracked usage
        assert!(
            actual_disk_usage == tracked_usage,
            "Actual disk usage ({}) should equal the tracked usage ({})",
            actual_disk_usage,
            tracked_usage
        );

        // The tracked usage should be approximately 4 * BLOCK_SIZE (plus some overhead)
        // because we've added 4 blocks but evicted 1 from tracking
        assert!(
            tracked_usage >= 4 * BLOCK_SIZE,
            "Tracked usage ({}) should be at least 4 * BLOCK_SIZE ({})",
            tracked_usage,
            4 * BLOCK_SIZE
        );

        // The actual disk usage should be approximately 4 * BLOCK_SIZE (plus some overhead)
        // because all 4 blocks are still on disk
        assert!(
            actual_disk_usage >= 4 * BLOCK_SIZE,
            "Actual disk usage ({}) should be at least 4 * BLOCK_SIZE ({})",
            actual_disk_usage,
            4 * BLOCK_SIZE
        );
    }

adpeace avatar May 07 '25 14:05 adpeace

Hi @adpeace, I have the log level set to WARN, and there are only log entries like this:

```
2025-04-29T02:42:34.794958Z WARN mountpoint_s3_fs::data_cache::disk_data_cache: block could not be deserialized: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
2025-04-29T02:42:34.795626Z WARN mountpoint_s3_fs::prefetch::caching_stream: error reading block from cache, falling back to S3 cache_key=ObjectId { key: "-----", etag: ETag(""-----"") } block_index=0 range=8192..124104 out of 124104 error=InvalidBlockContent
```

So I think the normal flows work fine, but something might be silently going wrong. The code only takes a lock for each individual UsageInfo call, but under highly concurrent flows that might not be enough. I'm trying to figure out a flow that could leave a block untracked.

The cache had come within 2GB of the volume capacity, so I restarted mount-s3 two days ago. The cache size is now about to reach max-cache-size (90%). It grows very fast (10-20MB/s). But when it was over the limit several days ago, I saw it grow slowly (a few hundred KB/s) and sometimes even shrink. This time I reduced max-cache-size to 400000MiB, so the slow growth will buy me more time until I have to restart mount-s3 again.

I think that with concurrent accesses this high and fast, there may be brief moments where other flows silently break the evict_if_needed flow. My guess is that with about 3000 read requests to mount-s3 per minute, when the cache is in the exceeded state (as indicated by the UsageInfo instance), the evict_lru invocations are frequent and finish fast, but remove_file may run a bit slower, while a huge number of incoming blocks are being saved to the cache folder. So the metadata in UsageInfo always lags behind the physical files on disk, including orphan files. In other words, evict_if_needed still does its job but fails to prevent the cache from exceeding the limit.
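The theory above can be expressed as a toy model (my own illustration, not Mountpoint's code): if eviction updates the tracked size immediately while the matching file deletions are deferred or lost, on-disk usage climbs past the limit even though the tracker believes it is exactly at the limit.

```rust
/// Toy model of the commenter's theory (illustrative only, not Mountpoint's
/// code). `tracked_bytes` is what UsageInfo believes; `disk_bytes` is reality.
struct CacheModel {
    tracked_bytes: u64,
    disk_bytes: u64,
    /// Bytes evicted from tracking whose files are not yet deleted.
    pending_removals: u64,
}

impl CacheModel {
    /// Put one block of `size` bytes, evicting from *tracking* as needed but
    /// deferring the actual file deletions (simulating a slow remove_file).
    fn put_block(&mut self, size: u64, limit: u64) {
        // Assumes uniform block size for simplicity.
        while self.tracked_bytes >= size && self.tracked_bytes + size > limit {
            self.tracked_bytes -= size;
            self.pending_removals += size;
        }
        self.tracked_bytes += size;
        self.disk_bytes += size;
    }

    /// The deferred deletions finally complete.
    fn flush_removals(&mut self) {
        self.disk_bytes -= self.pending_removals;
        self.pending_removals = 0;
    }
}
```

In this model the gap between `disk_bytes` and `tracked_bytes` is exactly the backlog of pending removals, which matches the observation that disk usage grows slowly past the limit under sustained load and shrinks again when removals catch up.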

nguyenminhdungpg avatar May 08 '25 04:05 nguyenminhdungpg