pyfilesystem2
Suggestion: Optional Size Limits
I would like to suggest that it would be a valuable enhancement to add the ability to set and query size limits, especially for the memory and temp file systems. This would be especially useful for tests that could potentially overload the system. I can see it also being useful on things like AWS file systems, where exceeding a given size may have financial consequences. It would also be very nice to be able to query the available space on just about every file system, e.g. to know whether adding to a zip file created without zip64 support would push it past that format's size limit.
It's a tricky proposition. It's not always possible to get used space from a filesystem in any meaningful sense. And on some filesystems it may require an expensive scan. Even if you know the file size of all the files, you might not know how much of the physical storage is used.
It may be possible to report such information via meta, with the caveat that some filesystems won't be able to return a value.
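For illustration, querying meta already works along these lines today; a "used_space" key is purely hypothetical (it is not part of the current standard namespace), so callers would have to cope with it simply being absent:
import fs

my_fs = fs.open_fs("mem://")
meta = my_fs.getmeta()  # standard namespace, returns a plain dict

# "used_space" is hypothetical -- .get() returns None on filesystems
# that cannot provide a value.
used = meta.get("used_space")
if used is not None:
    print("bytes used:", used)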
My main concern is that on the memory file system it would be possible to crash the system if, e.g., you have 16 GB of RAM and try to write 20 GB of files.
That would likely work, assuming you have enough virtual memory. But yeah, you will have to guard against MemoryError exceptions if you think there is a chance of running out of memory.
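A minimal sketch of that kind of guard, with the sizes matching the example above (the filename is arbitrary):
import fs

mem_fs = fs.open_fs("mem://")
chunk = b"\x00" * (1024 * 1024)  # 1 MB
try:
    with mem_fs.openbin("big.bin", "w") as f:
        for _ in range(20 * 1024):  # attempt ~20 GB in 1 MB chunks
            f.write(chunk)
except MemoryError:
    # Back out: whatever was written so far is still held in RAM.
    if mem_fs.exists("big.bin"):
        mem_fs.remove("big.bin")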
If you are trying to isolate a test case that might have run-away file usage, being able to set a limit would be very handy.
It probably would, but we can't add anything to the filesystem interface that couldn't be supported everywhere.
If you want to get the total file size in a filesystem you could do this:
bytes_used = my_filesystem.glob("**/*").count().data
For MemoryFS that should be reasonably fast.
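(For reference, and assuming current glob behaviour, count() returns a namedtuple, so the same call also reports file and directory totals alongside the byte count:)
counts = my_filesystem.glob("**/*").count()
print(counts.files, counts.directories, counts.data)  # data is the total bytes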
It probably would, but we can't add anything to the filesystem interface that couldn't be supported everywhere.
I agree. But I guess that (unlike other FSes) MemoryFS and TempFS are guaranteed to be empty when first constructed, so theoretically they should be able to keep track of how many bytes get written into them? Which (again theoretically) means they could prevent writes when that total goes over some predefined limit? (although I guess if you wanted to impose a hard limit you'd also need to count the number of bytes you're going to write before actually writing them, which may incur an overhead? :man_shrugging: )
Of course you could still access a TempFS's syspath and copy files into it from outside PyFilesystem... :wink:
The thing that is currently missing is the facility to set a hard limit on the size. I would like to be able to do something like:
import fs

mem_fs = fs.open_fs('mem://')
mem_fs.set_maxsize(10)  # Limit of 10 MB
with mem_fs.openbin('bigfile.bin', 'wb') as outfile:
    outfile.write(b'\x00' * 1024 * 1024 * 9)  # Write a 9 MB file
with mem_fs.openbin('bigfile2.bin', 'wb') as outfile:
    outfile.write(b'\x00' * 1024 * 1024 * 2)  # Try to write a further 2 MB file
# Exception raised!
mem_fs = fs.open_fs('mem://')
mem_fs.set_maxsize(10)  # Limit of 10 MB
I'd vote for it being a constructor-parameter rather than a method, and for the granularity being in bytes rather than megabytes :wink:
mem_fs = MemoryFS(max_size=10 * 1024 * 1024)
It's definitely possible with MemFS and, as @lurch pointed out, TempFS too, with the caveat that it's possible for files to be written outside of PyFilesystem.
But it could never be a generic thing, as it's impractical for other filesystems. Getting the initial count could be very expensive.
TBH MemFS is probably the only real contender for this. Implementation for TempFS would be a challenge. MemFS could take a max_size argument as suggested, and we could use the already defined InsufficientStorage exception.
Actually, thinking about it, setting a maximum writable quota for all other file systems, while ignoring the possibilities that:
- the actual file system may not have the space available for all of that maximum to be written (this dispenses with the need to check current free space);
- on some file systems other processes could add additional files/content that push the total consumed on that file system over the quota (this dispenses with the need to monitor the file system);
- other processes could remove content from the file system (ignoring this simplifies things again: no need to monitor);
would give a possible "max process write quota" that just needs to accumulate bytes written and handle the copy and delete functions, which should be relatively simple and low overhead.
I think that this could potentially be useful on all file systems, but it has particular benefits for:
- Memory
- Temporary
- Chargeable file systems (such as AWS)
- Cases where there are charges for the amount of data transferred, e.g. over mobile networks
I think that if this feature were added for "all" filesystems (as you're suggesting), points 2 and 3 in your list above would effectively render it moot. People would get confused (and/or upset) if they'd set some limit in PyFilesystem, but then other non-pyfilesystem processes pushed the FS usage over that limit.
Of course what you could do (and IMHO might be a better option) would be to create a WrapFS that counts the number of bytes written to the wrapped filesystem? I think that's more flexible and also makes it clearer that it's not a "builtin feature" ? If you wanted to get fancy you could even extend your WrapFS to remove bytes from your "quota" when files get deleted? :wink: (which obviously adds the small overhead of reading a file's size before deleting it, which may get expensive when recursively deleting large directories?)
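A very rough, untested sketch of that idea (QuotaFS and _CountingFile are made-up names; only openbin() is intercepted, so writebytes(), upload(), copy() and friends would need the same treatment, and deleted files are not credited back against the quota):
import fs
from fs.errors import InsufficientStorage
from fs.wrapfs import WrapFS


class _CountingFile(object):
    """Proxy around a file handle that charges write() against a quota."""

    def __init__(self, raw, quota_fs, path):
        self._raw = raw
        self._quota_fs = quota_fs
        self._path = path

    def write(self, data):
        q = self._quota_fs
        if q.bytes_written + len(data) > q.max_bytes:
            raise InsufficientStorage(path=self._path)
        written = self._raw.write(data)
        q.bytes_written += written
        return written

    def __getattr__(self, name):
        # Delegate everything else (read, seek, close, ...) to the real file.
        return getattr(self._raw, name)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._raw.close()


class QuotaFS(WrapFS):
    """Wrap any filesystem and refuse writes once a byte quota is exceeded."""

    def __init__(self, wrap_fs, max_bytes):
        super().__init__(wrap_fs)
        self.max_bytes = max_bytes
        self.bytes_written = 0

    def openbin(self, path, mode="r", buffering=-1, **options):
        _fs, _path = self.delegate_path(path)
        raw = _fs.openbin(_path, mode=mode, buffering=buffering, **options)
        return _CountingFile(raw, self, path)


# Example: a 10 MB write quota on top of a MemoryFS.
quota_fs = QuotaFS(fs.open_fs("mem://"), max_bytes=10 * 1024 * 1024)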
AWS extended the maximum Lambda memory to 10 GB in December 2020. As Lambda does not let you mount a tmpfs (where one can set a maximum size), pyfilesystem becomes the primary cost-effective way to work with large files (the next-best option is EFS, which costs an arm and a leg to get decent performance).
The size-limitation functionality becomes key, as the Lambda function is silently killed if it exceeds its memory limit.
I vote for this feature and agree with the discussion so far that this only makes sense for MemFS. However, it is a very important and integral part of MemFS.
@pspot2 I filed a ticket for something like this a while ago and I think we might be able to do it now. I threw together a pretty janky proof of concept that in theory works like this:
- User creates MemoryFS and passes a maximum file system size.
- MemoryFS creates a FileSizeWatcher that spins off a separate thread.
- Whenever a file is created, the MemoryFS will notify the watcher about it. The watcher keeps a weak reference to the file object, so it won't prevent files from being closed.
- The file watcher thread will periodically look at all the files it's aware of and take the sum of all the file sizes. If it exceeds the threshold, it will choose a file and evict it to the disk, repeating the process until we're back below the limit. (My overkill implementation allows the algorithm to be chosen.)
This implementation suffers from a couple glaring problems, namely:
- It assumes SpooledTemporaryFile is threadsafe.
- It periodically polls memory usage instead of acting immediately.
- Because the counting system relies on tell(), text files report their size in characters, not bytes. Thus, a UTF-16 file would falsely report its memory usage as half of what it actually is, since each character is stored as two bytes.
It's crap, but it's a start. (Note -- I had to rename the file with ".txt" as the file extension to upload it. Change it to ".py" if you download it.)
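For anyone who would rather read code than bullet points, here is a rough, untested sketch of the watcher described above; only the name FileSizeWatcher comes from that description, and details such as peeking at SpooledTemporaryFile's private _rolled flag are guesses rather than what the attached file actually does:
import threading
import time
import weakref


class FileSizeWatcher:
    """Periodically sum the sizes of watched in-memory files and spill the
    largest ones to disk once the total exceeds max_size."""

    def __init__(self, max_size, poll_interval=0.5):
        self.max_size = max_size
        self.poll_interval = poll_interval
        # Weak references only, so watching a file never keeps it alive.
        self._files = weakref.WeakSet()
        threading.Thread(target=self._run, daemon=True).start()

    def watch(self, fileobj):
        # The filesystem calls this whenever it creates a new file.
        self._files.add(fileobj)

    @staticmethod
    def _size(f):
        try:
            return f.tell()  # characters, not bytes, for text-mode files!
        except ValueError:
            return 0  # file already closed

    def _run(self):
        while True:
            time.sleep(self.poll_interval)
            # Only files whose data still lives in memory count; _rolled is
            # a private SpooledTemporaryFile attribute, hence the guesswork.
            in_memory = [f for f in self._files if not getattr(f, "_rolled", True)]
            in_memory.sort(key=self._size, reverse=True)
            total = sum(self._size(f) for f in in_memory)
            for f in in_memory:
                if total <= self.max_size:
                    break
                size = self._size(f)
                f.rollover()  # spill this file's buffer to a real temp file
                total -= size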