
ReadStream with a very large value results in excessive memory use when cache is enabled

Open: floren opened this issue 2 years ago • 2 comments

If the cache is enabled, readWithRLock always reads the file using a siphon.

The siphon code copies every byte it reads into a bytes.Buffer. When the full file has been read, that bytes.Buffer is used to update the cache.

However, if the underlying file is e.g. a gigabyte in size, the siphon will end up with a bytes.Buffer containing that entire gigabyte. Unless you've set your cache size to over a gigabyte, this gets thrown away as soon as the ReadStream is done.

The main reason we use ReadStream is so we can deal with very large items without having to stick the entire thing in memory at once. Having discovered this, we'll probably disable the cache, but there are cases where people may wish to have a cache enabled without blowing up their memory!

floren, Aug 18 '21 20:08

Do you use diskv in situations where there are values with such disparate sizes, and you hope to cache the smaller ones but not cache the larger ones?

peterbourgon, Aug 19 '21 00:08

I do use diskv in situations where some values are multiple gigabytes, and some are multiple megabytes.

We noticed this behavior when trying to figure out why sending items from one node's diskv to another took up so much memory. Once we figured out what was going on, we disabled the cache, but thought we'd report the behavior and offer a fix. I believe that if you configure a 100MB cache and read a 500MB item via ReadStream, you'd be surprised to learn that diskv makes a complete in-memory copy of the item and then always throws it away immediately.

If you're not particularly worried about that corner case, feel free to close this issue and the related PR.

floren, Aug 19 '21 00:08