Stas Bekman
This appears to be the same issue: https://github.com/huggingface/datasets/issues/4528 I dug into the repro code there and it shows the same behavior with the same leak, but it's a pure...
I went all the way back to `pyarrow==1.0.0` and `datasets==1.12.0` and the problem is still there. How is it even possible that it wasn't noticed all this time? Could it...
Also found this warning:

> Be careful: if you don't pass the ArrowArray struct to a consumer,
> array memory will leak. This is a low-level function intended for
> ...
Yes, we have already established here https://github.com/huggingface/datasets/issues/4883#issuecomment-1232063891 that when one iterates over the whole dataset multiple times, it consumes a bit more memory in the next few repetitions and then...
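To make the "iterate several times and watch memory" check concrete, here is a minimal sketch. It is not the repro from this thread: `traced_bytes_per_epoch` and `synthetic_records` are hypothetical names, and the generator is only a stand-in for a real dataset. Note that `tracemalloc` only sees allocations made through Python's allocator, so it will not show mmap'd pages or memory held by a C++ memory pool.

```python
import gc
import tracemalloc


def traced_bytes_per_epoch(make_iterable, epochs=5):
    """Iterate `epochs` times; return the traced Python heap size after each pass."""
    sizes = []
    tracemalloc.start()
    try:
        for _ in range(epochs):
            for _record in make_iterable():
                pass  # consume the record and drop it immediately
            gc.collect()  # rule out garbage that is merely awaiting collection
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
    finally:
        tracemalloc.stop()
    return sizes


def synthetic_records(n=10_000, length=100):
    """Hypothetical stand-in for a dataset: n identical string records."""
    for _ in range(n):
        yield "x" * length


if __name__ == "__main__":
    for epoch, size in enumerate(traced_bytes_per_epoch(synthetic_records)):
        print(f"epoch {epoch}: {size} bytes traced")
```

If the per-epoch numbers stop growing after the first few passes, the Python side is probably not holding on to records; growth that shows up only in RSS would then have to come from somewhere `tracemalloc` cannot see.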
Thank you for clarifying, Ross. I think we agree that it's almost certain that the `datasets` iterator traps some inner variable that prevents object freeing, since if we create the...
# Notes

After reading many issues and trying many things, here is a summary of what I've learned.

I'm now using @lhoestq's repro case as it's pyarrow-isolated: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242034985

## 1. pyarrow...
# There is no leak, just badly communicated Linux RSS memory usage stats

Next, let's revisit @rwightman's suggestion that there is actually no leak. After all, we are using...
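A rough way to see why RSS can mislead with memory-mapped files is sketched below. This is only an illustration using the standard library, not the measurement code from this thread; `read_rss_kb`, `touch_mmap` and `make_file` are hypothetical helpers, and the `RssAnon`/`RssFile` split is only reported by reasonably recent Linux kernels. On other systems the dicts simply come back empty.

```python
import mmap
import os
import tempfile


def read_rss_kb():
    """Return RSS counters in KiB from /proc/self/status (Linux only).

    Counters the kernel does not report are simply absent from the dict.
    """
    fields = {}
    try:
        with open("/proc/self/status") as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in ("VmRSS", "RssAnon", "RssFile"):
                    fields[key] = int(value.split()[0])
    except OSError:
        pass  # no /proc available: return an empty dict
    return fields


def touch_mmap(path, page=mmap.PAGESIZE):
    """Memory-map `path` read-only and read one byte per page.

    Returns (sum_of_bytes_read, rss_before, rss_during, rss_after).
    """
    before = read_rss_kb()
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            total = 0
            for offset in range(0, len(mm), page):
                total += mm[offset]  # faults the page in
            during = read_rss_kb()
        finally:
            mm.close()
    after = read_rss_kb()
    return total, before, during, after


def make_file(size_mb=64):
    """Write a temporary file of `size_mb` MiB filled with 0x01 bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        chunk = b"\x01" * (1024 * 1024)
        for _ in range(size_mb):
            f.write(chunk)
    return path


if __name__ == "__main__":
    path = make_file()
    try:
        total, before, during, after = touch_mmap(path)
        print("before:", before)
        print("during:", during)
        print("after: ", after)
    finally:
        os.remove(path)
```

On Linux I would expect `VmRSS` to rise while the mapping is being read, with the increase attributed to `RssFile` rather than `RssAnon`, since touched file-backed pages are counted as resident even though the kernel can reclaim them under memory pressure.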
The original leak in the multi-modal code is very likely something else. But of course now it'd be very difficult to trace it using mmap. I think to debug we...
@lhoestq, I have been working on a detailed article that shows that MMAP doesn't leak and it's mostly ready. I will share when it's ready. The issue is that we...
As I suggested on Slack, perhaps it was due to variation in dataset record lengths, so with your help I wrote another repro with synthetic records which are all identical -...