
`litdata.optimize()` returns without raising an error before all worker processes have finished

Open • JacobARose opened this issue 7 months ago • 1 comment

🐛 Bug

This may be a more difficult problem to reproduce than I have the time for, but hopefully someone will have some insight into it.

This occurs when I run a Python script from the command line that calls `litdata.optimize()` with 4 workers on a 4-core CPU in a Lightning Studio.

When running `litdata.optimize()` on a fairly large (~160 GB) dataset, I noticed that the script would end but my CPU usage remained near 100%. When I checked the running processes in htop, I could see the multiprocess spawn command still running. Additionally, when I looked in the destination folder, I could see processes still depositing new binary chunks even after the script had returned without error. When I checked back later, CPU usage had dropped to normal, no more files were being added, and the destination directory contained about as much data as expected from the original dataset (~160 GB).

As long as the data is written without errors, this is a manageable inconvenience. However, since I'd like to be able to fully automate this part of the process, having to manually wait and check that the optimized data has finished writing before moving on to the next step is certainly undesirable.
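One way to automate the manual check described above is to poll the destination directory until it stops changing. The following is a minimal sketch, assuming that "no file created, resized, or modified for a while" is an acceptable proxy for completion. `wait_until_quiet`, `snapshot`, and their parameters are hypothetical names for illustration and are not part of litdata.

```python
import time
from pathlib import Path


def snapshot(directory):
    """Return {relative_path: (size, mtime_ns)} for every file under directory."""
    state = {}
    for path in Path(directory).rglob("*"):
        try:
            if path.is_file():
                st = path.stat()
                state[str(path.relative_to(directory))] = (st.st_size, st.st_mtime_ns)
        except FileNotFoundError:
            # A temporary file may vanish between listing and stat; skip it.
            continue
    return state


def wait_until_quiet(directory, quiet_seconds=60.0, poll_seconds=5.0, timeout=None):
    """Block until nothing under directory has changed for quiet_seconds.

    Returns True once the directory is quiet, False if timeout elapses first.
    This is a heuristic: a worker that stalls for longer than quiet_seconds
    would be mistaken for a finished one.
    """
    start = time.monotonic()
    last_state = snapshot(directory)
    last_change = start
    while True:
        now = time.monotonic()
        if now - last_change >= quiet_seconds:
            return True
        if timeout is not None and now - start >= timeout:
            return False
        time.sleep(poll_seconds)
        current = snapshot(directory)
        if current != last_state:
            last_state = current
            last_change = time.monotonic()
```

Calling `wait_until_quiet("path/to/output_dir")` right after the optimize step would hold the pipeline until writes stop. The quiet window has to be longer than the slowest gap between chunk writes, so it needs tuning per dataset.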

Does anyone have some insight into what could be causing this, or suggestions on how I could meaningfully reproduce it without using my full 160 GB dataset and code base?
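A scaled-down reproduction attempt could generate synthetic samples in place of the real dataset and then report whether any worker processes are still alive when `optimize()` returns. This is only a sketch under assumptions: the sample shape, payload size, sample count, and the `make_sample` and `run_repro` helpers are invented for illustration, and `multiprocessing.active_children()` only sees children that were started through Python's `multiprocessing` module in this process.

```python
import multiprocessing
import os
import sys
import time


def make_sample(index):
    """Return one synthetic sample; the 1 MiB payload size is an arbitrary choice."""
    return {"index": index, "payload": os.urandom(1024 * 1024)}


def run_repro(output_dir, num_samples=2000, num_workers=4):
    # Imported lazily so make_sample can be checked without litdata installed.
    from litdata import optimize

    start = time.monotonic()
    optimize(
        fn=make_sample,
        inputs=list(range(num_samples)),
        output_dir=output_dir,
        num_workers=num_workers,
        chunk_bytes="64MB",
    )
    elapsed = time.monotonic() - start

    # Any process listed here was still running after optimize() returned.
    leftover = multiprocessing.active_children()
    print(f"optimize() returned after {elapsed:.1f}s; live children: {len(leftover)}")
    for child in leftover:
        print(f"  still running: pid={child.pid} name={child.name}")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python repro.py <output_dir>
    run_repro(sys.argv[1])
```

If the behaviour depends on dataset size, raising `num_samples` or the payload size may be needed before anything shows up. Checking `htop` after the script exits remains the ground truth.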

Thanks!

JacobARose • Apr 29 '25 02:04

Hey @JacobARose, do you have a Studio that you could share with us?

tchaton • Apr 29 '25 13:04

Closing due to inactivity 🙂 Feel free to reopen if the issue persists or if more details become available.

bhimrazy • Jun 03 '25 19:06