lm_dataformat
"current chunk incomplete" without any jsonl.zst file
I'm trying to write lmd files using the ar.commit() method. I add a lot of data to the archive in a loop, but afterwards the output directory contains only a current chunk incomplete file, about 10GB in size, and no jsonl.zst file.
Should I split the data and create multiple jsonl.zst files instead of adding everything to the same file? Or is there a better fix?
If you call commit(), it should rename the incomplete chunk and give you your zst file. I'm not sure about this version, but if you are still interested, you can check out my fork: https://github.com/lfoppiano/lm_dataformat
https://github.com/lfoppiano/stackexchange-dataset/blob/master/pairer.py#L82C1-L85C23
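To make the rename-on-commit behaviour concrete, here is a minimal standard-library sketch of the pattern. It is not lm_dataformat's code: `ChunkedWriter`, `write_in_chunks`, and `docs_per_chunk` are hypothetical names, and gzip stands in for zstd so the example has no dependencies. It shows why data stays in the incomplete file until a commit happens, and how committing every few documents plus once after the loop yields several finished chunk files.

```python
import gzip
import json
import os
import tempfile


class ChunkedWriter:
    """Hypothetical stand-in for an archive writer (not lm_dataformat itself)."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        os.makedirs(out_dir, exist_ok=True)
        self.index = 0
        self._open_chunk()

    def _open_chunk(self):
        # All writes go to a temporary file until commit() is called.
        self.tmp_path = os.path.join(self.out_dir, "current_chunk_incomplete")
        self.fh = gzip.open(self.tmp_path, "wt", encoding="utf-8")

    def add_data(self, text, meta=None):
        # One JSON object per line (JSON Lines).
        self.fh.write(json.dumps({"text": text, "meta": meta or {}}) + "\n")

    def commit(self):
        # Close the temporary file, rename it to its final name,
        # then open a fresh temporary file for the next chunk.
        self.fh.close()
        final_path = os.path.join(self.out_dir, f"data_{self.index}.jsonl.gz")
        os.rename(self.tmp_path, final_path)
        self.index += 1
        self._open_chunk()
        return final_path

    def close(self):
        # Closes the handle; an empty placeholder file may remain on disk.
        self.fh.close()


def write_in_chunks(docs, out_dir, docs_per_chunk):
    """Commit every `docs_per_chunk` documents and once more at the end."""
    writer = ChunkedWriter(out_dir)
    finished = []
    pending = 0
    for doc in docs:
        writer.add_data(doc)
        pending += 1
        if pending == docs_per_chunk:
            finished.append(writer.commit())
            pending = 0
    if pending:
        # Without this final commit the tail would stay in the incomplete file.
        finished.append(writer.commit())
    writer.close()
    return finished


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    for path in write_in_chunks(["a", "b", "c", "d", "e"], out_dir, 2):
        print(path)
```

With lm_dataformat itself, the equivalent would presumably be to call ar.commit() periodically inside the loop and once more after it finishes. I have not checked this against the version discussed here, so treat it as an assumption.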