Andrew Ho
@bratao totally understand, thank you, this is enough to go on, will report back soon :) thanks!
so a small update: I copy/pasted the sample files until I had around 1.6M lines of JSONL, which takes around 12 minutes (estimated) to load on my machine with the...
@bratao got it, so the end goal is training. In that case I suggest we go with a streaming packing model, then you won't wait for the job to...
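(For anyone following along: a minimal sketch of the streaming-packing idea mentioned above. This is not torchtune's or torchdata's actual implementation; the function name and shape are purely illustrative. The point is that packs are yielded as sequences stream in, so training can start immediately instead of waiting for an offline packing pass over the whole dataset.)

```python
from typing import Iterable, Iterator, List

def stream_pack(token_streams: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Greedily pack token sequences into fixed-length buffers on the fly.

    Yields each pack as soon as it fills, rather than materializing the
    whole packed dataset up front.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        for tok in tokens:
            buffer.append(tok)
            if len(buffer) == max_seq_len:
                yield buffer
                buffer = []
    if buffer:  # flush the final partial pack
        yield buffer

# Example: three short "documents" packed into length-4 buffers
packs = list(stream_pack([[1, 2, 3], [4, 5], [6]], max_seq_len=4))
# packs == [[1, 2, 3, 4], [5, 6]]
```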
@joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of a streaming packer (could be in torchdata nodes), and then set up some...
@bratao just to set expectations, I'll be out for Christmas and new years, and we'll get going on this in January, hope that's alright!
@bratao after some digging, that's because torchdata hasn't been integrated into this recipe yet, so those settings would have no effect
@bratao happy new year! I put a small demo together of the most straightforward solution: https://github.com/pytorch/torchtune/compare/main...andrewkho:torchtune:andrewkh/parallel-packer?expand=1 You'll need torchdata's nightly build to test this: `pip install --index-url https://download.pytorch.org/whl/nightly/cpu torchdata` I...
@bratao sorry for the delay in response Bruno! I believe this will likely ship, but not sure exactly which version. Someone will be back to you soon, cc @ebsmothers @divyanshk
Thanks for the feature request @rravu3 , but unfortunately we will be deprecating and then deleting DataPipes/DataLoaderV2; please see this issue: [Future of torchdata and dataloading](https://github.com/pytorch/data/issues/1196)
Hi @AshwinSankar17 thanks for the report! Your stack trace is pretty mangled and hard to read, but I managed to guess at what's going on: it looks like this is...