Andrew Ho
@bratao totally understand, thank you, this is enough to go on, will report back soon :) thanks!
so a small update: I copy/pasted the sample files until I had around 1.6M lines of JSONL, which takes around 12 minutes (estimated) to load on my machine with the...
@bratao got it, so the end goal is training. In that case I suggest we go with a streaming packing model, then you won't wait for the job to...
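(For anyone following along: a minimal sketch of the streaming-packing idea mentioned above. This is not torchtune's or torchdata's actual implementation; the function name and shape are purely illustrative. The point is that packs are yielded as sequences stream in, so training can start immediately instead of waiting for an offline packing pass over the whole dataset.)

```python
from typing import Iterable, Iterator, List

def stream_pack(token_streams: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Greedily pack token sequences into fixed-length buffers on the fly.

    Yields each pack as soon as it fills, rather than materializing the
    whole packed dataset up front.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        for tok in tokens:
            buffer.append(tok)
            if len(buffer) == max_seq_len:
                yield buffer
                buffer = []
    if buffer:  # flush the final partial pack
        yield buffer

# Example: three short "documents" packed into length-4 buffers
packs = list(stream_pack([[1, 2, 3], [4, 5], [6]], max_seq_len=4))
# packs == [[1, 2, 3, 4], [5, 6]]
```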
@joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of a streaming packer (could be in torchdata nodes), and then set up some...
@bratao just to set expectations, I'll be out for Christmas and new years, and we'll get going on this in January, hope that's alright!
@bratao after some digging, that's because torchdata hasn't been integrated into this recipe yet, so those settings would have no effect
@bratao happy new year! I put a small demo together of the most straightforward solution: https://github.com/pytorch/torchtune/compare/main...andrewkho:torchtune:andrewkh/parallel-packer?expand=1 You'll need torchdata's nightly build to test this: `pip install --index-url https://download.pytorch.org/whl/nightly/cpu torchdata` I...
@bratao sorry for the delay in response Bruno! I believe this will likely ship, but not sure exactly which version. Someone will be back to you soon, cc @ebsmothers @divyanshk
Thanks for the feature request @rravu3 , but unfortunately we will be deprecating and then deleting DataPipes/DataLoaderV2; please see this issue: [Future of torchdata and dataloading](https://github.com/pytorch/data/issues/1196)
Hi @AshwinSankar17 thanks for the report! Your stack trace is pretty mangled and hard to read, but I managed to guess at what's going on: it looks like this is...