
[QUESTION] Splitting large documents and bucketing

Open shafiqabedin opened this issue 6 months ago • 0 comments

I asked this question in the discussion section but did not receive a response, so I am asking here with a few more details. I am trying to figure out whether bucketization is done (or can be done) for model training. By "bucketization" I mean batching samples of similar sequence length (https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator). The motivation is that I have documents with very long text, and I want to pick a splitting scheme (which may create a lot of samples with a small number of tokens) and then bucketize those samples. That brings me to my second question: is document splitting supported in Megatron?
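To make the idea concrete, here is a minimal sketch in plain Python of the kind of splitting plus bucketing I have in mind. It is not Megatron code: `split_document`, `bucket_by_length`, `make_batches`, and the `max_len`, `bucket_width`, and `batch_size` parameters are all hypothetical names used only for illustration, and it assumes documents are already tokenized into lists of token ids.

```python
from typing import Dict, List


def split_document(tokens: List[int], max_len: int) -> List[List[int]]:
    """Split one tokenized document into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def bucket_by_length(samples: List[List[int]], bucket_width: int) -> Dict[int, List[List[int]]]:
    """Group non-empty samples by length range.

    Bucket k holds samples whose length lies in
    [k * bucket_width + 1, (k + 1) * bucket_width].
    """
    if bucket_width <= 0:
        raise ValueError("bucket_width must be positive")
    buckets: Dict[int, List[List[int]]] = {}
    for sample in samples:
        if not sample:
            continue  # skip empty samples, they carry no tokens
        key = (len(sample) - 1) // bucket_width
        buckets.setdefault(key, []).append(sample)
    return buckets


def make_batches(samples: List[List[int]], bucket_width: int, batch_size: int) -> List[List[List[int]]]:
    """Form batches so that every batch only contains samples from a single bucket."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buckets = bucket_by_length(samples, bucket_width)
    batches: List[List[List[int]]] = []
    for key in sorted(buckets):
        bucket = buckets[key]
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    return batches


if __name__ == "__main__":
    # Three toy "documents" of 10, 3, and 25 token ids.
    docs = [list(range(10)), list(range(3)), list(range(25))]
    samples = [chunk for doc in docs for chunk in split_document(doc, max_len=8)]
    for batch in make_batches(samples, bucket_width=4, batch_size=2):
        print([len(sample) for sample in batch])
```

With this kind of grouping, short leftover chunks end up batched together instead of being padded up to the length of the longest chunk. In real training the batches would presumably also be shuffled, which this sketch leaves out.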

Any answer would be much appreciated - Thanks.

shafiqabedin, Aug 07 '24 18:08