Megatron-LM
[QUESTION] Splitting large documents and bucketing
I asked this question in the discussion section but did not receive any response, so I am asking here with a bit more detail. I am trying to figure out whether bucketization is done (or can be done) for model training. By "bucketization" I mean batching samples of similar sequence length together (https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator). The motivation is that I have documents with very long text, and I want to pick a splitting scheme (which may create many samples with a small number of tokens) and then bucketize the resulting samples. That brings me to my second question: is document splitting supported in Megatron?
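To make the question concrete, here is a minimal sketch in plain Python of the kind of splitting and length-bucketing I have in mind. This is not Megatron-LM's API; all function names here are hypothetical and just illustrate the idea:

```python
def split_document(tokens, max_len):
    """Split one long token sequence into chunks of at most max_len tokens.

    The last chunk may be shorter than max_len, which is exactly what
    produces the short samples I would then want to bucketize.
    """
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def bucket_batches(samples, batch_size, bucket_width=16):
    """Group token sequences into batches of similar length.

    Sequences whose lengths fall in the same bucket (multiples of
    bucket_width) are batched together, so padding waste inside a
    batch is bounded by bucket_width - 1 tokens per sequence.
    """
    buckets = {}
    for sample in samples:
        key = len(sample) // bucket_width
        buckets.setdefault(key, []).append(sample)

    batches = []
    for key in sorted(buckets):
        group = buckets[key]
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches
```

So the question is whether Megatron's data pipeline has an equivalent of these two steps built in, or whether I would have to preprocess documents this way myself.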
Any answer would be much appreciated - Thanks.