fms-fsdp

tokenization on-the-fly for long documents

Open dangxuanhong opened this issue 1 year ago • 2 comments

Since we may have to deal with very long documents, up to millions of characters or tokens, the dataloader may need to be tested and revised to handle tokenizing such documents on the fly.

One approach worth considering is splitting a long document into chunks, as sketched below.
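
As an illustration, here is a minimal sketch of window-based chunking: the raw text is split into character windows (on whitespace boundaries) and each window is tokenized independently, so no single tokenizer call ever sees millions of characters at once. The `tokenize` callable, the `window_chars` size, and the function name are hypothetical stand-ins, not part of the fms-fsdp dataloader.

```python
from typing import Callable, Iterator, List


def tokenize_long_document(
    text: str,
    tokenize: Callable[[str], List[int]],  # hypothetical tokenizer hook
    window_chars: int = 100_000,           # illustrative window size
) -> Iterator[List[int]]:
    """Tokenize a very long document window by window, so no single
    tokenizer call receives millions of characters at once."""
    start, n = 0, len(text)
    while start < n:
        end = min(start + window_chars, n)
        if end < n:
            # Back up to the last whitespace so a word is never split
            # across two windows (at the cost of slightly uneven windows).
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary
        yield tokenize(text[start:end])
        start = end
```

One design note: backing up to a whitespace boundary keeps tokenization consistent at window edges; a tokenizer given half a word at a chunk boundary can produce different token ids than it would on the intact text.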

dangxuanhong · Jul 31 '24 17:07

The problem is not with long documents; I tried splitting the long documents into chunks.

Removing the SamplingDataSet that is used in multi-dataset handling allows us to bypass the problem.

The SamplingDataSet provides more heterogeneity than iterating through one entire file after another, and we do want document mixing between datasets. Although the SamplingDataSet shouldn't cause every file to open, but rather one from each dataset, it seems to be opening all of the parquet files, causing the node to go out of memory.
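
For reference, here is a minimal sketch of the behavior one would expect: batches are interleaved across datasets by weighted choice, but each dataset keeps at most one parquet file open at a time. The function and argument names are hypothetical, and fms-fsdp's actual SamplingDataSet is more involved; this only illustrates the lazy-open pattern that would avoid the out-of-memory behavior described above.

```python
import random
from typing import Dict, Iterator, List, Tuple

import pyarrow as pa
import pyarrow.parquet as pq


def sample_across_datasets(           # hypothetical name, not the fms-fsdp API
    dataset_files: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int = 1024,
    seed: int = 0,
) -> Iterator[Tuple[str, pa.RecordBatch]]:
    """Interleave record batches from several datasets by weighted choice,
    keeping at most one open parquet file per dataset at any moment."""
    rng = random.Random(seed)

    def stream(paths: List[str]) -> Iterator[pa.RecordBatch]:
        # Generators are lazy: the next file is opened only once the
        # previous one is exhausted, never all shards up front.
        for path in paths:
            for batch in pq.ParquetFile(path).iter_batches(batch_size=batch_size):
                yield batch

    streams = {name: stream(paths) for name, paths in dataset_files.items()}
    live = list(streams)
    while live:
        name = rng.choices(live, weights=[weights[n] for n in live])[0]
        try:
            yield name, next(streams[name])
        except StopIteration:
            live.remove(name)  # this dataset is fully consumed
```

If a sampling dataset instead materializes a handle (or worse, the contents) of every shard at construction time, memory scales with the total number of parquet files rather than with the number of datasets, which matches the symptom reported here.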

thinkahead · Aug 07 '24 14:08

Checking on the status of this: the memory consumption ended up being related to how the legal-file detection was working, IIRC?

daviswer · Aug 29 '24 15:08