FlagEmbedding

Fix "idx" bug in split_data_by_length.py of BGE-M3

Open nntoan209 opened this issue 1 year ago • 0 comments

In split_data_by_length.py in BGE-M3, after the dataset is filtered on the "max_length" field, the "idx" field appears to have changed, so split_dataset = dataset.select(idxs["idx"]) selects the wrong rows. To fix this, I suggest using the real list of indices of the filtered rows, given by list(idxs._indices.to_pandas()['indices'].values). Note that _indices is a private attribute of the filtered dataset.
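As an illustration of the same idea without the private attribute, here is a minimal sketch that computes the row positions directly from the "max_length" values instead of trusting a stored "idx" column. It is not the script's actual code: the helper name positions_in_range, the bounds lo and hi, and the half-open range are all assumptions made for the example.

```python
from typing import Iterable, List


def positions_in_range(max_lengths: Iterable[int], lo: int, hi: int) -> List[int]:
    """Return the row positions whose max_length falls in [lo, hi).

    Hypothetical helper: the half-open range is an assumption for this
    sketch, not necessarily the rule used in split_data_by_length.py.
    """
    return [pos for pos, length in enumerate(max_lengths) if lo <= length < hi]


# Toy stand-in for the dataset's "max_length" column.
max_lengths = [120, 900, 450, 3000, 80]
rows = positions_in_range(max_lengths, 0, 500)
print(rows)  # [0, 2, 4]

# With a Hugging Face datasets.Dataset, the same positions could be passed
# to select (assuming `dataset` has a "max_length" column):
# split_dataset = dataset.select(positions_in_range(dataset["max_length"], lo, hi))
```

Because the positions come from enumerating the column itself, they always refer to rows of the dataset that select is called on, whatever the "idx" column contains.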

nntoan209 • Mar 22 '24 19:03