FlagEmbedding
Fix "idx" bug in split_data_by_length.py of BGE-M3
In split_data_by_length.py in BGE-M3, after the dataset is filtered on the "max_length" field, the "idx" field of the filtered result has somehow changed, so split_dataset = dataset.select(idxs["idx"]) selects the wrong rows.
To fix this, I suggest using the real list of selected indices, obtained with list(idxs._indices.to_pandas()['indices'].values), instead of idxs["idx"].
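
To show why this matters, here is a minimal sketch in plain Python that models the failure without using the datasets library. The names rows, in_range, stored and positions are hypothetical and do not come from split_data_by_length.py. The stored "idx" values are deliberately out of step with the row positions, standing in for the unexplained change described above.

```python
# Toy rows: each row carries a stored "idx" that no longer matches its position.
rows = [
    {"idx": 3, "text": "a", "max_length": 100},
    {"idx": 0, "text": "b", "max_length": 900},
    {"idx": 1, "text": "c", "max_length": 200},
    {"idx": 2, "text": "d", "max_length": 950},
]


def in_range(row, lo=0, hi=500):
    """Hypothetical length filter: keep rows with lo <= max_length < hi."""
    return lo <= row["max_length"] < hi


# Buggy pattern: trust the stored "idx" column of the filtered rows.
stored = [row["idx"] for row in rows if in_range(row)]
wrong = [rows[i]["text"] for i in stored]

# Fixed pattern: use the positions the filter actually kept.
positions = [pos for pos, row in enumerate(rows) if in_range(row)]
right = [rows[i]["text"] for i in positions]

print(wrong)  # ['d', 'b']
print(right)  # ['a', 'c']
```

In the suggested fix, idxs._indices plays the role of positions here. Note that the leading underscore marks _indices as a private attribute by Python convention, so it may change between library versions.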