
[P0] Make `make_last_position_supervised_data_module` parallelizable to speed up processing!

Open truskovskiyk opened this issue 9 months ago • 2 comments

Hey team,

I am having issues with large datasets (~10k samples or more).

Calling the `make_last_position_supervised_data_module` function takes longer than the training itself. The root cause is that the function uses a for loop to process each sample individually: link.

Instead of processing samples individually, we could perform this operation in batch mode. For example, we could use "batch mapping" as described here: Hugging Face Documentation.
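To make the idea concrete, here is a rough sketch of what a batch-mode preprocessing function could look like. It is not the current pyreft implementation: the column names `prompt` and `output`, the `last_position` field, and the `toy_tokenize` helper are all assumptions made for illustration, and a real version would call the model's tokenizer on the whole list of texts instead.

```python
from typing import Callable, Dict, List

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def toy_tokenize(texts: List[str]) -> List[List[int]]:
    """Hypothetical stand-in for a batched tokenizer call.

    Splits on whitespace and uses each word's length as its token id,
    so the sketch runs without any external dependency.
    """
    return [[len(word) for word in text.split()] for text in texts]


def preprocess_batch(
    batch: Dict[str, List[str]],
    tokenize: Callable[[List[str]], List[List[int]]] = toy_tokenize,
) -> Dict[str, List]:
    """Process a whole batch of samples at once (dict of lists in, dict of lists out).

    Assumed columns: "prompt" and "output" (hypothetical names).
    """
    # Two tokenizer calls per batch instead of two per sample.
    prompt_ids = tokenize(batch["prompt"])
    full_ids = tokenize(
        [p + " " + o for p, o in zip(batch["prompt"], batch["output"])]
    )

    labels, last_positions = [], []
    for p_ids, f_ids in zip(prompt_ids, full_ids):
        n_prompt = len(p_ids)
        # Mask the prompt tokens so only the output contributes to the loss.
        labels.append([IGNORE_INDEX] * n_prompt + f_ids[n_prompt:])
        # Index of the last prompt token.
        last_positions.append(n_prompt - 1)

    return {
        "input_ids": full_ids,
        "labels": labels,
        "last_position": last_positions,
    }
```

Because the function takes and returns a dict of lists, it has the shape that Hugging Face `datasets` expects for batch mapping, so something like `dataset.map(preprocess_batch, batched=True, num_proc=4)` should be able to run it in batches and across several processes.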

Could we add an option to perform this operation in batch mode?

I am happy to send a PR with this change.

truskovskiyk, May 13 '24 01:05