[P0] Make `make_last_position_supervised_data_module` parallelizable to speed up processing!
Hey team,
I am running into a performance issue with large datasets (~10k samples or more): calling `make_last_position_supervised_data_module` takes longer than the training itself. The root cause is that the function uses a Python for loop to process each sample individually: link.
Instead of processing samples individually, we could perform this operation in batch mode. For example, we could use "batch mapping" as described here: Hugging Face Documentation.
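As a rough illustration of what I mean, here is a minimal sketch of batched preprocessing with `datasets.Dataset.map(batched=True)`. The field names (`prompt`, `completion`) and the labeling logic are just placeholders, not pyreft's actual implementation:

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

raw = Dataset.from_dict({
    "prompt": ["Q: 2+2?\nA:", "Q: capital of France?\nA:"],
    "completion": [" 4", " Paris"],
})

def preprocess_batch(batch):
    # Tokenize whole batches at once instead of one example per call.
    full_texts = [p + c for p, c in zip(batch["prompt"], batch["completion"])]
    model_inputs = tokenizer(full_texts, truncation=True, max_length=512)
    prompt_ids = tokenizer(batch["prompt"], truncation=True, max_length=512)["input_ids"]

    labels = []
    for input_ids, p_ids in zip(model_inputs["input_ids"], prompt_ids):
        # Supervise only the completion tokens; mask the prompt with -100.
        label = [-100] * len(p_ids) + input_ids[len(p_ids):]
        labels.append(label[: len(input_ids)])
    model_inputs["labels"] = labels
    return model_inputs

# batched=True passes slices of the dataset to preprocess_batch,
# which is typically much faster than looping over rows one at a time.
processed = raw.map(
    preprocess_batch,
    batched=True,
    batch_size=1000,
    remove_columns=raw.column_names,
)
```

The same pattern should apply to whatever per-sample logic the current loop does (including computing the last-position intervention locations), as long as it can be expressed over a batch of examples.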
Could we add an option to perform this operation in batch mode?
I am happy to send a PR with this change.