tokenizers
Parallelize unigram trainer
closes https://github.com/huggingface/tokenizers/issues/933
After following @Narsil's suggestions, this PR builds on #921 by implementing MaybeParallelSlice.
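For readers of the thread, here is a rough, dependency-free sketch of the idea behind MaybeParallelSlice: dispatch between a serial path and a chunked parallel path based on the TOKENIZERS_PARALLELISM flag. The names are hypothetical and std::thread::scope stands in for rayon, which the actual PR uses.

```rust
use std::env;
use std::thread;

/// Hypothetical sketch of `MaybeParallelSlice`: process a slice in chunks,
/// in parallel only when TOKENIZERS_PARALLELISM allows it. The real PR
/// dispatches to rayon; scoped threads stand in here to keep the sketch
/// dependency-free.
fn maybe_parallel_chunk_sum(data: &[u64], chunk_size: usize) -> u64 {
    let parallel = env::var("TOKENIZERS_PARALLELISM")
        .map(|v| v != "false" && v != "0")
        .unwrap_or(true);
    if parallel {
        // Shard the slice across threads, one per chunk, then reduce.
        thread::scope(|s| {
            let handles: Vec<_> = data
                .chunks(chunk_size.max(1))
                .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).sum()
        })
    } else {
        // Serial fallback: identical result, single thread.
        data.iter().sum()
    }
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("{}", maybe_parallel_chunk_sum(&data, 16)); // 5050
}
```

Either path must produce the same result; only the scheduling differs, which is what keeps the business logic decoupled from the parallelization strategy.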
Here is the benchmark info. How many cores? rayon::current_num_threads on my machine gives 12.
I created unigram_benchmark.rs; here are the results of the run:
on main (current serial impl on the main branch):
Unigram Train vocabulary (medium)
time: [6.1652 s 6.2645 s 6.3822 s]

on this PR, with parallelism set to false:
Unigram Train vocabulary (medium)
time: [6.4311 s 6.5944 s 6.7772 s]

on this PR, with parallelism set to true:
Unigram Train vocabulary (medium)
time: [2.1478 s 2.2398 s 2.3247 s]
Hey @mishig25, thanks for this! Have you actually checked the previous implementation, which uses par_chunks?
https://github.com/huggingface/tokenizers/pull/921
I didn't do a deep dive into your implementation yet, but it seems to touch code closer to the core algorithms (populate_marginal, for instance), which is something I feel shouldn't be done lightly, since it couples business logic to the parallelization strategy.
If the previous PR doesn't work, or isn't as efficient, then let's definitely open the room for more custom logic like the one in this PR, but I think we first need to prove that the simple approach doesn't work (or doesn't work well enough).
I think the previous PR has a lot of merit (much lower complexity). It is also the exact same implementation as the original SentencePiece: https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc#L215-L242
All the chunks of text to be trained on are sharded across the cores, every core processes its chunk, and then there's a simple reduce operation. This can be done in both the E and M steps.
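The shard-and-reduce pattern described above can be sketched like this. Everything here is illustrative: toy whitespace token counts stand in for the E-step's expected counts, and scoped threads stand in for rayon's par_chunks.

```rust
use std::collections::HashMap;
use std::thread;

/// Sketch of shard-and-reduce: split the sentences across workers, let each
/// compute partial statistics (toy token counts here, standing in for the
/// E-step's expected counts), then merge the partials with a simple reduce.
fn sharded_counts(sentences: &[&str], num_shards: usize) -> HashMap<String, u64> {
    // Ceiling division so every sentence lands in some shard.
    let chunk = (sentences.len() + num_shards.max(1) - 1) / num_shards.max(1);
    thread::scope(|s| {
        let partials: Vec<_> = sentences
            .chunks(chunk.max(1))
            .map(|shard| {
                s.spawn(move || {
                    let mut m = HashMap::new();
                    for sent in shard {
                        for tok in sent.split_whitespace() {
                            *m.entry(tok.to_string()).or_insert(0u64) += 1;
                        }
                    }
                    m
                })
            })
            .collect();
        // Reduce: merge each shard's partial counts into one map.
        partials.into_iter().fold(HashMap::new(), |mut acc, h| {
            for (k, v) in h.join().unwrap() {
                *acc.entry(k).or_insert(0) += v;
            }
            acc
        })
    })
}

fn main() {
    let counts = sharded_counts(&["a b a", "b c", "a"], 2);
    println!("{} {} {}", counts["a"], counts["b"], counts["c"]); // 3 2 1
}
```

Because the reduce (here, summing counts) is associative, the result is independent of how the sentences are sharded, which is why the serial and parallel paths agree.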
Also, you mention a 2.76x speedup, but on how many cores?
The main caveat with the former approach is that we need to respect TOKENIZERS_PARALLELISM=false to disable parallelism entirely (if a user is handling parallelism elsewhere, we need to make sure we're not competing for threads).
But I think adding maybe_par_chunks, like we added maybe_par_iter, should be relatively straightforward. Wdyt?
@Narsil I applied the suggested changes & updated the description
@Narsil this seems to be the most recent and updated effort to parallelize the unigram trainer. What kind of development or testing is required to move this forward? I would like to contribute.
I'm not sure anything other than rebasing and checking that everything works is needed, actually.
I prepared a commit that simply removes the 'data lifetime to fix the clippy warning, but I could not push due to insufficient permissions. The fix builds and passes all tests.
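For context, this is the kind of change clippy's needless_lifetimes lint asks for. The 'data name is kept for illustration, but the signature here is hypothetical, not the one from the PR:

```rust
// Before: an explicit lifetime that carries no extra information;
// clippy's `needless_lifetimes` lint fires on signatures like this.
fn first_before<'data>(slice: &'data [u32]) -> &'data u32 {
    &slice[0]
}

// After: the elided form is equivalent and keeps clippy quiet.
fn first_after(slice: &[u32]) -> &u32 {
    &slice[0]
}

fn main() {
    let v = [7u32, 8, 9];
    // Both forms compile to the same thing and agree on every input.
    assert_eq!(first_before(&v), first_after(&v));
    println!("{}", first_after(&v)); // 7
}
```

Since lifetime elision is purely syntactic, removing the annotation cannot change behavior, which is why such a fix passes the existing test suite unchanged.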
@chris-ha458 please feel free to open a new PR 😊
I think the current fix requirements are minor enough that anybody, including you @mishig25 or @Narsil, could easily handle them, so it would not merit a whole new PR at this time. (As I've said, merely removing a single lifetime annotation is enough for at least the Rust side to pass with flying colors.)
If the regressions in the smaller benchmarks (which are to be expected in a parallelization effort) seem to warrant more investigation, and the current contributors to this PR are unwilling or unable to investigate further, I will open a PR at that time to consolidate investigation and development efforts and bring this work to fruition.
@Narsil, should I merge it? I haven't done anything on this PR for months.
Personally, I cannot see why it shouldn't be. As far as I can tell, there has not been any material change in or around the unigram model (unigram/model.rs), the unigram trainer (unigram/trainer.rs), or the parallelism module (util/parallelism.rs) that might warrant modifications to this PR.
Rather, I think it would be useful to get it merged as soon as appropriate, since it could lay the base for further optimizations. I was hoping to optimize some earlier steps before, during, or after the suffix array construction, and it would be useful to have this merged so I know what I would be working with.
If any further development or testing needs arise, I hope anybody involved feels free to point them out. Here's hoping this gets merged sooner rather than later.
I don't like merging old stuff without proper checking.
I think training something on main with big.txt, then doing the same with this PR included and verifying that the tokenization is the same, should be enough.
The following zip file contains the tokenizer trained on big.txt on main, the tokenizer trained with this PR, and their diff. It seems the learned vocabulary is the same, while the negative log probabilities differ minutely; the differences are mere rounding errors (usually around 10^-14). I think this is in the realm of floating-point error and would not make any difference in performance.
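A minimal sketch of that comparison: tokens must match exactly, while scores may differ within a tolerance comfortably above rounding noise. The function name, data shape, and tolerance are assumptions for illustration, not the actual verification script.

```rust
/// Hypothetical check: two unigram vocab dumps, given as (token, log-prob)
/// pairs, "match" if the tokens agree exactly and the scores differ by no
/// more than `tol` (chosen well above rounding-level differences).
fn vocabs_match(a: &[(&str, f64)], b: &[(&str, f64)], tol: f64) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|((ta, sa), (tb, sb))| ta == tb && (sa - sb).abs() <= tol)
}

fn main() {
    // Toy vocab entries standing in for the main-branch and PR tokenizers.
    let main_branch = [("the", -3.14159265358979), ("ing", -4.2)];
    let pr_branch = [("the", -3.14159265358980), ("ing", -4.2)];
    println!("{}", vocabs_match(&main_branch, &pr_branch, 1e-10)); // true
}
```

Comparing with a tolerance rather than bitwise equality is the right call here, since parallel reduction can legally reorder floating-point additions.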
Thanks so much for looking into it!!