Parallelize unigram trainer

mishig25 opened this pull request 2 years ago

closes https://github.com/huggingface/tokenizers/issues/933

Parallelize Unigram Trainer

Following @Narsil's suggestions, this PR builds on #921 by implementing MaybeParallelSlice.
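Roughly, the idea is an extension trait over slices that can either fan chunks out through rayon or fall back to a serial loop. The sketch below is illustrative only, with made-up method names and signatures, and is not the actual code of this PR:

```rust
// Illustrative sketch of a "maybe parallel" slice extension (not the PR's code).
use rayon::prelude::*;

pub trait MaybeParallelSlice<T: Sync> {
    /// Process the slice in chunks of `chunk_size`, in parallel if allowed,
    /// and fold the per-chunk results with `reduce`.
    fn maybe_par_chunks_map_reduce<R, F, G>(
        &self,
        parallel: bool,
        chunk_size: usize,
        map: F,
        reduce: G,
    ) -> Option<R>
    where
        R: Send,
        F: Fn(&[T]) -> R + Sync + Send,
        G: Fn(R, R) -> R + Sync + Send;
}

impl<T: Sync> MaybeParallelSlice<T> for [T] {
    fn maybe_par_chunks_map_reduce<R, F, G>(
        &self,
        parallel: bool,
        chunk_size: usize,
        map: F,
        reduce: G,
    ) -> Option<R>
    where
        R: Send,
        F: Fn(&[T]) -> R + Sync + Send,
        G: Fn(R, R) -> R + Sync + Send,
    {
        if parallel {
            // Shard the slice across rayon's thread pool and combine results.
            self.par_chunks(chunk_size).map(&map).reduce_with(&reduce)
        } else {
            // Same logic on a single thread via plain iterators.
            self.chunks(chunk_size).map(&map).reduce(&reduce)
        }
    }
}
```

The single `parallel` flag is the place where something like TOKENIZERS_PARALLELISM would be consulted, so the trainer itself never has to know which execution path was taken.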

Here is the benchmark info:

How many cores?

rayon::current_num_threads() on my machine gives 12.

I created unigram_benchmark.rs; here are the results of the run:

on main (the current serial implementation): Unigram Train vocabulary (medium)
time: [6.1652 s 6.2645 s 6.3822 s]

on this PR with parallelism set to false: Unigram Train vocabulary (medium)
time: [6.4311 s 6.5944 s 6.7772 s]

on this PR with parallelism set to true: Unigram Train vocabulary (medium)
time: [2.1478 s 2.2398 s 2.3247 s]
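For readers unfamiliar with the setup, here is a minimal sketch of how a criterion benchmark like unigram_benchmark.rs can be structured; the workload function is a placeholder, not the real training call, and the actual benchmark file in this PR will differ:

```rust
// Illustrative criterion harness only; `train_unigram_medium` stands in for
// whatever function drives the unigram trainer on the medium corpus.
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder workload; the real benchmark would build the sentence counts
// from the corpus and run the unigram trainer to completion.
fn train_unigram_medium() -> usize {
    (0..1_000_000usize).sum()
}

fn bench_train(c: &mut Criterion) {
    c.bench_function("Unigram Train vocabulary (medium)", |b| {
        b.iter(|| train_unigram_medium())
    });
}

// Multi-second runs need a small sample size to keep total benchmark time sane.
criterion_group! {
    name = benches;
    config = Criterion::default().sample_size(10);
    targets = bench_train
}
criterion_main!(benches);
```

Criterion reports each result as [lower bound, estimate, upper bound] of the confidence interval for the mean run time, which is why every line above shows three values.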

mishig25 avatar Apr 06 '22 16:04 mishig25

Hey @mishig25, thanks for this!

Have you actually checked the previous implementation, which uses par_chunks?

https://github.com/huggingface/tokenizers/pull/921

I haven't done a deep dive into your implementation yet, but it seems to touch core algorithms more directly (populate_marginal, for instance), which shouldn't be done lightly since it couples business logic to the parallelization strategy. If the previous PR doesn't work, or isn't as efficient, then let's definitely open the door to more custom logic like the one in this PR, but I think we first need to show that the simple approach doesn't work (or doesn't work well enough).

I think the previous PR has a lot of merit (much lower complexity). It is also exactly the same implementation as the original SentencePiece: https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc#L215-L242

All the chunks of the training text are sharded across the cores; each core processes its own chunk, and then a simple reduce operation combines the results.

This can be done in both the E and M steps.
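As a generic illustration of that shard-and-reduce pattern with rayon (not the actual trainer code; the per-chunk statistic below is a dummy stand-in for the E-step's expected counts):

```rust
// Generic shard-and-reduce sketch: each core processes a chunk of sentences
// and produces partial counts; a reduce step then sums them.
use rayon::prelude::*;

fn expected_counts(sentences: &[String], vocab_size: usize) -> Vec<f64> {
    // One chunk per available thread (vocab_size is assumed to be non-zero).
    let chunk_size = (sentences.len() / rayon::current_num_threads()).max(1);
    sentences
        .par_chunks(chunk_size)
        .map(|chunk| {
            // Per-chunk partial counts; a real E-step would run the lattice
            // forward-backward here instead of this dummy statistic.
            let mut partial = vec![0.0f64; vocab_size];
            for sentence in chunk {
                partial[sentence.len() % vocab_size] += 1.0;
            }
            partial
        })
        .reduce(
            || vec![0.0f64; vocab_size],
            |mut acc, partial| {
                for (a, p) in acc.iter_mut().zip(partial) {
                    *a += p;
                }
                acc
            },
        )
}
```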

Also, you mention a 2.76x speedup, but on how many cores?

The main caveat with the former approach is that we need to respect TOKENIZERS_PARALLELISM=false to disable parallelism entirely (if a user is handling parallelism elsewhere, we need to make sure we're not competing for threads).

But I think adding maybe_par_chunks, the same way we added maybe_par_iter, should be relatively straightforward. Wdyt?
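For reference, roughly how such an opt-out check could look; the crate's own helper in its parallelism utilities may use a different name and accept different values:

```rust
// Rough sketch of respecting TOKENIZERS_PARALLELISM (illustrative only).
use std::env;

fn parallelism_enabled() -> bool {
    match env::var("TOKENIZERS_PARALLELISM") {
        // Treat common "falsy" spellings as a request to stay serial.
        Ok(value) => !matches!(
            value.to_lowercase().as_str(),
            "" | "0" | "off" | "false" | "f" | "no" | "n"
        ),
        // Unset means we are free to parallelize.
        Err(_) => true,
    }
}

fn main() {
    println!("parallelism enabled: {}", parallelism_enabled());
}
```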

Narsil avatar Apr 11 '22 15:04 Narsil

@Narsil I applied the suggested changes & updated the description

mishig25 avatar May 10 '22 11:05 mishig25

@Narsil this seems to be the most recent and updated effort to parallelize the unigram trainer. What kind of development or testing is required to move this forward? I would like to contribute.

chris-ha458 avatar May 18 '23 10:05 chris-ha458

I'm not sure anything other than rebasing and checking that everything works is needed, actually.

Narsil avatar May 19 '23 08:05 Narsil

I prepared a commit that simply removed the 'data lifetime to fix the clippy warning. I could not push due to insufficient permissions. The fix builds and passes all tests.
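For anyone curious what that class of clippy warning looks like, here is a generic before/after illustration of clippy::needless_lifetimes; this is not the PR's actual code, just the shape of the fix:

```rust
// Generic illustration only: when every reference ties to the same single
// input lifetime, the explicit annotation can be elided.

// Before: explicit lifetime that the elision rules already cover.
fn first_token_explicit<'data>(text: &'data str) -> &'data str {
    text.split_whitespace().next().unwrap_or("")
}

// After: same signature with the lifetime elided; clippy is satisfied.
fn first_token(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}
```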

chris-ha458 avatar May 19 '23 11:05 chris-ha458

@chris-ha458 please feel free to open a new PR 😊

mishig25 avatar May 19 '23 12:05 mishig25

I think the current fix requirements are minor enough that anybody, including you @mishig25 or @Narsil, could easily make them, so it would not merit a whole new PR at this time. (As I've said, merely removing a single lifetime annotation is enough for at least the Rust side to pass with flying colors.)

If the regressions in the smaller benchmarks (which are to be expected in a parallelization effort) seem to warrant more investigation, and the current contributors to this PR are unwilling or unable to investigate further, I will open a PR at that point to consolidate the investigation and development efforts and bring this work to fruition.

chris-ha458 avatar May 19 '23 12:05 chris-ha458

@Narsil should I merge it? I haven't done anything on this PR for months.

mishig25 avatar May 22 '23 09:05 mishig25

Personally, I cannot see why it shouldn't be. As far as I can tell, there has not been any material change in or around the unigram model (unigram/model.rs), the unigram trainer (unigram/trainer.rs), or the parallelism module (util/parallelism.rs) that might warrant modifications to this PR.

Rather, I think it would be useful to get it merged as soon as appropriate, since it could lay the groundwork for further optimizations. I was hoping to optimize some earlier steps before, during, or after the suffix array construction, and it would be useful to have this merged so I know what I will be working with.

If any further development or testing needs arise, I hope anybody involved feels free to point them out. Here's hoping this gets merged sooner rather than later.

chris-ha458 avatar May 22 '23 10:05 chris-ha458

here's hoping this gets merged sooner rather than later

I don't like merging old stuff without proper checking. I think training something on big.txt on main, then doing the same with this PR included and verifying that the tokenization is identical, should be enough.

Narsil avatar May 22 '23 11:05 Narsil

The following zip file contains a tokenizer trained on big.txt on main, a tokenizer trained with this PR, and their diffs. The learned vocabulary is the same, while the negative log probabilities differ minutely; the differences are mere rounding errors (usually on the order of 10^-14). I think this is in the realm of floating-point error and would make no difference in performance.

diffs.zip
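For reference, a minimal sketch of this kind of tolerance comparison (illustrative only; the function and the vocab entries below are made up, not the script used to produce the diff):

```rust
// Compare two (token, score) vocabularies: tokens must match exactly,
// log probabilities must agree within `eps`.
fn vocabs_match(a: &[(String, f64)], b: &[(String, f64)], eps: f64) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|((ta, sa), (tb, sb))| ta == tb && (sa - sb).abs() <= eps)
}

fn main() {
    // Made-up entries: identical token, scores differing by ~1e-14.
    let main_vocab = vec![("▁the".to_string(), -3.14159265358979)];
    let pr_vocab = vec![("▁the".to_string(), -3.14159265358980)];
    // Rounding-level differences pass; a changed token or a large score
    // drift would fail.
    assert!(vocabs_match(&main_vocab, &pr_vocab, 1e-12));
}
```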

chris-ha458 avatar May 22 '23 13:05 chris-ha458

Thanks so much for looking into it!!

Narsil avatar May 22 '23 13:05 Narsil