What does this PR do?

This PR adds a classes to use sudachi and jumanpp with BertJapaneseTokenizer.

As a background, there are traditionally multiple tokenizers in Japanese language processing, and for various reasons one may wish to use a tokenizer other than mecab.(e.g. consistency issues with pre-bert models, or require accurate tokenization results in a particular case, etc.) For this reason, it is common practice in some models to pre-tokenize text before putting it into transformers (like https://huggingface.co/nlp-waseda/roberta-base-japanese#tokenization).

This PR adds a sudachi and jumanpp, popular japanese tokenizers other than mecab, to do all the process in transformers library.

Before submitting

[ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
[x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[x] Did you write any new necessary tests?

Who can review?

Models bert: @LysandreJik Documentation: @sgugger

and thank you @hiroshi-matsuda-rit to check this change before submitting

Sep 15 '22 04:09 r-terada

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Sep 15 '22 04:09 HuggingFaceDocBuilderDev

@sgugger

Thanks for your quick review!

I fixed error exception: 1b6883e multiline test-cases into one line: a329071

Re-tests: not sure why this works right now as I don't think we have a dependency on the two new libs?

Yes, It is strange that tests passed even though I forgot to add libs to the dependency... Anyway, I'd like to add libs, but I'm having trouble writing the dependencies since pyknp is a wrapper that assumes the jumanpp command is installed. Please give me some time to add this change.

Sep 16 '22 12:09 r-terada

For the tests, you will need to rebase on main so we can have them run (I fixed the command launching them yesterday). You should also create decorators require_sudashi and require_pyknp so when the tests are run without those deps uninstalled, all is well.

I'm finishing a cleanup of the file launching those tests this morning. Once it's merged, I can show you how to add installation steps in the custom tokenizers job we run!

Sep 16 '22 12:09 sgugger

Sorry for the late reply, I added require_sudachi and require_jumanpp and rebase main, force-push. It seems to be working properly!

Sep 28 '22 03:09 r-terada

Yes, but the tests are not run since those deps are not installed in the custom tokenizers test job :-) You'll need to add the packages that can be pip-installed directly to the extras["ja"] here and for the packages that require special instructions to install, you will need to add them in this file (follow the same format as the lines before).

Sep 28 '22 13:09 sgugger

Sorry, I misunderstood your comment. I have added commit c889969 and confirmed that the sudachi, jumanpp related tests are "PASSED". Rebase main and force-push again since there was a conflict with the main branch. Sorry for the messy commit history.

Oct 05 '22 10:10 r-terada

It seems there is an issue with your CircleCI permissions, the tests won't run. Could you try refreshing your permissions as shown here?

Oct 05 '22 14:10 sgugger

thanks, I refreshed permissions and it seems to start to run! (And I realized I didn't need additional commit, sorry)

Oct 05 '22 14:10 r-terada

No, you still don't have the tests running.

Oct 05 '22 14:10 sgugger

Despite run_tests_hub succeeded for b8bf0b0, run_tests_hub failed for 58a0eb6 even though it is an empty commit. I think the empty commit may have had a negative impact, so I'm going to rebase these and force-push again.

Oct 05 '22 15:10 r-terada

Those tests are a bit flaky, so don't worry!

Oct 05 '22 15:10 sgugger

transformers
transformers copied to clipboard

Add sudachi and jumanpp tokenizers for bert_japanese

What does this PR do?

Before submitting

Who can review?

transformers transformers copied to clipboard

Add sudachi and jumanpp tokenizers for bert_japanese

What does this PR do?

Before submitting

Who can review?

transformers
transformers copied to clipboard