Panos Kanavos

28 comments by Panos Kanavos

Some more feedback: I updated pyonmttok and OpenNMT-tf and tried to build a new vocab with sentencepiece and `case_markup`. The sp model and the vocab are built, but the user-defined...

I should get a better grasp of it, so I could use your help. First, here is the command:

```bash
onmt-build-vocab --tokenizer_config ../../../Tokenization/lower_tokenization.yml --size 32000 --sentencepiece user_defined_symbols="⦅D01⦆,⦅D02⦆,⦅D03⦆,⦅D04⦆,⦅D05⦆,⦅mrk_case_modifier_C⦆,⦅mrk_case_modifier_L⦆,⦅mrk_case_modifier_U⦆,⦅mrk_case_modifier_M⦆,⦅mrk_case_modifier_N⦆,⦅mrk_begin_case_region_C⦆,⦅mrk_begin_case_region_L⦆,⦅mrk_begin_case_region_U⦆,⦅mrk_begin_case_region_M⦆,⦅mrk_begin_case_region_N⦆,⦅mrk_end_case_region_C⦆,⦅mrk_end_case_region_L⦆,⦅mrk_end_case_region_U⦆,⦅mrk_end_case_region_M⦆,⦅mrk_end_case_region_N⦆" character_coverage=1 input_sentence_size=10000000 num_threads=16...
```
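Since the `user_defined_symbols` value in that command is long and easy to mistype, it could be generated rather than written by hand. The following is only a sketch: the helper name `build_user_defined_symbols` and its `num_custom` parameter are hypothetical, and the symbol names are simply copied from the command above, not taken from any library API.

```python
from itertools import product

# Case-markup placeholder families and case codes, copied from the command above.
CASE_PREFIXES = ("mrk_case_modifier", "mrk_begin_case_region", "mrk_end_case_region")
CASE_CODES = ("C", "L", "U", "M", "N")


def build_user_defined_symbols(num_custom=5):
    """Return a comma-separated symbol list (hypothetical helper, for illustration)."""
    # Custom placeholders: ⦅D01⦆ ... ⦅D05⦆ by default.
    custom = [f"⦅D{i:02d}⦆" for i in range(1, num_custom + 1)]
    # One placeholder per (family, case code) pair, in the same order as the command.
    case = [f"⦅{prefix}_{code}⦆" for prefix, code in product(CASE_PREFIXES, CASE_CODES)]
    return ",".join(custom + case)


if __name__ == "__main__":
    # Paste the printed string as the value of user_defined_symbols.
    print(build_user_defined_symbols())
```

The printed string can be pasted as the value of `user_defined_symbols`, which avoids a typo in one of the twenty placeholders.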

Hi @dmar1n, you can use the option `spacer_annotate`, in which case the joiner is the same symbol used by sentencepiece. @guillaumekln, apologies for the naive typo, indeed now...

Thanks for the explanations, @guillaumekln, I see. > But I'm not sure to understand the use case of user-defined symbols with 0 frequency. If they are not in the...

After a few tests, I can confirm that the user-defined symbols must be included in the vocab. Apart from any custom symbols (which can be included in the corpus for...
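A quick way to check whether the user-defined symbols actually made it into the vocab is a small script like the one below. This is a sketch under assumptions: the vocab is taken to be a text file with one token per line (optionally followed by a tab and a score), and the function name `missing_symbols` is hypothetical.

```python
def missing_symbols(vocab_lines, symbols):
    """Return the symbols that do not appear as tokens in the vocab lines.

    Assumption: each line holds one token, optionally followed by a tab and
    other fields (e.g. a score), which are ignored.
    """
    vocab = {line.rstrip("\n").split("\t")[0] for line in vocab_lines}
    return [symbol for symbol in symbols if symbol not in vocab]


if __name__ == "__main__":
    # In-memory stand-in for a vocab file; with a real file, pass the open
    # file object instead, e.g. open("vocab.txt", encoding="utf-8").
    example_vocab = ["<blank>\n", "⦅D01⦆\n", "⦅mrk_case_modifier_C⦆\n", "▁the\n"]
    wanted = ["⦅D01⦆", "⦅D02⦆", "⦅mrk_case_modifier_C⦆"]
    print(missing_symbols(example_vocab, wanted))
```

An empty result means every listed symbol is present in the vocab.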

Well... I was using a lowercased version of my corpus with `onmt-build-vocab` (facepalm). This explains the absence of the case-markup symbols from the vocab but it still doesn't explain the...

> Just to note that when using a pretokenization, input_sentence_size corresponds to a number of words, since the SentencePiece model is trained at the word-level and not the sentence-level. You...

Hello, I had built the wheel successfully on a Mac M1 but with the ICU libs installed with brew. If I recall correctly, I had to define the ICU cppflags...

I'll search for it later tonight and share it here.

@lukasloetkolben, here it is, hope it works. [pyonmttok-1.31.0-cp39-cp39-macosx_11_0_arm64.whl.zip](https://github.com/OpenNMT/Tokenizer/files/8951512/pyonmttok-1.31.0-cp39-cp39-macosx_11_0_arm64.whl.zip)