Guolin Ke

163 comments by Guolin Ke

Thanks for the quick response. Actually, I made `the` a special token as a workaround, since other high-frequency tokens (like `and`) seem OK.

@Narsil Thanks, I just checked, and it seems both `th` and `he` are there. I think it is better to fix the overflow problem, in case we have the larger...

It seems the `u64` branch cannot fix the problem. The error:

```
thread '' panicked at 'Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")', /home/xxx/tokenizers/tokenizers/src/models/bpe/trainer.rs:573:13
note: run with `RUST_BACKTRACE=1` environment variable to...
```

It seems the master branch also fails with the same error. The same code could be run on 0.8.1:

```python
bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
bpe.train(args.inputs)
```

Updated: it seems...

The master branch seems much slower in file reading. 0.8.1:

```
[00:03:13] Reading files (19521 Mo) ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 100
[00:00:17] Tokenize words           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316 / 6593316
[00:00:14] Count pairs              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316...
```

I used this commit https://github.com/huggingface/tokenizers/commit/36832bfa1292b298c36447d7d2f5a3366c26644c, as commits after it cannot run. I also set RAYON_RS_NUM_CPUS=16; maybe the multi-threading performance is down?

The fix looks good to me! Thank you @YuriWu. BTW, should we add a test for it?

Sorry for missing this issue. `max_bin` actually cannot limit the number of bins for categorical features. There are two workarounds: 1) use categorical encodings, converting categorical features to...
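The first workaround above (encoding categorical features as numerical ones) can be sketched in plain Python. This is only an illustration, not code from the issue — the data and the helper name are made up, and in practice a library routine such as pandas' `astype("category").cat.codes` does the same job:

```python
# Minimal sketch of workaround 1: map each distinct category to an
# integer code, so the column can be fed to LightGBM as an ordinary
# numerical feature instead of a high-cardinality categorical one.
# (Data and function name are illustrative, not from the original issue.)

def encode_categories(values):
    """Map each distinct category to a stable integer code (first-seen order)."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # assign the next unused code
        encoded.append(codes[v])
    return encoded, codes

colors = ["red", "blue", "red", "green", "blue"]
encoded, mapping = encode_categories(colors)
# encoded -> [0, 1, 0, 2, 1]
```

One caveat with this encoding: it imposes an arbitrary ordering on the categories, which tree splits can still handle but which differs from LightGBM's native categorical split handling.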

EFB is enabled by default. GOSS is disabled by default; you need to set `boosting=goss` to enable it.
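A minimal parameter sketch for the settings mentioned above. `boosting` and `enable_bundle` follow LightGBM's documented parameter names; the objective and learning rate are arbitrary placeholders, not from the original comment:

```python
# Sketch of LightGBM parameters touching EFB and GOSS.
# EFB (Exclusive Feature Bundling) is controlled by `enable_bundle`
# and is on by default; GOSS must be requested explicitly via `boosting`.
params = {
    "objective": "binary",   # placeholder task
    "boosting": "goss",      # GOSS is disabled by default; this enables it
    "enable_bundle": True,   # EFB; True is already the default
    "learning_rate": 0.1,    # arbitrary placeholder
}

# Typical usage (requires lightgbm; shown as a comment only):
# import lightgbm as lgb
# booster = lgb.train(params, lgb.Dataset(X, label=y))
```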

Refer to https://github.com/ibr11/LightGBM