Guolin Ke

163 comments by Guolin Ke

Thanks for the quick response. Actually, I made `the` a special token as a workaround, since other high-frequency tokens (like `and`) seem OK.

@Narsil Thanks, I just checked, and it seems both `th` and `he` are there. I think it is better to fix the overflow problem, in case we have the larger...

It seems the `u64` branch cannot fix the problem. The error:

```
thread '' panicked at 'Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")', /home/xxx/tokenizers/tokenizers/src/models/bpe/trainer.rs:573:13
note: run with `RUST_BACKTRACE=1` environment variable to...
```

It seems the master branch also fails with the same error. The same code could be run on 0.8.1:

```python
bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
bpe.train(args.inputs)
```

Updated: it seems...

The master branch seems much slower in file reading. 0.8.1:

```
[00:03:13] Reading files (19521 Mo) ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 100
[00:00:17] Tokenize words           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316 / 6593316
[00:00:14] Count pairs              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316...
```

I used this commit https://github.com/huggingface/tokenizers/commit/36832bfa1292b298c36447d7d2f5a3366c26644c, as commits after it cannot run. I also set RAYON_RS_NUM_CPUS=16; maybe the multi-threading performance is down?

The fix looks good to me! Thank you @YuriWu. BTW, should we add a test for it?

Sorry for missing this issue. `max_bin` actually cannot limit the number of bins for categorical features. There are two workarounds: 1) use categorical encodings, converting categorical features to...
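The first workaround above (encoding categorical features as numerical ones) can be sketched in plain Python. This is only an illustration, not code from the issue — the data and the helper name are made up, and in practice a library routine such as pandas' `astype("category").cat.codes` does the same job:

```python
# Minimal sketch of workaround 1: map each distinct category to an
# integer code, so the column can be fed to LightGBM as an ordinary
# numerical feature instead of a high-cardinality categorical one.
# (Data and function name are illustrative, not from the original issue.)

def encode_categories(values):
    """Map each distinct category to a stable integer code (first-seen order)."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # assign the next unused code
        encoded.append(codes[v])
    return encoded, codes

colors = ["red", "blue", "red", "green", "blue"]
encoded, mapping = encode_categories(colors)
# encoded -> [0, 1, 0, 2, 1]
```

One caveat with this encoding: it imposes an arbitrary ordering on the categories, which tree splits can still handle but which differs from LightGBM's native categorical split handling.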

EFB is enabled by default. GOSS is disabled by default; you need to set `boosting=goss` to enable it.
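A minimal parameter sketch for the settings mentioned above. `boosting` and `enable_bundle` follow LightGBM's documented parameter names; the objective and learning rate are arbitrary placeholders, not from the original comment:

```python
# Sketch of LightGBM parameters touching EFB and GOSS.
# EFB (Exclusive Feature Bundling) is controlled by `enable_bundle`
# and is on by default; GOSS must be requested explicitly via `boosting`.
params = {
    "objective": "binary",   # placeholder task
    "boosting": "goss",      # GOSS is disabled by default; this enables it
    "enable_bundle": True,   # EFB; True is already the default
    "learning_rate": 0.1,    # arbitrary placeholder
}

# Typical usage (requires lightgbm; shown as a comment only):
# import lightgbm as lgb
# booster = lgb.train(params, lgb.Dataset(X, label=y))
```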

Refer to https://github.com/ibr11/LightGBM