Peter comments

Results: 216 comments of Peter

Phi

> I can make a proper repository for SafeTensors if needed

That would be better.

Improve documentation and take inspiration from python package

What about a section for people from the Python world? Or anything you found that is missing or unclear in the current document.

Improve documentation and take inspiration from python package

Make sure you add a warning in the doc that `HuggingFaceDatasets.jl` runs Python under the hood.

Load local model

Currently, that is not supported. See #143.

Help with ModernBert tokenizer (BPE+special tokens)

Did you check and compare the result of `Transformers.HuggingFace.load_tokenizer("answerdotai/ModernBERT-base")`?

Help with ModernBert tokenizer (BPE+special tokens)

Could you elaborate more on your attempt and what failed?

Help with ModernBert tokenizer (BPE+special tokens)

`MatchTokenization` should return the exact match without extra space (that is actually something achievable with huggingface/transformers' tokenizer but not implemented here). I couldn't reproduce the issue, given: ```julia-repl _tkr =...

Help with ModernBert tokenizer (BPE+special tokens)

@svilupp The [hardcoded result in the test](https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/examples/verify.jl#L35) has an extra space. Is that expected?

Help with ModernBert tokenizer (BPE+special tokens)

Ah, I see. So it uses exactly the same unimplemented feature I mentioned above. Let me check the spec of that behavior.

Help with ModernBert tokenizer (BPE+special tokens)

So the extra-space part is actually quite simple. You can just replace the special_token string with `Regex(raw"\s*" * Base.wrap_string(special_token, UInt32(0)))`, which should return something like `r"\s*\Q[MASK]\E"`, when you are extracting...
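
A minimal sketch of what that regex does on its own, outside any tokenizer pipeline. The sample sentence and variable names are made up for illustration, and `Base.wrap_string` is an internal, unexported Julia helper, so its behavior may differ between Julia versions:

```julia
# Illustrative special token; any literal string works the same way.
special_token = "[MASK]"

# Base.wrap_string (internal to Base) quotes the string literally as \Q...\E,
# so characters such as '[' and ']' are not treated as regex syntax.
pattern = Regex(raw"\s*" * Base.wrap_string(special_token, UInt32(0)))

# Hypothetical input text, used only to show the match.
text = "The capital of France is [MASK]."
m = match(pattern, text)

# The match swallows the whitespace in front of the special token.
println(repr(pattern.pattern))
println(repr(m.match))
```

Because `\s*` sits in front of the quoted token, the space before `[MASK]` becomes part of the match rather than being left behind in the preceding text.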