Peter
> I can make a proper repository for SafeTensors if needed

That would be better.
What about a section for people coming from the Python world? Or anything else you found missing or unclear in the current document?
Make sure you add a warning in the docs that `HuggingFaceDatasets.jl` runs Python under the hood.
Currently that is not supported. See #143
Did you check and compare the result of `Transformers.HuggingFace.load_tokenizer("answerdotai/ModernBERT-base")`?
Could you elaborate more on your attempt and what failed?
`MatchTokenization` should return the exact match without extra space (that is actually something achievable with huggingface/transformers' tokenizer but not implemented here). I couldn't reproduce the issue, given: ```julia-repl _tkr =...
@svilupp The [hardcoded result in the test](https://github.com/svilupp/ModernBert.jl/blob/01819a6a762eb0d6ff8ca0c63e3d0418b9a48ce9/examples/verify.jl#L35) has an extra space. Is that expected?
Ah, I see. So it uses exactly the same unimplemented feature I mentioned above. Let me check the spec of that behavior.
So the extra-space part is actually quite simple. You can just replace the `special_token` string with `Regex(raw"\s*" * Base.wrap_string(special_token, UInt32(0)))`, which should return something like `r"\s*\Q[MASK]\E"`, when you are extracting...
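For illustration, here is a minimal sketch of the regex idea above. `token_pattern` is a hypothetical helper name, not something in Transformers.jl, and it writes the `\Q...\E` quoting by hand rather than calling the internal `Base.wrap_string`, so it assumes the token itself does not contain `\E`.

```julia
# Hypothetical helper (not part of Transformers.jl): build a pattern that
# matches a special token together with any whitespace directly before it.
function token_pattern(special_token::AbstractString)
    # \Q...\E makes PCRE treat the token literally, so "[" and "]" are not
    # parsed as a character class. Assumes the token does not contain "\E".
    return Regex(raw"\s*" * "\\Q" * special_token * "\\E")
end

pat = token_pattern("[MASK]")

# The space before the token becomes part of the match.
m = match(pat, "The capital of France is [MASK].")
@assert m.match == " [MASK]"

# With no preceding whitespace, the token alone still matches.
m2 = match(pat, "[MASK] is the capital.")
@assert m2.match == "[MASK]"
```

Because the leading whitespace is consumed by the special-token match, it should no longer be left over as separate text for the rest of the tokenizer to handle. How this plugs into `MatchTokenization` is not shown here.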