BertTokenizers

Open source project for BERT Tokenizers in C#.

Results 15 BertTokenizers issues

Relevant code: https://github.com/NMZivkovic/BertTokenizers/blob/150e40a178902bd258d4c9986dc1485c25c404b3/src/Helpers/VocabularyReader.cs#L18 The vocabulary at https://huggingface.co/p208p2002/zh-wiki-punctuation-restore/blob/main/vocab.txt happens to contain `"\u2028"`
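The linked line is not quoted here, so the following is only a sketch under an assumption: that a vocabulary reader filters lines with a whitespace check. In .NET, `"\u2028"` (LINE SEPARATOR) counts as whitespace, so `string.IsNullOrWhiteSpace` would drop such an entry and shift every later token id by one. `VocabularyLines` and `Read` are hypothetical names, not part of BertTokenizers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class VocabularyLines
{
    // Hypothetical reader: keeps every non-empty line, including entries that
    // consist only of Unicode whitespace such as "\u2028".
    public static List<string> Read(TextReader reader)
    {
        var entries = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Compare against the empty string rather than calling
            // string.IsNullOrWhiteSpace, which is true for "\u2028".
            if (line.Length > 0)
            {
                entries.Add(line);
            }
        }
        return entries;
    }
}
```

With this check, a vocabulary entry made up of a single `"\u2028"` keeps its position, so token ids stay aligned with the line numbers of `vocab.txt`.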

I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET. 1. `tokenizer.Encode` should stop when `sequenceLength` is reached instead...

The following code:

```csharp
var res = vocabulary.Tokenize("point™");
```

never returns if `™` cannot be matched in the vocabulary. The issue was introduced in this commit: https://github.com/NMZivkovic/BertTokenizers/commit/0f29cefd5bcdc3dcfdea5b9d0133ccbc1d0d5023#diff-82215a359c504385d48356d59d6635f3b968278cca935c73977e16cea13f4174 Specifically this line:...
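For illustration, here is a minimal sketch of a greedy longest-match WordPiece split that always terminates. It is not the project's code: `WordPieceSketch`, `TokenizeWord`, and the `"[UNK]"` default are assumptions made for this example. The point is that each pass either consumes at least one character or gives up on the whole word, so an unmatched character such as `™` cannot cause an endless loop.

```csharp
using System;
using System.Collections.Generic;

public static class WordPieceSketch
{
    // Hypothetical helper: splits one word into WordPiece tokens using
    // greedy longest-match-first lookup against the vocabulary.
    public static List<string> TokenizeWord(
        string word, ISet<string> vocabulary, string unknownToken = "[UNK]")
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            bool found = false;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    candidate = "##" + candidate; // continuation-piece prefix
                }
                if (vocabulary.Contains(candidate))
                {
                    pieces.Add(candidate);
                    found = true;
                    break;
                }
                end--;
            }
            if (!found)
            {
                // No prefix matched: map the whole word to the unknown token
                // instead of retrying the same position forever.
                return new List<string> { unknownToken };
            }
            start = end; // end > start here, so the outer loop always advances
        }
        return pieces;
    }
}
```

With a vocabulary containing only `point` and `##s`, `TokenizeWord("points", vocab)` yields `point`, `##s`, while `TokenizeWord("point™", vocab)` yields a single `[UNK]`.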

The package can already easily support additional .NET versions. Please accept this commit so that we can reference the project as-is instead of creating a fork for this purpose....

Inside `BertUncasedCustomVocabulary.cs`, I would expect to see the declaration of a class extending `UncasedTokenizer`; however, the class has the following signature: `public class BertCasedCustomVocabulary : CasedTokenizer`. Something similar happens inside...

Fixes https://github.com/NMZivkovic/BertTokenizers/issues/24

This causes an infinite loop:

```cs
var _tokenizer = new BertUncasedBaseTokenizer();
var sentence = "Linux\nline\nendings";
var tokens = _tokenizer.Tokenize(sentence);
```

The problem is that the `TokenizeSentence` method doesn't have '\n'...

The Bert*Tokenizer classes look in `"./Vocabularies"`, which is resolved against the current working directory: usually the source folder, or at any rate wherever you ran the program _from_, rather than where the application is actually located. That...
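One possible way around this, sketched under assumptions: build the vocabulary path from `AppContext.BaseDirectory` (the directory the application's assemblies are loaded from) instead of relying on the working directory. `VocabularyPaths`, `Resolve`, and the file name `vocab.txt` are hypothetical and not taken from the project.

```csharp
using System;
using System.IO;

public static class VocabularyPaths
{
    // Hypothetical helper: resolves a vocabulary file relative to the
    // application's base directory, not the current working directory.
    public static string Resolve(string fileName)
    {
        return Path.Combine(AppContext.BaseDirectory, "Vocabularies", fileName);
    }
}

// Usage (the file name is a placeholder):
// string path = VocabularyPaths.Resolve("vocab.txt");
```

This only helps if the `Vocabularies` folder is copied to the build output, which is an assumption about how the package is deployed.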