BertTokenizers issues

https://github.com/NMZivkovic/BertTokenizers/blob/150e40a178902bd258d4c9986dc1485c25c404b3/src/Helpers/VocabularyReader.cs#L18 https://huggingface.co/p208p2002/zh-wiki-punctuation-restore/blob/main/vocab.txt happens to contain `"\u2028"`

wegylexy

This does not match behavior of Huggingface's Python version

9

I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET. 1. tokenizer.Encode should stop when sequenceLength is reached instead...

gevorgter

Word piece tokenizer never exits if a sub-word token doesn't exist

1

The following code:

```csharp
var res = vocabulary.Tokenize("point™");
```

never returns if `™` cannot be matched in the vocabulary. The issue was introduced in this commit: https://github.com/NMZivkovic/BertTokenizers/commit/0f29cefd5bcdc3dcfdea5b9d0133ccbc1d0d5023#diff-82215a359c504385d48356d59d6635f3b968278cca935c73977e16cea13f4174

Specifically this line:...

matteocontrini
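For context on the word piece issue above, here is a minimal sketch of a greedy longest-match loop that always terminates. It is not the code from BertTokenizers: `WordPieceSketch`, its `Tokenize` signature, and the `[UNK]` default are hypothetical names chosen for illustration. The fallback mirrors what Hugging Face's Python WordPiece tokenizer does, which is to replace the whole word with the unknown token when any piece cannot be matched.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the BertTokenizers package.
public static class WordPieceSketch
{
    public static List<string> Tokenize(string word, ISet<string> vocabulary, string unknownToken = "[UNK]")
    {
        var pieces = new List<string>();
        int start = 0;

        while (start < word.Length)
        {
            int end = word.Length;
            bool found = false;

            // Try the longest remaining substring first, then shrink it.
            while (end > start)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    candidate = "##" + candidate; // continuation pieces carry the "##" prefix
                }

                if (vocabulary.Contains(candidate))
                {
                    pieces.Add(candidate);
                    found = true;
                    break;
                }

                end--;
            }

            if (!found)
            {
                // Nothing in the vocabulary matches at this position.
                // Returning here is what prevents the loop from spinning forever.
                return new List<string> { unknownToken };
            }

            start = end; // always advances by at least one character
        }

        return pieces;
    }
}
```

With a toy vocabulary of `point` and `##s`, `WordPieceSketch.Tokenize("points", vocab)` yields `point`, `##s`, while `WordPieceSketch.Tokenize("point™", vocab)` yields a single `[UNK]` instead of hanging. The key property is that every pass through the outer loop either advances `start` or returns.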

support .net462

1

The package can already easily support additional .net versions. Please accept this commit, so we'll be able to reference the project as-is instead of creating a fork for this purpose....

amitportnoy

Fix wrong naming #22

2

See issue #22

tsepton

Custom vocabulary classes naming error

Inside `BertUncasedCustomVocabulary.cs`, I would expect to see the declaration for a class extending `UncasedTokenizer`; however, the class has the following signature: `public class BertCasedCustomVocabulary : CasedTokenizer`. Something similar happens inside...

tsepton

CI/CD Pipeline - Automatically publishing NuGet Package

Wrong vocabulary index after white space

This does not match behavior of Huggingface's Python version

Word piece tokenizer never exits if a sub-word token doesn't exist

support .net462

Fix wrong naming #22

Custom vocabulary classes naming error

Fixing tokenizers to correctly handle linux line endings (\n)

Strings with linux line endings break the tokenizer

Looks for Vocabularies in source dir instead of (e.g.) bin/release/net6/
