Nicolas Patry comments

Results: 977 comments of Nicolas Patry

tokenizer.save_vocabulary()

Hi @kkavyashankar0009, sorry, but this contains just an extract of your code and I can't reproduce the problem: it has many missing bits and many things totally unrelated to the...

tokenizer.save_vocabulary()

Do you mind sharing what the issue was? It could help future readers.

Include return type annotation in `Encoding` class properties?

@willfrey Thanks for the info. Currently we cannot include type annotations because the source also supports `signature(fn)` and `help(fn)` (in notebooks, REPLs) and those don't work properly with type annotations. Also...

Incorrect offsets after replace with special token

Entirely correct! I haven't pinpointed the issue yet, but it seems to just output the offsets of the last digit regardless of how many digits there are in the...

Count number of tokens tokenizer might produce without really tokenizing?

Sorry, but no: there's no fast way to know unless you do the full tokenization. Mileage may vary, and on specific tokenizers you could go faster than this lib because...
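To illustrate why the count is only known after tokenizing, here is a minimal sketch in plain Python. It is a toy: `greedy_tokenize`, `count_tokens`, and `TOY_VOCAB` are hypothetical names made up for this example, and the longest-match rule is an assumption, not the algorithm used by the library discussed here. Two inputs of the same length yield different counts, so length alone cannot predict the result.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match-first tokenizer (illustration only, not a real library's algorithm)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def count_tokens(text, vocab):
    # The count is obtained by running the full tokenization and measuring the result.
    return len(greedy_tokenize(text, vocab))


# Hypothetical vocabulary, for illustration only.
TOY_VOCAB = {"token", "izer", "count", "ing"}

print(count_tokens("tokenizer", TOY_VOCAB))  # "token" + "izer"
print(count_tokens("tokenizxr", TOY_VOCAB))  # "token" + four single characters
```

Both inputs are nine characters long, yet the first splits into two tokens and the second into five, because the split depends on which substrings the vocabulary happens to contain.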

Count number of tokens tokenizer might produce without really tokenizing?

> I'm working on a task to compare function disassembly from binary files, maximum token length of each function is set to 512, but for functions larger than 512, I...

char_to_token is broken when is_split_into_words is set to True

Hi @zorikg, I will look into this in a bit more detail, but is there any reason you presplit your input here? It seems like `tokenizer(s)` should do exactly what...

char_to_token is broken when is_split_into_words is set to True

OK, I looked into it, and it seems you just need to actually send `sequence_index` to your `char_to_token` function. ```python for sequence_index, split in enumerate(s.split(" ")): for char_index, c in...

Addition of CONTRIBUTING.md to Repository

Hi @beneyal, there's nothing planned at the moment, but contributions for one are very welcome! :) The main thing would be adding and adhering to the CODE_OF_CONDUCT...

Addition of CONTRIBUTING.md to Repository

Hi @msaroufim, I went ahead and created a milestone to group the various work that we need to get done at some point in the near future (as it will...