Stephan Tulkens

Results: 28 comments by Stephan Tulkens

I'll open a PR to take a stab at it, and will let you know!

Hey @seden, out of interest, what is the reduction in loading time you get when moving from .bin to .vec using Gensim? I was under the impression that loading from...
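
For reference, this is roughly how I'd measure the difference with Gensim (a sketch; the file paths are placeholders, and `load_facebook_vectors` is Gensim's loader for native fastText `.bin` files):

```python
import time

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# Load the native fastText binary (includes subword information).
start = time.perf_counter()
bin_vectors = load_facebook_vectors("model.bin")  # placeholder path
print(f".bin load time: {time.perf_counter() - start:.2f}s")

# Load the plain-text .vec file (word vectors only, no subwords).
start = time.perf_counter()
vec_vectors = KeyedVectors.load_word2vec_format("model.vec")  # placeholder path
print(f".vec load time: {time.perf_counter() - start:.2f}s")
```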

Hey @MosheWasserb, thanks for replying, really appreciated. Before I submit a PR, could we maybe discuss what you want the final conclusion of the article to look like? Because...

Hi, thanks for the quick response. I'll check out the code! The model namespace mentions that it was trained using `data/wiki_100.vec`, but is this correct? I'm assuming that this is...

Hi @4722794, the empirically most valid strategy would actually be to pick the label that is most frequent in your training data in case of a tie, instead of a...
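
As a minimal sketch of that tie-breaking rule (the function and variable names here are illustrative, not from any particular codebase):

```python
from collections import Counter

def break_tie(tied_labels: list[str], train_labels: list[str]) -> str:
    """Among labels that tied, pick the one most frequent in the training data."""
    freq = Counter(train_labels)
    return max(tied_labels, key=lambda label: freq[label])

# Example: "positive" and "negative" tie, but "positive" dominates the training data.
print(break_tie(["positive", "negative"], ["positive"] * 60 + ["negative"] * 40))
# -> "positive"
```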

Hey @Jester6136, yep, this is because of the high threshold. My apologies, this is rather inefficient. @Pringled contributed a fix, which is now in a PR; see #73. ...

I just merged #73. Could you try the new function? It just returns indices, so that should be a lot faster.
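
I don't have the exact signature in front of me, but the idea behind returning indices is roughly the following (a NumPy sketch, assuming a vector of similarity scores; the function name is made up):

```python
import numpy as np

def indices_above_threshold(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Return only the indices of items whose score exceeds the threshold,
    instead of materializing the matching items themselves."""
    return np.flatnonzero(scores > threshold)

scores = np.array([0.2, 0.95, 0.7, 0.99])
print(indices_above_threshold(scores, threshold=0.9))  # -> [1 3]
```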

Hello, I don't think there is a neat solution for this particular issue, but you can bypass it by using `pre_tokenized=True` in the call to `encode`, and just pretokenizing beforehand....
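
Roughly like this (a sketch: the whitespace pretokenizer and the `model` object are placeholders, and I'm taking the `pre_tokenized=True` keyword from the description above, so check it against the actual `encode` signature):

```python
texts = ["a first sentence", "a second sentence"]

# Pretokenize beforehand; a naive whitespace split stands in for
# whatever pretokenizer fits your data.
pretokenized = [text.split() for text in texts]

# Then pass the token lists to `encode`, as described above:
# embeddings = model.encode(pretokenized, pre_tokenized=True)
```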

In the final line, you push `list[list[str]]`, but I think the iterator expects `list[str]`. I think you can achieve what you want by letting `tokenize_function` return a string: `" ".join([t.text...
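
Something like the following, assuming spaCy-style tokens with a `.text` attribute (the snippet above is truncated, so this completes it with one plausible reading; the model name is an assumption):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model name

def tokenize_function(text: str) -> str:
    # Return a single whitespace-joined string per example (so the batch is
    # list[str]), rather than a list of tokens per example (list[list[str]]).
    return " ".join([t.text for t in nlp(text)])

print(tokenize_function("Hello, world!"))  # -> "Hello , world !"
```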

Hey @ArthurZucker, thanks for your response! I'm using the pure `tokenizers` API. However, I am using a WordPiece tokenizer (actually just the `baai/bge-base-en-v1.5` tokenizer, which AFAIK is just the...
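
For completeness, loading that tokenizer with the pure `tokenizers` API looks like this (the repo id is copied verbatim from above):

```python
from tokenizers import Tokenizer

# Load the WordPiece tokenizer mentioned above directly from the Hub.
tokenizer = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")

encoding = tokenizer.encode("a small test sentence")
print(encoding.tokens)
```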