Stephan Tulkens comments

Results: 28 comments by Stephan Tulkens

Support for regex in Replace normalization function

Ok, that's great! Thanks for the quick reply. Is this mentioned in the docs somewhere?

How do I run this code?

Hi, the dataset can be downloaded here: https://github.com/ruidan/Unsupervised-Aspect-Extraction . We trained the embeddings on the SemEval 2014 and 2015 corpora, which you can download here: http://alt.qcri.org/semeval2014/task4/ and here: http://alt.qcri.org/semeval2015/task12/

How do I run this code?

Hi! This is just an example, there is no file called "my_data.conllu". In order to work with the pipeline, the code needs to have data in CoNLL-U format (see here:...
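
To illustrate what data in CoNLL-U format looks like, here is a minimal sketch of a reader, assuming the standard ten tab-separated columns of the format. The function name `parse_conllu`, the `FIELDS` list, and the sample sentence are hypothetical and not part of the pipeline discussed in the issue; multiword-token and empty-node lines are not treated specially.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines (e.g. "# text = ...") are skipped in this sketch.
            continue
        else:
            values = line.split("\t")
            if len(values) != len(FIELDS):
                raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
            current.append(dict(zip(FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sample, built with explicit tabs for clarity.
SAMPLE = "\n".join([
    "# text = Dogs bark",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
```

Each token becomes a dictionary keyed by column name, so `sentences[0][0]["form"]` gives the first word of the first sentence.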

ERROR: Failed building wheel for tokenizers

M1 user here. I got the same error; installing the Rust compiler fixed it for me.

ERROR: Failed building wheel for tokenizers

@alibrahimzada I installed it with Homebrew.

Docs say you can pass token ids to `.encode()`, but it throws an exception when you do

This is the wrong repository for this issue. The tokenizers in the transformers package are not the same as the tokenizers in this repository, although they are very similar.

Training script

For those interested: I created a Python script that trains a sentencepiece model on a training corpus, segments the corpus with it, and then trains BPE embeddings. The end result...

Turn print statements into logging

Thanks for the response, I'll wait! If you want, you can ping me when this can be started.

Turn print statements into logging

@DivyanshVinayak23 Sure, I totally forgot to pick this up. (@matsui528 my apologies 🙏 )

Why are the models fine-tuned with CosineSimilarity between 0 and 1?

To chime in here: As mentioned, I think it is important to realize that for cosine similarity, 0 means orthogonal, while -1 means opposite. In particular, for every normalized...
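
The distinction between 0 (orthogonal) and -1 (opposite) can be checked with a small sketch in plain Python. The function `cosine_similarity` and the example vectors are illustrative assumptions, not code from the library discussed in the issue.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


x = [1.0, 0.0]

same_direction = cosine_similarity(x, [2.0, 0.0])   # parallel vectors
orthogonal = cosine_similarity(x, [0.0, 3.0])       # at a right angle
opposite = cosine_similarity(x, [-1.0, 0.0])        # pointing the other way

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score 1, vectors at a right angle score 0, and vectors pointing in opposite directions score -1, so a label of 0 says "unrelated" rather than "opposite".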