alvations

Results 155 comments of alvations

I had the same question too when I tried to play around with the dataset. The empty lines just mean that those pairs were not annotated by the STS organizers. Not all...

To slurp the STS data into an [`sframe` dataframe](https://github.com/dato-code/SFrame), I usually do this:

```python
import sframe

# Reads STS2012-2015 dataset.
sts_train = sframe.SFrame.read_csv('sts.csv', delimiter='\t',
                                   column_type_hints=[str, str, float, str, str],...
```

Take a look at issue #1 and the google group link. The blank score just means that the organizers didn't collect the annotations for the particular sentence pair.
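Since a blank score only means the pair was never annotated, a common preprocessing step is to drop those rows before using the data. A minimal sketch, assuming the tab-separated layout from the `sframe` snippet above (gold score in the third column); the function name `annotated_pairs` is hypothetical:

```python
import csv
import io

# Yield only STS rows whose score field is non-blank, i.e. pairs the
# organizers actually annotated. Assumes tab-separated rows with the
# gold score in the third column (hypothetical layout).
def annotated_pairs(lines):
    for row in csv.reader(lines, delimiter='\t'):
        if len(row) > 2 and row[2].strip():
            yield row

sample = io.StringIO('genre\tfile\t4.5\tsent1\tsent2\n'
                     'genre\tfile\t\tsent1\tsent2\n')
print(len(list(annotated_pairs(sample))))  # → 1 (the blank-score pair is skipped)
```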

@goodmami @tteofili @icedwater Thank you for accepting the invitation! @flammie Tommi! Definitely glad to have you join us for the workshop! Sending you the invitation email =)

@jnothman @amittai @shamilcm @vene @shicode @geetickachauhan @alinepaes @ivanvladimir @svakulenk0 Thank you for accepting the invitation!! @ggdupont Thank you for the points on the tagline and your interest! We've sent you...

Thank you again for the confirmation @bethard @dennybritz @almoslmi @ggdupont @mhagiwara @yvespeirsman! Sorry about the confusion; starring the repo worked for the first edition, but after that we overlooked that...

Thanks Lucy!! Let me try to translate that with my horrible Korean and with the help of MT =)

```
Are you using a lot of open source while developing...
```

Expected behavior for zh and ko:

```
$ echo "记者 应谦 美国" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国
$ echo...
```
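The expectation above is that already-segmented CJK input passes through the tokenizer unchanged. A rough, hedged way to spot CJK tokens (so you can check that they survive tokenization intact) is via their Unicode character names; the helper `is_cjk` below is illustrative, not part of any library:

```python
import unicodedata

# Heuristic: a token is "CJK" if every character's Unicode name
# mentions CJK (e.g. 'CJK UNIFIED IDEOGRAPH-8BB0' for 记).
def is_cjk(token):
    return all('CJK' in unicodedata.name(ch, '') for ch in token)

line = '记者 应谦 美国'
print(all(is_cjk(t) for t in line.split()))  # → True
```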

Looks like the `unichars` list and the `perluniprops` list of Alphanumeric characters are a little different. The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420, where non-alphanumeric characters are padded with spaces....
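To see why the definition of "alphanumeric" matters for that padding step, here is a minimal sketch (not sacremoses' actual implementation) of padding non-alphanumeric characters with spaces. Python's `str.isalnum()` treats CJK ideographs as alphanumeric, so they are left untouched while punctuation gets padded; a character class built from a different alphanumeric list would behave differently:

```python
import re

# Pad every character that is neither alphanumeric nor whitespace with
# spaces, then collapse runs of whitespace. CJK ideographs count as
# alphanumeric under str.isalnum(), so they pass through unpadded.
def pad_nonalnum(text):
    out = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            out.append(ch)
        else:
            out.append(' ' + ch + ' ')
    return re.sub(r'\s+', ' ', ''.join(out)).strip()

print(pad_nonalnum('记者,应谦'))  # → 记者 , 应谦
```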