alvations

Results 155 comments of alvations

I had the same question too when I tried to play around with the dataset. The empty lines just mean that those pairs were not annotated by the STS organizers. Not all...

To slurp the STS data into an [`sframe` dataframe](https://github.com/dato-code/SFrame), I usually do this:

```python
import sframe

# Reads STS2012-2015 dataset.
sts_train = sframe.SFrame.read_csv('sts.csv', delimiter='\t',
                                   column_type_hints=[str, str, float, str, str],...
```

Take a look at issue #1 and the google group link. The blank score just means that the organizers didn't collect the annotations for the particular sentence pair.
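Since a blank score only means the pair was never annotated, a common preprocessing step is to drop those rows before using the data. A minimal sketch, assuming the tab-separated layout from the `sframe` snippet above (gold score in the third column); the function name `annotated_pairs` is hypothetical:

```python
import csv
import io

# Yield only STS rows whose score field is non-blank, i.e. pairs the
# organizers actually annotated. Assumes tab-separated rows with the
# gold score in the third column (hypothetical layout).
def annotated_pairs(lines):
    for row in csv.reader(lines, delimiter='\t'):
        if len(row) > 2 and row[2].strip():
            yield row

sample = io.StringIO('genre\tfile\t4.5\tsent1\tsent2\n'
                     'genre\tfile\t\tsent1\tsent2\n')
print(len(list(annotated_pairs(sample))))  # → 1 (the blank-score pair is skipped)
```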

@goodmami @tteofili @icedwater Thank you for accepting the invitation! @flammie Tommi! Definitely glad to have you join us for the workshop! Sending you the invitation email =)

@jnothman @amittai @shamilcm @vene @shicode @geetickachauhan @alinepaes @ivanvladimir @svakulenk0 Thank you for accepting the invitation!! @ggdupont Thank you for the points on the tagline and your interest! We've sent you...

Thank you again for the confirmation @bethard @dennybritz @almoslmi @ggdupont @mhagiwara @yvespeirsman! Sorry about the confusion; starring the repo worked for the first edition, but after that we overlooked that...

Thanks Lucy!! Let me try to translate that with my horrible Korean and with the help of MT =)

```
Are you using a lot of open source while developing...
```

Expected behavior for zh and ko:

```
$ echo "记者 应谦 美国" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国
$ echo...
```
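The expectation above is that already-segmented CJK input passes through the tokenizer unchanged. A rough, hedged way to spot CJK tokens (so you can check that they survive tokenization intact) is via their Unicode character names; the helper `is_cjk` below is illustrative, not part of any library:

```python
import unicodedata

# Heuristic: a token is "CJK" if every character's Unicode name
# mentions CJK (e.g. 'CJK UNIFIED IDEOGRAPH-8BB0' for 记).
def is_cjk(token):
    return all('CJK' in unicodedata.name(ch, '') for ch in token)

line = '记者 应谦 美国'
print(all(is_cjk(t) for t in line.split()))  # → True
```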

Looks like the `unichars` list and the `perluniprops` list of Alphanumeric characters are a little different. The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420, where non-alphanumeric characters are padded with spaces....
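To see why the definition of "alphanumeric" matters for that padding step, here is a minimal sketch (not sacremoses' actual implementation) of padding non-alphanumeric characters with spaces. Python's `str.isalnum()` treats CJK ideographs as alphanumeric, so they are left untouched while punctuation gets padded; a character class built from a different alphanumeric list would behave differently:

```python
import re

# Pad every character that is neither alphanumeric nor whitespace with
# spaces, then collapse runs of whitespace. CJK ideographs count as
# alphanumeric under str.isalnum(), so they pass through unpadded.
def pad_nonalnum(text):
    out = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            out.append(ch)
        else:
            out.append(' ' + ch + ' ')
    return re.sub(r'\s+', ' ', ''.join(out)).strip()

print(pad_nonalnum('记者,应谦'))  # → 记者 , 应谦
```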