
Will the corpus be openly published?

Open avacaondata opened this issue 3 years ago • 3 comments

Given how well this model works, I'd be interested in having access to the original corpus on which it was trained. Will that be possible? I mean, is there any plan to upload it to huggingface/datasets or publish it in any other form?

Thank you very much in advance :)

avacaondata avatar Feb 21 '22 11:02 avacaondata

@alexvaca0 Thanks for your interest! We will be publishing the original tweets soon, hopefully in datasets. Leave this issue open so we let you know when they are available.

finiteautomata avatar Feb 21 '22 11:02 finiteautomata

Hi @alexvaca0. I'm having some problems with the original tweets, that is, the raw tweets prior to any preprocessing and filtering. The machine that held this data is not turning on; let's hope the disk is OK.

In the meantime, I have access to the preprocessed and filtered tweets (as described in the paper). If that's useful for you, send me an email and I'll give you access to them.

I'll leave this issue open until we are able to publish the original data.

finiteautomata avatar Mar 06 '22 21:03 finiteautomata

Oh that would be so great, if it is still possible to have access to the tweets... thank you very much :)

avacaondata avatar Jul 01 '22 10:07 avacaondata

Well, this is quite late, but the tweets have finally been released. I could only upload half of them so far, but I suppose this might be enough (~300M tweets).

Check https://huggingface.co/datasets/pysentimiento/spanish-tweets

I will be uploading the rest of them in the coming days.

finiteautomata avatar Nov 30 '22 23:11 finiteautomata
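
For anyone who wants to look at the released tweets without downloading all of them, here is a minimal sketch (not from the maintainers) that streams a few records with the Hugging Face `datasets` library. The helper names `repo_id_from_url` and `peek` are made up for illustration, and the sketch assumes `datasets` is installed and the repository is publicly readable. It makes no assumption about split names or column names: it takes whichever split comes first and returns the raw records.

```python
from itertools import islice
from urllib.parse import urlparse

# URL taken from the comment above.
DATASET_URL = "https://huggingface.co/datasets/pysentimiento/spanish-tweets"


def repo_id_from_url(url: str) -> str:
    """Turn a Hub dataset URL into the 'namespace/name' id used by load_dataset.

    Hypothetical helper, written for this example only.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hub dataset URL: {url}")
    return "/".join(parts[1:])


def peek(url: str = DATASET_URL, n: int = 3) -> list:
    """Return the first n records of the first available split (needs network)."""
    # Imported lazily so repo_id_from_url works even without the library installed.
    from datasets import load_dataset

    # streaming=True reads records lazily instead of downloading the whole corpus.
    splits = load_dataset(repo_id_from_url(url), streaming=True)
    # Assumption: any split is fine for a quick look, so take the first one.
    first_split = next(iter(splits.values()))
    return list(islice(first_split, n))
```

Calling `peek()` and printing each returned record shows which fields the dataset actually exposes, which is worth checking before writing any processing code against it.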