tweets.2016-09-01 dataset

Open rezwanh001 opened this issue 5 years ago • 5 comments

""" Creates a vocabulary from a tsv file.
"""

import codecs
import example_helper
from torchmoji.create_vocab import VocabBuilder
from torchmoji.word_generator import TweetWordGenerator

with codecs.open('../../twitterdata/tweets.2016-09-01', 'rU', 'utf-8') as stream:
    wg = TweetWordGenerator(stream)
    vb = VocabBuilder(wg)
    vb.count_all_words()
    vb.save_vocab()

This script builds its vocabulary from the '../../twitterdata/tweets.2016-09-01' dataset. Where can I find this dataset? Please let me know, and if possible, please share it with me by email at [email protected].

rezwanh001 avatar Dec 01 '19 15:12 rezwanh001

Hello, have you solved this problem?

KingS770234358 avatar Jan 02 '20 08:01 KingS770234358

@KingS770234358, this issue is not solved yet.

rezwanh001 avatar Jan 02 '20 16:01 rezwanh001

@rezwanh001 As Hugging Face mentions in the README, the code in the 'scripts' folder is used to process the raw data in the 'data' folder. I think 'tweets.2016-09-01' may be the result of that processing.

KingS770234358 avatar Jan 02 '20 16:01 KingS770234358

Maybe you should run the script 'convert_all_datasets.py' in the 'scripts' folder.

KingS770234358 avatar Jan 02 '20 16:01 KingS770234358

@KingS770234358 I tried running that script. Ran into this error.

Converting Olympic
-- Generating ../data/Olympic/own_vocab.pickle 
     done. Coverage: 0.030899113550021062
-- Generating ../data/Olympic/twitter_vocab.pickle 
     done. Coverage: 0.8874630645842128
-- Generating ../data/Olympic/combined_vocab.pickle 
Traceback (most recent call last):
  File "/Users/avij1/Desktop/imp_shit/torchMoji/scripts/convert_all_datasets.py", line 88, in <module>
    data = pickle.load(dataset, fix_imports=True,encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte
     done. Coverage: 0.8874630645842128
Converting PsychExp

anuragvij264 avatar Feb 16 '20 09:02 anuragvij264
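
A note on the traceback above: a `UnicodeDecodeError` raised by `pickle.load(..., encoding='utf-8')` usually means the pickle was written under Python 2 and contains byte strings that are not valid UTF-8. A common workaround is to retry with `encoding='latin1'`, which maps every byte to a code point and therefore cannot fail to decode. The sketch below assumes that is the cause here; it is not a confirmed fix for `convert_all_datasets.py`. The helper `load_legacy_pickle` and the hand-built `LEGACY` payload are hypothetical names used only for illustration.

```python
import io
import pickle


def load_legacy_pickle(stream):
    """Load a pickle, falling back to latin1 if UTF-8 decoding fails.

    Hypothetical helper: Python 2 `str` objects are stored as raw bytes,
    so decoding them as UTF-8 under Python 3 can raise UnicodeDecodeError.
    """
    raw = stream.read()
    try:
        return pickle.loads(raw, fix_imports=True, encoding='utf-8')
    except UnicodeDecodeError:
        # latin1 maps each byte 0x00-0xFF to one code point, so it never fails.
        return pickle.loads(raw, fix_imports=True, encoding='latin1')


# Hand-built protocol-2 pickle of a Python 2 byte string b'\xf0abc'
# (PROTO 2, SHORT_BINSTRING of length 4, BINPUT 0, STOP).
# 0xf0 followed by 'a' is not valid UTF-8, like the byte in the traceback.
LEGACY = b'\x80\x02U\x04\xf0abcq\x00.'

result = load_legacy_pickle(io.BytesIO(LEGACY))
print(repr(result))
```

Strings recovered this way keep their original bytes, which can be restored with `result.encode('latin1')` and then decoded with the correct encoding if it is known. Passing `encoding='bytes'` instead returns the raw byte strings without decoding them.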