torchMoji
tweets.2016-09-01 dataset
```python
""" Creates a vocabulary from a tsv file.
"""
import codecs
import example_helper
from torchmoji.create_vocab import VocabBuilder
from torchmoji.word_generator import TweetWordGenerator

with codecs.open('../../twitterdata/tweets.2016-09-01', 'rU', 'utf-8') as stream:
    wg = TweetWordGenerator(stream)
    vb = VocabBuilder(wg)
    vb.count_all_words()
    vb.save_vocab()
```
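Conceptually, what this script does is count token frequencies over the corpus and keep the most frequent words. A minimal stdlib sketch of that idea (not torchmoji's actual `VocabBuilder` implementation; the tokenization here is plain whitespace splitting, which is an assumption):

```python
from collections import Counter

def build_vocab(lines, max_size=10):
    # Count whitespace-separated tokens across all lines,
    # then keep the most frequent ones.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(max_size)]

corpus = [
    "I love this movie",
    "I love pizza",
    "this pizza is great",
]
print(build_vocab(corpus))
```

The real `TweetWordGenerator` additionally filters and normalizes tweets, but the frequency-counting core is the same idea.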
In this code, in order to create a vocabulary, you used the '../../twitterdata/tweets.2016-09-01'
dataset. But where can I find this dataset? Please let me know.
Please share this dataset with my mail [email protected], if possible.
Hello, have you solved this problem?
@KingS770234358 , This issue is not solved yet.
@rezwanh001 As Hugging Face mentioned in the README, the code in the 'scripts' folder is used to process the raw data in the 'data' folder. I think 'tweets.2016-09-01' may be the result of that processing.
Maybe you should run the script 'convert_all_datasets.py' in the 'scripts' folder.
@KingS770234358 I tried running that script. Ran into this error:

```
Converting Olympic
-- Generating ../data/Olympic/own_vocab.pickle
done. Coverage: 0.030899113550021062
-- Generating ../data/Olympic/twitter_vocab.pickle
done. Coverage: 0.8874630645842128
-- Generating ../data/Olympic/combined_vocab.pickle
Traceback (most recent call last):
  File "/Users/avij1/Desktop/imp_shit/torchMoji/scripts/convert_all_datasets.py", line 88, in <module>
    data = pickle.load(dataset, fix_imports=True, encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte
done. Coverage: 0.8874630645842128
Converting PsychExp
```
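This `UnicodeDecodeError` typically means the pickle file was written by Python 2 and contains byte strings that are not valid UTF-8. A common workaround (a sketch of the general technique, not a confirmed fix for this repository's data files) is to pass `encoding='latin1'` or `encoding='bytes'` to `pickle.load` instead of `encoding='utf-8'`:

```python
import pickle

# Hand-crafted bytes simulating a Python-2 pickle of a non-UTF-8
# byte string (protocol 2, SHORT_BINSTRING opcode): latin-1 'café'.
py2_pickle = b'\x80\x02U\x04caf\xe9q\x00.'

# encoding='utf-8' fails on the stray 0xe9 byte, just like the traceback above:
try:
    pickle.loads(py2_pickle, encoding='utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc)

# 'latin1' maps every byte value to a code point, so it never raises:
text = pickle.loads(py2_pickle, encoding='latin1')
print(text)  # 'café'

# Alternatively, encoding='bytes' leaves Python-2 str objects as raw bytes:
raw = pickle.loads(py2_pickle, encoding='bytes')
print(raw)  # b'caf\xe9'
```

Note that `latin1` round-trips the bytes but may produce mojibake for text that was actually UTF-8, and `encoding='bytes'` changes the loaded type from `str` to `bytes`, so downstream code may need adjusting.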