torchMoji
tweets.2016-09-01 dataset
```python
""" Creates a vocabulary from a tsv file.
"""
import codecs
import example_helper
from torchmoji.create_vocab import VocabBuilder
from torchmoji.word_generator import TweetWordGenerator

with codecs.open('../../twitterdata/tweets.2016-09-01', 'rU', 'utf-8') as stream:
    wg = TweetWordGenerator(stream)
    vb = VocabBuilder(wg)
    vb.count_all_words()
    vb.save_vocab()
```
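Conceptually, what this script does is count token frequencies over the corpus and keep the most frequent words. A minimal stdlib sketch of that idea (not torchmoji's actual `VocabBuilder` implementation; the tokenization here is plain whitespace splitting, which is an assumption):

```python
from collections import Counter

def build_vocab(lines, max_size=10):
    # Count whitespace-separated tokens across all lines,
    # then keep the most frequent ones.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(max_size)]

corpus = [
    "I love this movie",
    "I love pizza",
    "this pizza is great",
]
print(build_vocab(corpus))
```

The real `TweetWordGenerator` additionally filters and normalizes tweets, but the frequency-counting core is the same idea.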
In this code, in order to create a vocabulary, you used the '../../twitterdata/tweets.2016-09-01'
dataset. But where can I find this dataset? Please let me know.
Please share this dataset with my mail [email protected], if possible.
Hello, have you solved this problem?
@KingS770234358 , This issue is not solved yet.
@rezwanh001 As Hugging Face mentioned in the README, the code in the 'scripts' folder is used to process the raw data in the 'data' folder. I think 'tweets.2016-09-01' may be the result of that processing.
Maybe you should run the script 'convert_all_datasets.py' in the 'scripts' folder.
@KingS770234358 I tried running that script. Ran into this error:

```
Converting Olympic
-- Generating ../data/Olympic/own_vocab.pickle
done. Coverage: 0.030899113550021062
-- Generating ../data/Olympic/twitter_vocab.pickle
done. Coverage: 0.8874630645842128
-- Generating ../data/Olympic/combined_vocab.pickle
Traceback (most recent call last):
  File "/Users/avij1/Desktop/imp_shit/torchMoji/scripts/convert_all_datasets.py", line 88, in <module>
    data = pickle.load(dataset, fix_imports=True, encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte
done. Coverage: 0.8874630645842128
Converting PsychExp
```
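This `UnicodeDecodeError` typically means the pickle file was written by Python 2 and contains byte strings that are not valid UTF-8. A common workaround (a sketch of the general technique, not a confirmed fix for this repository's data files) is to pass `encoding='latin1'` or `encoding='bytes'` to `pickle.load` instead of `encoding='utf-8'`:

```python
import pickle

# Hand-crafted bytes simulating a Python-2 pickle of a non-UTF-8
# byte string (protocol 2, SHORT_BINSTRING opcode): latin-1 'café'.
py2_pickle = b'\x80\x02U\x04caf\xe9q\x00.'

# encoding='utf-8' fails on the stray 0xe9 byte, just like the traceback above:
try:
    pickle.loads(py2_pickle, encoding='utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc)

# 'latin1' maps every byte value to a code point, so it never raises:
text = pickle.loads(py2_pickle, encoding='latin1')
print(text)  # 'café'

# Alternatively, encoding='bytes' leaves Python-2 str objects as raw bytes:
raw = pickle.loads(py2_pickle, encoding='bytes')
print(raw)  # b'caf\xe9'
```

Note that `latin1` round-trips the bytes but may produce mojibake for text that was actually UTF-8, and `encoding='bytes'` changes the loaded type from `str` to `bytes`, so downstream code may need adjusting.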