How to create data.pt, vocab.pt, and label.pt for a custom dataset?
❓ Questions and Help
I have followed your tutorial on using graph4nlp for text classification with the TREC data, and I am now trying to build a custom dataset for IMDB sentiment analysis. I created all the .txt files for both the train and test splits, changed the .yaml file, created the custom dataset with Text2LabelDataset, and ran the same code. Everything seems fine, but preprocessing takes a very long time and training never starts. I also noticed that three files are still missing from ./data/imdb/processed/dependency_graph: data.pt, vocab.pt, and label.pt. How can I generate these three files?
Can you give some statistics of your corpus, such as the number of samples and the mean number of tokens per sentence?
What do you mean by the number of samples? Is it the train/test split? The mean token count per sentence is 231.
For both the train and test splits.
I have 60k examples, split equally between train and test, so train.txt and test.txt have 30k each.
@AlanSwift By the way, can you give me example code that preprocesses the TREC dataset so that it produces all the .pt files?
I think the input sentences are too long, so StanfordCoreNLP may be slow. You can add some print statements to see where the program hangs.
So, I just need to truncate the sentences?
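Something like this, I guess? (A minimal sketch, not graph4nlp API: it assumes each line is `sentence<TAB>label` like the TREC .txt files, and the 200-token cap is an arbitrary choice for illustration.)

```python
def truncate_example(line, max_tokens=200):
    """Keep at most max_tokens whitespace tokens of the sentence part of a line.

    Assumes the "sentence<TAB>label" layout of the TREC .txt files;
    lines without a tab are truncated as plain text.
    """
    text, sep, label = line.partition("\t")
    text = " ".join(text.split()[:max_tokens])
    return text + sep + label


def truncate_file(src_path, dst_path, max_tokens=200):
    # Rewrite src_path into dst_path with every example truncated,
    # so the long IMDB reviews don't stall the dependency parser.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(truncate_example(line.rstrip("\n"), max_tokens) + "\n")
```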
I think you had better first check whether StanfordCoreNLP is running successfully. For example, type `top` in your terminal and look for the Java process.
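For example, something like this (a sketch: the `/ping` probe only applies if you launched CoreNLP as a server, and 9000 is its default port; adjust if yours differs):

```shell
# Look for a running Java process (CoreNLP runs on the JVM);
# the fallback message keeps the command from failing outright.
pgrep -fl java || echo "no Java process found - CoreNLP is not running"

# If you started the CoreNLP server, probe it directly (default port 9000).
curl -s 'http://localhost:9000/ping' || echo "CoreNLP server not reachable on port 9000"
```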