
How to create data.pt, vocab.pt, and label.pt for custom dataset?

Open · firqaaa opened this issue 3 years ago • 8 comments

❓ Questions and Help

I have followed your tutorial on how to use graph4nlp for text classification with the TREC data. I then tried to create a custom dataset for my IMDB sentiment analysis task. I created all the .txt files for both train and test, changed the .yaml file, created the custom dataset using Text2LabelDataset, and ran the same code. Everything seems fine, but preprocessing takes a very long time and training still has not started. Another thing I noticed is that three files are still missing from ./data/imdb/processed/dependency_graph, namely data.pt, vocab.pt, and label.pt. How can I get these three files?

firqaaa avatar Aug 03 '22 02:08 firqaaa
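For reference, those three files are written by graph4nlp's dataset preprocessing once a Text2LabelDataset subclass has finished building the graphs, vocabulary, and label set. Below is a minimal sketch of such a subclass, modeled on the library's TREC example; ImdbDataset is a hypothetical name, and the exact hooks may differ between graph4nlp versions.

```python
from graph4nlp.pytorch.data.dataset import Text2LabelDataset


class ImdbDataset(Text2LabelDataset):
    """Hypothetical custom dataset, following the TREC text-classification example."""

    @property
    def raw_file_names(self):
        # Reserved split keys: 'train', 'test' (and optionally 'val').
        # These files are expected under <root_dir>/raw/.
        return {"train": "train.txt", "test": "test.txt"}

    @property
    def processed_file_names(self):
        # The three files that preprocessing writes under
        # <root_dir>/processed/<topology_subdir>/.
        return {"vocab": "vocab.pt", "data": "data.pt", "label": "label.pt"}

    def download(self):
        # The raw .txt files are prepared by hand, so there is nothing to download.
        return
```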

Can you give some statistics, such as the number of samples and the mean number of tokens per sentence in your corpus?

AlanSwift avatar Aug 03 '22 14:08 AlanSwift

What do you mean by sample amount? Is it the train/test split? The mean token count is 231.

firqaaa avatar Aug 03 '22 16:08 firqaaa

For both the train and test splits.

AlanSwift avatar Aug 03 '22 16:08 AlanSwift

I have 60k samples and split them equally in half between train and test, so train.txt and test.txt have 30k each.

firqaaa avatar Aug 03 '22 16:08 firqaaa

@AlanSwift btw, can you give me example code to preprocess the TREC dataset so that it ends up with all the .pt files?

firqaaa avatar Aug 03 '22 18:08 firqaaa
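For what it's worth, in graph4nlp the preprocessing is triggered simply by instantiating the dataset class; when it finishes, the files listed in processed_file_names appear under <root_dir>/processed/<topology_subdir>/. A rough usage sketch with the hypothetical ImdbDataset above; the keyword arguments mirror the TREC text-classification demo and may vary between graph4nlp releases.

```python
# Hypothetical usage; argument names follow the TREC demo and can differ
# between graph4nlp versions -- check the signature in your installed release.
dataset = ImdbDataset(
    root_dir="./data/imdb",              # expects raw/train.txt and raw/test.txt here
    topology_subdir="dependency_graph",  # output folder for the .pt files
    graph_name="dependency",             # dependency graph construction, as in the tutorial
)
# After this returns, ./data/imdb/processed/dependency_graph/ should contain
# data.pt, vocab.pt, and label.pt.
```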

I think the input sentences are too long, so StanfordCoreNLP may be slow. You can add some print statements to see where the program hangs.

AlanSwift avatar Aug 04 '22 12:08 AlanSwift
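One quick way to check whether CoreNLP itself is the bottleneck is to query the server directly with the stanfordcorenlp Python wrapper, assuming the server was started on the default port 9000 as in the tutorial.

```python
from stanfordcorenlp import StanfordCoreNLP

# Connect to the already-running CoreNLP server (assumed to be at localhost:9000).
nlp = StanfordCoreNLP("http://localhost", port=9000)

# If this returns promptly, the server is healthy and the long runtime is just
# the sheer number/length of the sentences being parsed.
print(nlp.dependency_parse("The movie was surprisingly good."))

nlp.close()
```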

So, I just need to truncate the sentences?

firqaaa avatar Aug 05 '22 04:08 firqaaa

I think you had better check whether StanfordCoreNLP is running successfully. For example, you can run "top" in your terminal and look for the Java process.

AlanSwift avatar Aug 05 '22 05:08 AlanSwift