
How to create data.pt, vocab.pt, and label.pt for custom dataset?

Open · firqaaa opened this issue 3 years ago • 8 comments

❓ Questions and Help

I have followed your tutorial on how to use graph4nlp for text classification with the TREC data. I then tried to create a custom dataset for my IMDB sentiment analysis task. I created all the .txt files for both train and test, changed the .yaml file, created the custom dataset using Text2LabelDataset, and ran the same code. Everything seems fine, but preprocessing takes a very long time and training still has not started. Another thing I noticed is that three files are still missing from ./data/imdb/processed/dependency_graph, namely data.pt, vocab.pt, and label.pt. How can I get these three files?

firqaaa avatar Aug 03 '22 02:08 firqaaa
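For reference, those three files are written by graph4nlp's dataset preprocessing once a Text2LabelDataset subclass has finished building the graphs, vocabulary, and label set. Below is a minimal sketch of such a subclass, modeled on the library's TREC example; ImdbDataset is a hypothetical name, and the exact hooks may differ between graph4nlp versions.

```python
from graph4nlp.pytorch.data.dataset import Text2LabelDataset


class ImdbDataset(Text2LabelDataset):
    """Hypothetical custom dataset, following the TREC text-classification example."""

    @property
    def raw_file_names(self):
        # Reserved split keys: 'train', 'test' (and optionally 'val').
        # These files are expected under <root_dir>/raw/.
        return {"train": "train.txt", "test": "test.txt"}

    @property
    def processed_file_names(self):
        # The three files that preprocessing writes under
        # <root_dir>/processed/<topology_subdir>/.
        return {"vocab": "vocab.pt", "data": "data.pt", "label": "label.pt"}

    def download(self):
        # The raw .txt files are prepared by hand, so there is nothing to download.
        return
```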

Can you give some statistics, such as the number of samples and the mean number of tokens per sentence in your corpus?

AlanSwift avatar Aug 03 '22 14:08 AlanSwift

What do you mean by sample amount? Is it the train/test split? The mean token count is 231.

firqaaa avatar Aug 03 '22 16:08 firqaaa

For both the train and test splits.

AlanSwift avatar Aug 03 '22 16:08 AlanSwift

I have 60k samples and split them equally in half between train and test, so train.txt and test.txt have 30k each.

firqaaa avatar Aug 03 '22 16:08 firqaaa

@AlanSwift btw, can you give me example code to preprocess the TREC dataset so that it ends up with all the .pt files?

firqaaa avatar Aug 03 '22 18:08 firqaaa
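For what it's worth, in graph4nlp the preprocessing is triggered simply by instantiating the dataset class; when it finishes, the files listed in processed_file_names appear under <root_dir>/processed/<topology_subdir>/. A rough usage sketch with the hypothetical ImdbDataset above; the keyword arguments mirror the TREC text-classification demo and may vary between graph4nlp releases.

```python
# Hypothetical usage; argument names follow the TREC demo and can differ
# between graph4nlp versions -- check the signature in your installed release.
dataset = ImdbDataset(
    root_dir="./data/imdb",              # expects raw/train.txt and raw/test.txt here
    topology_subdir="dependency_graph",  # output folder for the .pt files
    graph_name="dependency",             # dependency graph construction, as in the tutorial
)
# After this returns, ./data/imdb/processed/dependency_graph/ should contain
# data.pt, vocab.pt, and label.pt.
```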

I think the input sentences are too long, so StanfordCoreNLP may be slow. You can add some print statements to see where the program hangs.

AlanSwift avatar Aug 04 '22 12:08 AlanSwift
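One quick way to check whether CoreNLP itself is the bottleneck is to query the server directly with the stanfordcorenlp Python wrapper, assuming the server was started on the default port 9000 as in the tutorial.

```python
from stanfordcorenlp import StanfordCoreNLP

# Connect to the already-running CoreNLP server (assumed to be at localhost:9000).
nlp = StanfordCoreNLP("http://localhost", port=9000)

# If this returns promptly, the server is healthy and the long runtime is just
# the sheer number/length of the sentences being parsed.
print(nlp.dependency_parse("The movie was surprisingly good."))

nlp.close()
```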

So, I just need to truncate the sentences?

firqaaa avatar Aug 05 '22 04:08 firqaaa

I think you had better check whether StanfordCoreNLP is running successfully. For example, you can run "top" in your terminal and look for the Java process.

AlanSwift avatar Aug 05 '22 05:08 AlanSwift