SemEval-2017-Task-5

Unable to find files

Open JayaLekhrajani opened this issue 8 years ago • 11 comments

Hi Sudipta,

I am unable to find DATA_FILES_LIST, ORIGINAL_DATA_DIR, and RAW_DATA_PATH in the config.py file. Also, what is the mb_train_trial_test_new_prs.csv file for? The training and test data are in JSON format.

JayaLekhrajani avatar Oct 31 '17 03:10 JayaLekhrajani

Also, extract_features.py contains "from features import lexical, syntactic, writing_density, sentiments, embeddings, generic_field_vectorizer", but I could not find all of these files in the features folder. Let me know if I am missing something.

JayaLekhrajani avatar Oct 31 '17 18:10 JayaLekhrajani

This source code is part of a larger project structure. The provided code is mainly meant to show the deep learning architectures. This codebase "is not" a running system. config.py is used to keep track of file paths and feature combinations in the project, so you can safely ignore the irrelevant entries. You have to put in your own files and change the paths. As for the features, the lexical, embeddings, and sentiment features are relevant, so just ignore the rest. SenticNet features are extracted during the preprocessing step, and that code is in prepare_data.
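As a rough illustration of what "put your own files and change the paths" could look like, here is a minimal sketch of a config.py. The three constant names come from this thread; their types, the directory names, and the choice of what goes into the list are assumptions for illustration only, not the project's actual configuration.

```python
from pathlib import Path

# Every value below is a placeholder; point these at your own copies of the data.
# Hypothetical directory holding the JSON files released by the task organizers.
ORIGINAL_DATA_DIR = Path("data/original")

# Hypothetical location for data before preprocessing.
RAW_DATA_PATH = Path("data/raw")

# Assumed here to list the preprocessed CSV files used by the experiments.
DATA_FILES_LIST = [
    "headline_train_trial_test.csv",
    "mb_train_trial_test_new_prs.csv",
]
```

Anything in the real config.py that refers to features or files you do not have can then be dropped.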

For the project, the raw data files from the organizers were preprocessed first, and then the experiments were run, so there were several preprocessed files. You have the code to generate them, not the actual files. Thanks.

cryptexcode avatar Oct 31 '17 18:10 cryptexcode

Thank you, Sudipta, for clarifying most of my doubts. I am still a little confused about how you got the following files: headline_train_trial_test.csv and mb_train_trial_test_new_prs.csv.

For the first one, did you get the csv file by merging the train, trial, and test json files?

JayaLekhrajani avatar Oct 31 '17 19:10 JayaLekhrajani

Exactly. All the data were merged into a single csv for easy manipulation.
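A minimal sketch of such a merge, using only the Python standard library. It assumes each JSON file holds a list of flat records; the function name `merge_json_to_csv`, the extra `split` column, and the choice to leave missing fields (for example, scores absent from the test file) empty are all assumptions for illustration, not the project's actual preprocessing.

```python
import csv
import json


def merge_json_to_csv(json_paths, csv_path):
    """Merge JSON files (each a list of flat records) into a single CSV.

    json_paths maps a split name (e.g. "train") to a file path.
    Returns the number of rows written.
    """
    rows = []
    for split, path in json_paths.items():
        with open(path, encoding="utf-8") as handle:
            records = json.load(handle)
        for record in records:
            row = dict(record)
            row["split"] = split  # remember which file the row came from
            rows.append(row)

    # Union of all keys in first-seen order, so files with different fields still fit.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval="" leaves a cell empty when a record lacks that field.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

It could be called as `merge_json_to_csv({"train": "train.json", "trial": "trial.json", "test": "test.json"}, "headline_train_trial_test.csv")`, where the JSON file names are placeholders. Nested values such as lists are written as their string representation, so they may need their own handling.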

cryptexcode avatar Oct 31 '17 19:10 cryptexcode

Hi Sudipta, sorry to bother you again. While merging the three json files, what values did you use for the sentiment score of the test data?

JayaLekhrajani avatar Nov 01 '17 21:11 JayaLekhrajani

Hi, if you go through the paper you will get an idea of the process. We used SenticNet.

cryptexcode avatar Nov 01 '17 21:11 cryptexcode

Hi Sudipta, I read your paper and it has clarified most of my doubts. The only doubts I still have are:
(a) Has the doc_to_sequence_csv module not been used for the microblogs data?
(b) SenticConceptsTfidfVectorizer is supposed to be defined in the sentiments module of the features package, but it is not in the repository. SenticNet features were extracted during the pre-processing step. How did you create SenticConceptsTfidfVectorizer?

JayaLekhrajani avatar Nov 02 '17 17:11 JayaLekhrajani

a) We used everything in two versions: one for the microblogs and another for the headlines. So the same code is used to generate processed data for both datasets. b) In the end we didn't use it, as we created the concept vectors during the preprocessing step. But if you want to use it, you can use the simple tf-idf vectorizer, as it will be modeled as a bag of concepts.
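To make the "bag of concepts" idea concrete, here is a small dependency-free sketch that treats each document as a list of already-extracted concept strings and weights them with tf-idf. It is a hand-rolled stand-in, not the missing SenticConceptsTfidfVectorizer: the function name, the example concepts, and the weighting (raw count times `log(N / df) + 1`, no normalization) are assumptions, and a library vectorizer such as scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers would differ.

```python
import math
from collections import Counter


def tfidf_bag_of_concepts(documents):
    """Return (vocabulary, vectors) for documents given as lists of concept strings."""
    vocabulary = sorted({concept for doc in documents for concept in doc})
    index = {concept: i for i, concept in enumerate(vocabulary)}
    n_docs = len(documents)

    # Document frequency: the number of documents in which each concept occurs.
    doc_freq = Counter(concept for doc in documents for concept in set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)  # term frequency as a raw count
        vector = [0.0] * len(vocabulary)
        for concept, count in counts.items():
            idf = math.log(n_docs / doc_freq[concept]) + 1.0
            vector[index[concept]] = count * idf
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical concepts, one list per text, as they might come out of preprocessing.
docs = [["bull_market", "profit_rise"], ["profit_rise"], ["stock_fall"]]
vocabulary, vectors = tfidf_bag_of_concepts(docs)
```

Each concept is kept as a single token, so multi-word concepts are not split into words the way a default word tokenizer would split them.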

Hope it helps. All the best.

cryptexcode avatar Nov 03 '17 23:11 cryptexcode

Hi Sudipta,

Thank you for clarifying my doubts. You have been really helpful. There is still one problem that I tried to fix on my own, but couldn't. When I run the model and invoke the function pack_data_to_format(), I get the following error, and I am unable to find a fix. [image: Inline image 2]


JayaLekhrajani avatar Nov 04 '17 22:11 JayaLekhrajani

Hi Sudipta,

You mentioned that 'About the features, lexical, embeddings, and sentiment features are relevant.' But I can only see embeddings.py and lexical.py under the features directory. Is there a sentiments.py, or can I just drop all the missing modules?

Thanks.

leckie-chn avatar Dec 02 '17 05:12 leckie-chn

The sentiment features were extracted by code in the preprocessing step. The code was written in a hurry, so it is not well structured.

cryptexcode avatar Dec 31 '17 01:12 cryptexcode