Using My own dataset with csv

Open aqsa27 opened this issue 5 years ago • 14 comments

Hi, I am trying to build a cdQA pipeline with my own dataset, which is in CSV. What format should my dataset be in? There is only a PDF converter and none for CSV. Is there any way to convert my dataset into the dataframe format that cdQA accepts?

aqsa27 avatar Oct 24 '19 18:10 aqsa27

| title | paragraphs |
| --- | --- |
| The Article Title | [Paragraph 1 of Article, ... , Paragraph N of Article] |
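
As a minimal sketch of that layout: each row pairs one title with a list of paragraph strings. The article titles and paragraph texts below are made up, and `to_dataframe` is a hypothetical helper, not part of cdQA.

```python
# Each record is one document: a title plus a list of paragraph strings.
# The titles and texts are invented placeholders.
records = [
    {
        "title": "First Article",
        "paragraphs": [
            "Paragraph 1 of the first article.",
            "Paragraph 2 of the first article.",
        ],
    },
    {
        "title": "Second Article",
        "paragraphs": ["Only paragraph of the second article."],
    },
]


def to_dataframe(rows):
    """Hypothetical helper: build the two-column dataframe from the records."""
    import pandas as pd  # imported lazily, so the records can be inspected without pandas

    return pd.DataFrame(rows, columns=["title", "paragraphs"])
```

Calling `to_dataframe(records)` should give a dataframe with one row per article, where each `paragraphs` cell holds a Python list rather than a single string.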

ghost avatar Oct 25 '19 00:10 ghost

Is there any automated way to convert the data into this format?

aqsa27 avatar Oct 25 '19 01:10 aqsa27

https://github.com/cdqa-suite/cdQA/blob/88a1ff2bb249f24edc427737ccb0b8f8959cf0b6/cdqa/scrapper/bs4_bnpp_newsroom.py This is the script they have used. It's a good starting point.

ghost avatar Oct 25 '19 01:10 ghost

I will try with this

aqsa27 avatar Oct 25 '19 12:10 aqsa27

The PDF converter does not read my file. Is there a required format for the PDF file as well?

aqsa27 avatar Nov 01 '19 15:11 aqsa27

I want to do the same. Kindly help with this.

swebalaji avatar Nov 18 '19 11:11 swebalaji

Hi,

Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your PDF into text, and then do some preprocessing to build the dataframe in the format presented in the readme.
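
A rough sketch of that workaround, assuming the `pdfminer.six` package and its `extract_text` function. The helper names `split_paragraphs` and `pdf_to_record`, the blank-line splitting rule, and the `min_chars` threshold are illustrative choices, not part of cdQA, and real PDFs will likely need more cleanup than this.

```python
import re


def split_paragraphs(text, min_chars=20):
    """Split raw text on blank lines and drop very short fragments."""
    chunks = re.split(r"\n\s*\n", text)
    # Collapse line breaks and repeated spaces inside each chunk.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if len(p) >= min_chars]


def pdf_to_record(pdf_path, title):
    """Extract the text of one PDF and return a {title, paragraphs} record."""
    # Lazy import: requires the pdfminer.six package to be installed.
    from pdfminer.high_level import extract_text

    return {"title": title, "paragraphs": split_paragraphs(extract_text(pdf_path))}
```

A list of such records, one per PDF, can then be turned into a dataframe with `title` and `paragraphs` columns.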

andrelmfarias avatar Nov 22 '19 10:11 andrelmfarias

@aqsa27 what does your csv look like? Can you share the format or a sample here?

fmikaelian avatar Nov 23 '19 11:11 fmikaelian

@aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do we train the model? If I use the existing QAPipeline:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=df)

the results do not match the questions asked while testing. Any pointers on training the model?

nayakvidya avatar Nov 25 '19 06:11 nayakvidya

> @aqsa27 what does your csv look like? Can you share the format or a sample here?

Hi,

My dataset contains 4 columns: question, answer, date, and additional information.
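
One possible way to map such a CSV onto the `title` / `paragraphs` layout is sketched below. The mapping itself is an assumption (question becomes the title, answer and additional information become the paragraphs, date is dropped), and so are the exact column names and the sample rows. It is not a conversion that cdQA prescribes.

```python
import csv
import io

# Made-up sample with the four columns described above (column names assumed).
SAMPLE_CSV = """question,answer,date,additional_information
What is cdQA?,cdQA is a closed-domain question answering system.,2019-10-24,It combines a retriever with a reader.
How is the retriever fitted?,The retriever is fitted on the dataframe of paragraphs.,2019-11-25,
"""


def rows_to_records(csv_file):
    """Turn question/answer rows into {title, paragraphs} records."""
    records = []
    for row in csv.DictReader(csv_file):
        # Keep only the non-empty text columns as paragraphs.
        paragraphs = [
            (row.get(col) or "").strip()
            for col in ("answer", "additional_information")
        ]
        paragraphs = [p for p in paragraphs if p]
        if paragraphs:
            records.append({"title": row["question"].strip(), "paragraphs": paragraphs})
    return records


records = rows_to_records(io.StringIO(SAMPLE_CSV))
```

With pandas installed, `pd.DataFrame(records)` gives a dataframe that can be passed to `fit_retriever(df=...)` as in the snippets in this thread.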

aqsa27 avatar Nov 25 '19 16:11 aqsa27

> @aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do we train the model? If I use the existing QAPipeline:
>
> cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
> cdqa_pipeline.fit_retriever(df=df)
>
> the results do not match the questions asked while testing. Any pointers on training the model?

I create a new dataframe from my csv and use that to train my model:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=newdf)

The answer with this method is not 100% accurate, but it's a lot more relevant.

aqsa27 avatar Nov 25 '19 16:11 aqsa27

> Hi,
>
> Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your PDF into text, and then do some preprocessing to build the dataframe in the format presented in the readme.

Hi,

Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.

aqsa27 avatar Nov 25 '19 16:11 aqsa27

> Hi, can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.

One of our official tutorials (found in our readme and our examples repository): https://colab.research.google.com/github/cdqa-suite/cdQA/blob/master/examples/tutorial-first-steps-cdqa.ipynb

If you run this notebook, the csv will be saved in the directory ./data/bnpp_newsroom_v1.1/

You can ignore the columns date, category, link, and abstract. You only need title and paragraphs.
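
A hedged sketch of loading such a csv while keeping only those two columns. It assumes the `paragraphs` cell is stored as the string form of a Python list, which therefore has to be parsed back with `ast.literal_eval`. The sample row and the `load_records` helper are invented for illustration.

```python
import ast
import csv
import io

# Made-up row using the column names mentioned above.
SAMPLE_CSV = '''date,title,category,link,abstract,paragraphs
2019-01-01,Example title,news,https://example.com,An abstract,"['First paragraph.', 'Second paragraph.']"
'''


def load_records(csv_file):
    """Read the csv and keep only title and paragraphs (parsed into a list)."""
    records = []
    for row in csv.DictReader(csv_file):
        records.append(
            {
                "title": row["title"],
                # The cell holds a string such as "['a', 'b']", so parse it safely.
                "paragraphs": ast.literal_eval(row["paragraphs"]),
            }
        )
    return records


records = load_records(io.StringIO(SAMPLE_CSV))
```

With pandas, the same parsing can be done in one step by passing `converters={'paragraphs': ast.literal_eval}` to `pd.read_csv`.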

andrelmfarias avatar Nov 27 '19 15:11 andrelmfarias

https://github.com/cdqa-suite/cdQA/issues/345 can you fix my issue

falcon-codz avatar Feb 28 '20 06:02 falcon-codz