cdQA Using My own dataset with csv

Hi, I am trying to build a cdQA pipeline with my own dataset, which is in CSV. Can you tell me what format my dataset should be in? There is only a PDF converter, nothing for CSV. Is there any way of converting my dataset into the dataframe format that cdQA accepts?

Oct 24 '19 18:10 aqsa27

title	paragraphs
The Article Title	[Paragraph 1 of Article, ... , Paragraph N of Article]
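
As an illustration of the format above, here is a minimal sketch that collapses a flat CSV with one paragraph per row into one row per title. The input column names `title` and `text`, the helper name `group_paragraphs` and the sample data are assumptions made for the example, not part of cdQA.

```python
import csv
import io


def group_paragraphs(csv_file):
    """Collapse a one-paragraph-per-row CSV into one row per title."""
    grouped = {}  # dicts keep insertion order, so titles stay in file order
    for record in csv.DictReader(csv_file):
        text = record["text"].strip()
        if text:  # skip empty paragraphs
            grouped.setdefault(record["title"], []).append(text)
    return [{"title": t, "paragraphs": p} for t, p in grouped.items()]


# Hypothetical input: the column names "title" and "text" are assumptions.
sample = io.StringIO(
    "title,text\n"
    "Article A,First paragraph of A.\n"
    "Article A,Second paragraph of A.\n"
    "Article B,Only paragraph of B.\n"
)
rows = group_paragraphs(sample)
```

Passing `rows` to `pandas.DataFrame(rows)` should give a dataframe with a `title` column and a `paragraphs` column holding Python lists.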

Oct 25 '19 00:10 ghost

Is there any automated way to convert the data into this format?

Oct 25 '19 01:10 aqsa27

https://github.com/cdqa-suite/cdQA/blob/88a1ff2bb249f24edc427737ccb0b8f8959cf0b6/cdqa/scrapper/bs4_bnpp_newsroom.py This is the script they have used. It's a good starting point.

Oct 25 '19 01:10 ghost

I will try with this

Oct 25 '19 12:10 aqsa27

The converters provided for PDF cannot read my file. Is there a required format for the PDF file as well?

Nov 01 '19 15:11 aqsa27

I want to do the same. Kindly help with this.

Nov 18 '19 11:11 swebalaji

Hi,

Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your PDF into text, and then do some preprocessing to build the dataframe in the format presented in the readme.
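
A rough sketch of that workaround, assuming the pdfminer.six package and its `extract_text` helper. The function names `pdf_to_text`, `split_paragraphs` and `pdf_row`, the blank-line splitting rule and the `min_chars` threshold are illustrative choices, not something cdQA prescribes.

```python
import re


def pdf_to_text(pdf_path):
    """Extract raw text from a PDF (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def split_paragraphs(text, min_chars=20):
    """Split raw text on blank lines and drop very short fragments."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Re-join lines that the PDF layout broke inside one paragraph.
        paragraph = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if len(paragraph) >= min_chars:  # drops page numbers, stray headings
            paragraphs.append(paragraph)
    return paragraphs


def pdf_row(title, text):
    """Build one row in the title / paragraphs layout."""
    return {"title": title, "paragraphs": split_paragraphs(text)}


# With a real file: text = pdf_to_text("my_file.pdf")  # hypothetical path
raw = (
    "Annual Report\n\n"
    "The first paragraph spans\ntwo lines in the PDF.\n\n"
    "3\n\n"
    "The second paragraph is on one line."
)
row = pdf_row("Annual Report", raw)
```

One such row per PDF can then be collected into a list and turned into a dataframe with `pandas.DataFrame`. Real PDFs often need more cleanup than this (headers, footers, multi-column layouts).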

Nov 22 '19 10:11 andrelmfarias

@aqsa27 what does your CSV look like? Can you share the format or a sample here?

Nov 23 '19 11:11 fmikaelian

@aqsa27, @fmikaelian - I would also like to have a look at the CSV. Can you please share a sample? Also, once we have built the CSV, how do we train the model? If I use the existing QAPipeline:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=df)

the results do not match the questions asked while testing. Any pointers on training the model?

Nov 25 '19 06:11 nayakvidya

> @aqsa27 what does your CSV look like? Can you share the format or a sample here?

Hi,

My dataset contains 4 columns: question, answer, date and additional information.

Nov 25 '19 16:11 aqsa27

> @aqsa27, @fmikaelian - I would also like to have a look at the CSV. Can you please share a sample? Also, once we have built the CSV, how do we train the model? If I use the existing QAPipeline:
>
> cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
> cdqa_pipeline.fit_retriever(df=df)
>
> the results do not match the questions asked while testing. Any pointers on training the model?

I create a new dataframe from my CSV and use that to train my model:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=newdf)

The answers with this method are not 100% accurate, but they are a lot more relevant.
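
For readers with a similar question/answer CSV, one possible way to build such a dataframe is sketched below. This is a guess at a workable mapping, not necessarily what was done here: it assumes header names `question`, `answer`, `date` and `additional_information`, uses the question as the title, and puts the answer and the additional information into the paragraphs list. The helper name `qa_rows_to_cdqa` is hypothetical.

```python
import csv
import io


def qa_rows_to_cdqa(csv_file):
    """Map question/answer records onto the title / paragraphs layout."""
    rows = []
    for record in csv.DictReader(csv_file):
        # Assumed mapping: answer and extra information become paragraphs.
        paragraphs = [
            record[col].strip()
            for col in ("answer", "additional_information")
            if record.get(col) and record[col].strip()
        ]
        if paragraphs:  # skip records with no usable text
            rows.append(
                {"title": record["question"].strip(), "paragraphs": paragraphs}
            )
    return rows


# Hypothetical sample data; the header names are assumptions.
sample = io.StringIO(
    "question,answer,date,additional_information\n"
    "How do I reset my password?,Use the reset link on the login page.,"
    "2019-11-25,Links expire after one day.\n"
    "Where is the office?,It is in Paris.,2019-11-25,\n"
)
newdf_rows = qa_rows_to_cdqa(sample)
```

`pandas.DataFrame(newdf_rows)` then gives a dataframe that can be passed to `fit_retriever(df=...)` as in the snippet above. The `date` column is dropped because only `title` and `paragraphs` are needed.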

Nov 25 '19 16:11 aqsa27

> Hi,
>
> Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your PDF into text, and then do some preprocessing to build the dataframe in the format presented in the readme.

Hi,

Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a CSV in the recommended format.

Nov 25 '19 16:11 aqsa27

> Hi, can you give us an example of the format? Not the one mentioned in the readme, but a live example of a CSV in the recommended format.

One of our official tutorials (found in our readme and our examples repository): https://colab.research.google.com/github/cdqa-suite/cdQA/blob/master/examples/tutorial-first-steps-cdqa.ipynb

If you run this notebook, the CSV will be saved in the directory ./data/bnpp_newsroom_v1.1/

You can ignore the columns date, category, link and abstract. You only need title and paragraphs.
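
When reading such a CSV back, the paragraphs column arrives as text and has to be parsed into a Python list again. A minimal sketch using only the standard library follows; the column order and the sample row are made up for illustration, and `load_title_paragraphs` is a hypothetical helper name.

```python
import csv
import io
from ast import literal_eval


def load_title_paragraphs(csv_file):
    """Keep only title and paragraphs, parsing the stored list literal."""
    rows = []
    for record in csv.DictReader(csv_file):
        rows.append(
            {
                "title": record["title"],
                # The list is stored as its Python literal text, e.g. "['a', 'b']".
                "paragraphs": literal_eval(record["paragraphs"]),
            }
        )
    return rows


# Hypothetical one-row file with the extra columns that can be ignored.
sample = io.StringIO(
    "date,title,category,link,abstract,paragraphs\n"
    "2019-01-01,Example title,News,https://example.com,Short abstract,"
    "\"['First paragraph.', 'Second paragraph.']\"\n"
)
rows = load_title_paragraphs(sample)
```

With pandas, the same parsing can be done at load time by passing `converters={'paragraphs': literal_eval}` to `pandas.read_csv`.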

Nov 27 '19 15:11 andrelmfarias

https://github.com/cdqa-suite/cdQA/issues/345 Can you fix my issue?

Feb 28 '20 06:02 falcon-codz
