cdQA
Using my own dataset with csv
Hi, I am trying to build a cdQA pipeline with my own dataset, which is in CSV. What format should my dataset be in? The only converter I can find is the PDF one. Is there any way to convert my dataset into the dataframe format that cdQA accepts?
title | paragraphs |
---|---|
The Article Title | [Paragraph 1 of Article, ... , Paragraph N of Article] |
Is there any automated way to convert the data into this format?
This is the script they used. It's a good starting point: https://github.com/cdqa-suite/cdQA/blob/88a1ff2bb249f24edc427737ccb0b8f8959cf0b6/cdqa/scrapper/bs4_bnpp_newsroom.py
I will try with this
The converters used for PDF do not read my file. Is there a required format for the PDF file as well?
I want to do the same as well. Kindly help with this.
Hi,
Unfortunately, our `pdf_converter` does not generalize well. I will be working on a solution to that soon. For now, I advise you to try other libraries to convert your pdf into text, such as `pdfminer`, and do some preprocessing to build the dataframe with the format presented in the readme.
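To make the "do some preprocessing" step concrete, here is a minimal sketch of turning extracted text into a `title` / `paragraphs` record. It uses only the standard library. The helper names `text_to_paragraphs` and `build_record`, the `min_chars` threshold, and the sample text are made up for illustration. The commented-out `extract_text` call assumes the `pdfminer.six` package is installed.

```python
import re


def text_to_paragraphs(text, min_chars=20):
    """Split raw extracted text into cleaned paragraphs (hypothetical helper)."""
    # Blank lines (possibly containing whitespace) are treated as paragraph breaks.
    chunks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for chunk in chunks:
        # Collapse line wraps and repeated spaces inside a paragraph.
        cleaned = " ".join(chunk.split())
        # Drop very short fragments such as headings or page numbers.
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs


def build_record(title, text):
    """Build one row in the title / paragraphs layout from the readme."""
    return {"title": title, "paragraphs": text_to_paragraphs(text)}


# With pdfminer.six installed, the text could come from something like:
#   from pdfminer.high_level import extract_text
#   raw = extract_text("my_document.pdf")  # hypothetical file name
raw = (
    "Annual Report\n\n"
    "The bank opened three new\nbranches in 2019.\n\n\n"
    "Revenue grew compared with the previous year.\n"
)
record = build_record("Annual Report", raw)
print(record)
```

A list of such records can then be passed to `pandas.DataFrame(...)` to get the dataframe. How well the blank-line split works depends heavily on the PDF layout, so treat the splitting rule as a starting point.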
@aqsa27 what does your csv look like? Can you share the format or a sample here?
@aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do you train the model? If I use the existing QAPipeline:
    cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
    cdqa_pipeline.fit_retriever(df=df)
the results do not match the questions asked while testing. Any pointers on training the model?
> @aqsa27 what does your csv look like? Can you share the format or a sample here?
Hi,
My dataset contains 4 columns: question, answer, date and additional information.
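For a CSV with those four columns, one possible conversion into the `title` / `paragraphs` layout is sketched below. This is an assumption, not an official cdQA recipe: it uses the question as the title and puts the answer plus any additional information into the paragraphs list, so that the text the answer should come from is in `paragraphs`. The exact column names, the helper names and the sample rows are made up for illustration.

```python
import csv
import io


def qa_rows_to_records(rows):
    """Map question/answer rows to title/paragraphs records (assumed mapping)."""
    records = []
    for row in rows:
        paragraphs = [row["answer"].strip()]
        extra = (row.get("additional information") or "").strip()
        if extra:
            paragraphs.append(extra)
        # The date column is ignored here; only title and paragraphs are kept.
        records.append({"title": row["question"].strip(), "paragraphs": paragraphs})
    return records


def write_cdqa_csv(records, handle):
    """Write records as a csv with a title column and a paragraphs column."""
    writer = csv.DictWriter(handle, fieldnames=["title", "paragraphs"])
    writer.writeheader()
    for rec in records:
        # The list is stored as its Python literal so it can be parsed back later.
        writer.writerow({"title": rec["title"], "paragraphs": repr(rec["paragraphs"])})


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "question,answer,date,additional information\n"
    "How do I reset my password?,Use the reset link on the login page.,2019-05-01,Links expire after one hour.\n"
    "Where is the office?,The office is in Paris.,2019-05-02,\n"
)
records = qa_rows_to_records(csv.DictReader(sample))

out = io.StringIO()
write_cdqa_csv(records, out)
print(out.getvalue())
```

With a real file, `open("my_data.csv", newline="")` would replace the `io.StringIO` sample, and the output handle would be a file opened for writing.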
> @aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do you train the model? If I use the existing QAPipeline:
>     cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
>     cdqa_pipeline.fit_retriever(df=df)
> the results do not match the questions asked while testing. Any pointers on training the model?
I create a new dataframe from my csv and use that to train my model:
    cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
    cdqa_pipeline.fit_retriever(df=newdf)
The answers with this method are not 100% accurate, but they are a lot more relevant.
> Hi,
> Unfortunately, our `pdf_converter` does not generalize well. I will be working on a solution to that soon. For now, I advise you to try other libraries to convert your pdf into text, such as `pdfminer`, and do some preprocessing to build the dataframe with the format presented in the readme.
Hi,
Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.
One of our official tutorials (found in our readme and our examples repository): https://colab.research.google.com/github/cdqa-suite/cdQA/blob/master/examples/tutorial-first-steps-cdqa.ipynb
If you run this notebook, the csv will be saved in the directory `./data/bnpp_newsroom_v1.1/`.
You can ignore the columns `date`, `category`, `link` and `abstract`. You only need `title` and `paragraphs`.
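As a rough illustration of keeping only those two columns, the sketch below reads a csv and parses the `paragraphs` cell back into a list with `ast.literal_eval`. It assumes the cell holds a Python list literal. The column order, the sample row and the helper name `load_title_paragraphs` are made up for illustration.

```python
import csv
import io
from ast import literal_eval


def load_title_paragraphs(handle):
    """Read a csv and keep only title and paragraphs (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle):
        # The paragraphs cell is text like "['p1', 'p2']"; parse it into a list.
        paragraphs = literal_eval(row["paragraphs"])
        if not isinstance(paragraphs, list):
            raise ValueError("paragraphs must be a list of strings")
        # Extra columns (date, category, link, abstract) are simply dropped.
        rows.append({"title": row["title"], "paragraphs": paragraphs})
    return rows


# Made-up sample row; the real file has one row per article.
sample = io.StringIO(
    "date,title,category,link,abstract,paragraphs\n"
    "2019-01-01,Example title,News,https://example.com,Short abstract,"
    "\"['First paragraph.', 'Second paragraph.']\"\n"
)
rows = load_title_paragraphs(sample)
print(rows)
```

The readme does the same parsing with pandas: `pd.read_csv(path, converters={'paragraphs': literal_eval})`. After that, selecting `df[['title', 'paragraphs']]` would drop the other columns.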
Can you fix my issue? https://github.com/cdqa-suite/cdQA/issues/345