
MemoryError workaround

Open nortz8 opened this issue 4 years ago • 1 comments

Kindly consider changing the _expand_paragraphs function in cdqa_sklearn.py to accommodate larger datasets. Modifying the DataFrame needs a lot of memory for bigger data, so it would be better to collect the rows as a list of dicts and convert that list to a DataFrame once at the end.

Below is the modification I made so that I would not get a MemoryError:

@staticmethod
def _expand_paragraphs(df):
    # Collect one plain dict per paragraph, then build the DataFrame once.
    data = []
    # Column 0 holds the title, column 1 the list of paragraphs.
    for title, paragraphs in zip(df.iloc[:, 0], df.iloc[:, 1]):
        for paragraph in paragraphs:
            data.append({'title': title, 'content': paragraph})
    return pd.DataFrame(data)
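
The same idea can be tried outside the class with a small standalone sketch. This is only an illustration under assumptions: the function name expand_paragraphs and the column names title and paragraphs are hypothetical and not taken from cdqa_sklearn.py, and the input is assumed to hold one row per document with a list of paragraph strings.

```python
import pandas as pd


def expand_paragraphs(df, title_col="title", paragraphs_col="paragraphs"):
    """Return one row per paragraph, built from a list of dicts.

    The column names are assumptions made for this sketch.
    """
    rows = []
    for title, paragraphs in zip(df[title_col], df[paragraphs_col]):
        for paragraph in paragraphs:
            rows.append({"title": title, "content": paragraph})
    # Passing columns keeps the expected layout even when rows is empty.
    return pd.DataFrame(rows, columns=["title", "content"])


# Tiny made-up input: two documents with three paragraphs in total.
docs = pd.DataFrame(
    {
        "title": ["doc A", "doc B"],
        "paragraphs": [["first", "second"], ["third"]],
    }
)
expanded = expand_paragraphs(docs)
print(expanded)
```

Looking the columns up by name rather than by position avoids depending on the column order of the input frame.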

nortz8 avatar Mar 29 '20 12:03 nortz8

Very good point, +1 @nortz8. However, your workaround did not work for me. I ended up with the following error: ValueError: empty vocabulary; perhaps the documents only contain stop words

Any idea why?
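
That message comes from scikit-learn's text vectorizers when no tokens survive tokenization and stop-word filtering, so one way to narrow it down is to look at what the expanded frame actually contains. The sketch below is a diagnostic only, not a confirmed cause: summarize_content is a hypothetical helper, and it assumes the expanded DataFrame has a content column as in the workaround.

```python
import pandas as pd


def summarize_content(df, content_col="content"):
    """Count rows whose content is not usable text (hypothetical helper)."""
    values = list(df[content_col])
    # Entries that are not strings at all (None, numbers, nested lists, ...).
    non_string = sum(not isinstance(v, str) for v in values)
    # Strings that are empty or contain only whitespace.
    blank = sum(isinstance(v, str) and not v.strip() for v in values)
    return {"rows": len(values), "non_string": non_string, "blank": blank}


# Made-up sample with one good row, one blank row, and two non-string rows.
sample = pd.DataFrame({"content": ["some text", "   ", None, 42]})
summary = summarize_content(sample)
print(summary)
```

If every row turns out to be blank or non-string, the vectorizer has nothing to build a vocabulary from, which would match the error; if the counts look healthy, the cause lies elsewhere.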

adjouama avatar Apr 29 '20 09:04 adjouama