openai-cookbook
Document Library Pre-Processing
Hello all,
Would it be at all possible to provide an example of document pre-processing where the dataset is not imported from Wikipedia, but instead loaded from an individual, standard CSV file?
For no less than 72 hours over the past week I've been trying to complete the question-answering tutorial (https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb) using my own dataset. The issue is that even though I downloaded the example CSV file, copied my own data into it, and re-saved it, I cannot get the dataset to run with the code. The code runs perfectly fine with the sample dataset, but when I run it with my data in its place (even ensuring it is always saved as a CSV file), it always errors past line 48. I have tried changing the data types in the columns using Python, I've tried removing any special characters, I've tried, I kid you not, about three days' worth of fixes with no luck. ChatGPT is now repeating recommendations without any success, unfortunately.
I continually receive this error:
```
ValueError                                Traceback (most recent call last)
Cell In [74], line 1
----> 1 prompt = construct_prompt(
      2     "What is a WOC Nurse?",
      3     document_embeddings,
      4     df
      5 )
      7 print("===\n", prompt)

Cell In [73], line 16, in construct_prompt(question, context_embeddings, df)
     13 document_section = df.loc[section_index]
     15 chosen_sections_len += document_section.tokens + separator_len
---> 16 if chosen_sections_len > MAX_SECTION_LEN:
     17     break
     19 chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:1442, in NDFrame.__nonzero__(self)
   1440 @final
   1441 def __nonzero__(self):
-> 1442     raise ValueError(
   1443         f"The truth value of a {type(self).__name__} is ambiguous. "
   1444         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1445     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
You can see in the dataset found here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSs9Ok5FUrhAOu_BnpLwV63bwpLylRtUWBDE7onAX1zrZW0Sz4gBEtBN-KtsBiC1DhKyhhZjNXfNf0i/pub?output=csv
that if you only use the first chapter, there is no issue; however, anything read past line 48 (it took a lot of trial and error to determine this) no longer works, and I either get the error noted above or an error stating that the system cannot read the JSON content.
My assumption is that the issue is either the way I tokenized the data or something in the content of the dataset itself; however, you can see that line 48 is only a standard paragraph with nothing special in it. Unfortunately, I am still quite new to Python, so any recommendations or assistance with this issue would be much appreciated. I'm about to give up on figuring out how to use my own dataset with OpenAI for question answering over embeddings, which is quite unfortunate.
Thank you so much for your assistance!
I unfortunately don't have time to help you debug, but I can offer a few pointers. The error about the truth value of a Series being ambiguous makes it sound like document_section.tokens has the type Series rather than int. This could happen if document_section is somehow a DataFrame instead of a row, meaning that grabbing the column gives you a Series rather than a single value.
My advice is to try printing out section_index to verify that it's an integer, and document_section to verify that it's a row of the dataframe. If they don't have those types, you might be able to follow the trail upstream to see what's going wrong.
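As a rough sketch (reusing the notebook's variable names df and most_relevant_document_sections, so adjust to your own setup), those checks could look like this:

```python
# Rough debugging sketch, assuming the notebook's variables already exist.
for _, section_index in most_relevant_document_sections[:3]:
    document_section = df.loc[section_index]
    print(repr(section_index))             # should be a single index key
    print(type(document_section))          # expect a pandas Series (one row)
    print(type(document_section.tokens))   # expect an int; a Series here triggers the error

# For reference, the ValueError itself comes from using a Series in a boolean
# context, e.g. `if pd.Series([10, 12]) > 5:` raises
# "The truth value of a Series is ambiguous."
```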
Thanks for your reply, Ted. I know you're a busy guy, but given that someone else just explained that they are also having issues with dataset preprocessing and getting it to work with the OpenAI system (https://github.com/openai/openai-cookbook/issues/76), I think the document you mentioned you would work on, outlining the process, would be very beneficial. Even if you don't assist with debugging my specific case, a general overview of the pre-processing steps for a personal or CSV dataset, instead of importing from Wikipedia, would be beneficial.
Appreciate the response.
Thanks Ted, signing off, have a nice night!
I had the same problem, I modified the code as follows to fix the issue (PS I'm a Python newbie so this might not be the best solution):
```python
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch relevant
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)

    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []

    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.
        # document_section = df.loc[section_index]
        document_section = df.loc[section_index].content[0]  # added the last part

        # chosen_sections_len += document_section.tokens + separator_len
        chosen_sections_len += len(encoding.encode(document_section)) + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break

        # chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections.append(SEPARATOR + document_section.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
```
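One possible root cause behind the original error (just a guess, not confirmed): if the (title, heading) index built during preprocessing contains duplicate entries, df.loc[section_index] returns a DataFrame instead of a single row, so document_section.tokens becomes a Series. A quick check along those lines, assuming the notebook's df:

```python
# Hypothetical sanity check on the preprocessed DataFrame (not part of the notebook).
# Duplicate (title, heading) entries make df.loc[...] return several rows at once.
dupes = df.index[df.index.duplicated()]
if len(dupes) > 0:
    print("Duplicate index entries:", list(dupes[:5]))
    # Keep only the first occurrence so df.loc[...] returns a single row again:
    df = df[~df.index.duplicated(keep="first")]
```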
@ted-at-openai : In your Question_answering_using_embeddings notebook Section 2) Find the most similar document embeddings to the question embedding, you have:
```
In [11]: order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)
Out[11]: [(0.884864308450606, ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary'))]
```
Correct? Then how can section_index in

```python
most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)

for _, section_index in most_relevant_document_sections:
    document_section = df.loc[section_index]
```

be an integer, as you mentioned above in your comment? Am I missing something here?
See:

```python
most_relevant_document_sections = [(0.884864308450606, ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary'))]

for _, section_index in most_relevant_document_sections:
    print(section_index)

# Output:
# ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
```
I think for Python 3.9.x, section_index comes back as a tuple of strings, and that's what's yielding the KeyError.
It's most likely a bug, unless I'm understanding it differently than intended.
@eu400000's modification works as a hack, though.
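For what it's worth, a tuple key by itself shouldn't be a problem: the notebook indexes df by (title, heading), so .loc with a tuple returns a single row there. A minimal sketch with made-up data (not from the notebook) to illustrate:

```python
import pandas as pd

# Made-up example mirroring the notebook's (title, heading) MultiIndex.
df = pd.DataFrame(
    {"content": ["First section text.", "Second section text."], "tokens": [5, 6]},
    index=pd.MultiIndex.from_tuples(
        [("Article A", "Summary"), ("Article A", "History")],
        names=["title", "heading"],
    ),
)

row = df.loc[("Article A", "Summary")]  # tuple key -> a single row (a Series)
print(type(row), row.tokens)            # <class 'pandas.core.series.Series'> 5

# A KeyError with a tuple key usually means the DataFrame built from your own
# CSV wasn't re-indexed the same way, or the tuple doesn't exactly match an
# index entry.
```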
I will take a look at this. Thanks for flagging. I believe it worked for me (also Python 3.9), but I will do a rewrite for gpt-3.5 and verify the correct behavior for the updated version.
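In the meantime, a rough sketch of the CSV preprocessing step (not the official example; it assumes a CSV with title, heading, and content columns and uses tiktoken for counting tokens) might look like:

```python
import pandas as pd
import tiktoken

# Rough sketch of preparing a custom CSV for the question-answering notebook.
# Column names (title, heading, content) and the encoding are assumptions here.
encoding = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv("my_documents.csv")
df = df.dropna(subset=["title", "heading", "content"])
df["tokens"] = df["content"].apply(lambda text: len(encoding.encode(str(text))))
df = df.set_index(["title", "heading"])
df = df[~df.index.duplicated(keep="first")]  # .loc must return exactly one row per key
print(df.head())
```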
The rewrite was completed in April. Will close this issue.