The result from QuestionGenerator cannot be written out to JSON
Describe the bug
The output of result = question_generation_pipeline.run(documents=[document])
cannot be written straight to JSON. I believe the error has to do with a Document object sitting somewhere in this structure. The result is a list of dicts, which seems like it should be convertible to JSON without much trouble.
Error message json.dump raises an error about Document not being serializable, or something of that sort.
Expected behavior A clear, easy way to write the output to a JSON file.
To Reproduce Steps to reproduce the behavior
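A minimal snippet along these lines reproduces it for me (illustrative only; it assumes the v1 haystack.schema.Document class):

import json
from haystack.schema import Document

# A Document buried inside an otherwise plain dict, like the pipeline output
result = {"documents": [Document(content="Label A")]}

# Raises: TypeError: Object of type Document is not JSON serializable
json.dumps(result)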
FAQ Check
- [ ] Have you had a look at our new FAQ page?
System:
- OS: Win 10
- GPU/CPU: GPU
- Haystack version (commit or version number): 1.20.0
- DocumentStore: InMemoryDocumentStore
- Reader:
- Retriever:
It should be noted that the following is a workaround which writes things to disk (though maybe not fully).
with open('crazy.txt', 'w', encoding='utf-8') as fs:
    for q in questions:
        fs.write(str(q))
Hey @demongolem-biz2, the actual error message could help us fix this issue. Can you share a stacktrace?
I don't currently have a stack trace available, but perhaps some sample output will help. Here is the string representation, which shows what is going on. The "documents" entry holds a Document object, which is where the problem arises. It looks like Document is not serializable. Could the two None values inside also be part of the problem? What would the result look like if we had None values? Would they simply become null, and how and where would that conversion take place?
{ "generated_questions": [{ "document_id": "f2650b87c0540a143db8b56cc9468d1e", "document_sample": "Label A", "questions": ["What does Label A contain?"] }], "documents": [ < Document: { "content": "Label A", "content_type": "text", "score": None, "meta": { "Unnamed: 0": 0, "label": "Label_A.txt_0" }, "id_hash_keys": ["content"], "embedding": None, "id": "f2650b87c0540a143db8b56cc9468d1e" } > ], "root_node": "Query", "params": {}, "node_id": "QuestionGenerator" }
Hey, thanks for opening the issue, maybe I can help: why do you need this feature? Do you want to save the results, or do you want to use them in a REST API? The Document class has to_dict() and str() methods; they should help you serialize it.
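For example, a rough sketch (the to_serializable helper is just an illustrative name, and result is the pipeline output from your snippet above):

import json
from haystack.schema import Document

def to_serializable(obj):
    # json.dumps calls this for anything it cannot serialize natively
    if isinstance(obj, Document):
        return obj.to_dict()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

json_object = json.dumps(result, indent=4, default=to_serializable)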
Apart from that, the QuestionGenerator is rather outdated; I would use LLMs for this. LLMs work much better at generating questions. See, for example, a more general approach to data generation here: https://arxiv.org/abs/2309.09582. Here you can see the usage of LLMs for question generation: https://prompthub.deepset.ai/?prompt=deepset%2Fconditioned-question-generation
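In Haystack 1.x that would look roughly like this (an untested sketch; the template is fetched from the PromptHub by name, and you should check the template page for its exact input variables):

from haystack.nodes import PromptNode
from haystack.schema import Document

# Any supported model works here; gpt-3.5-turbo is just an example
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
)

documents = [Document(content="Label A")]

# The template name is resolved against the PromptHub
questions = prompt_node.prompt(
    prompt_template="deepset/question-generation",
    documents=documents,
)
print(questions)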
Thank you for both pieces.
On the first point: this is the case which I think ought to be a little easier. Consider this code snippet:
import json

result = rqg_pipeline.run(query='Wave Motion')

# this yields a dictionary
print(type(result))

# this is how the tutorials would have you display results
print_questions(result)

# this is what I want to do, but there are Document objects buried in the dictionary
json_object = json.dumps(result, indent=4)
For this result dict that is returned from the RetrieverQuestionGenerationPipeline, what is the easiest way to make the entire dict serializable? I don't fancy iterating over all keys to see where the Document objects are buried.
OK, to be fair and to shy away from hand-waving, the following line makes the results JSON-serializable for me:
result['documents'][:] = [dment.to_dict() for dment in result['documents']]
After this, the output of RetrieverQuestionGenerationPipeline is plain JSON-serializable data and can be manipulated with any JSON library. It seems a little too dirty, but it works.
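So the full write-out ends up being (the file name is just an example):

import json

# Replace each Document in place with its plain-dict form
result['documents'][:] = [dment.to_dict() for dment in result['documents']]

with open('questions.json', 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=4)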
Thanks for finding a quick fix.
If you want, please open a PR for the QuestionGenerator. Having said this, using LLMs for it is the much better approach currently, and we are also working on Haystack 2.0, so I'm not sure the QuestionGenerator will end up in version 2.x in its current form.
"Here you can see the usage of LLMS for question generation: https://prompthub.deepset.ai/?prompt=deepset%2Fconditioned-question-generation"
I see two options in the PromptHub: conditioned-question-generation and question-generation. Is either one OK for generative question generation? My use case is: given a number of documents, produce a question (or maybe multiple questions) for each document. I don't see how conditioned-question-generation would work, in that you already need to know the answers, and generating good answers is part of my problem. I have already written code for question-generation and run it.
If nothing quite pans out, I will do as suggested and try to fine-tune a model externally for question generation. I don't see why question generation necessarily has to be prompt-based with an LLM; I am just looking to create the most important questions using whatever methodology works. I am sure question quality is a difficult problem to tackle (and maybe just as difficult to define).