llama_index
collapse JSON portions based on length
I was thinking about this problem: instead of special-casing the collapsing of simple lists and dictionaries, why not collapse by the length of the JSON?
This avoids the need for a potentially large number of flags, and lets the user control what is probably the most important factor, the length of the input.
I still have to test this. One thing I'm not 100% sure about is how well the embeddings model will handle long strings of JSON. We could de-JSONify it for the embeddings (or, if we wanted to be even more sophisticated, give the de-JSONed version to the embeddings model and the JSON version to GPT, which does relatively well with JSON).
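For illustration, a minimal sketch of what collapsing by length could look like. This is not the code in this PR: the function name `collapse_by_length` and the parameter `max_len` are hypothetical, and the choice to prefix each output line with the path of keys is an assumption.

```python
import json
from typing import Any, Iterator, List, Optional


def collapse_by_length(
    node: Any, max_len: int, path: Optional[List[str]] = None
) -> Iterator[str]:
    """Yield one text line per leaf, or per sub-structure short enough to collapse.

    Hypothetical helper: a dict or list whose compact JSON form is at most
    `max_len` characters is emitted as a single line instead of being expanded.
    """
    path = path or []
    if isinstance(node, (dict, list)):
        compact = json.dumps(node)
        if len(compact) <= max_len:
            # Short enough: keep the JSON as-is on one line.
            yield " ".join(path + [compact])
            return
        if isinstance(node, dict):
            for key, value in node.items():
                yield from collapse_by_length(value, max_len, path + [str(key)])
        else:
            for value in node:
                yield from collapse_by_length(value, max_len, path)
    else:
        # Scalar leaf: emit the path of keys followed by the value.
        yield " ".join(path + [str(node)])


data = {"a": {"x": 1, "y": 2}, "b": [1, 2, 3], "c": "hello"}
for line in collapse_by_length(data, max_len=20):
    print(line)
```

With a single length threshold, the same call handles simple lists and simple dictionaries without a separate flag for each; setting `max_len=0` in this sketch expands everything down to the leaves.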
Seems like a good idea to me!
Also curious/minor concern about impact on embedding and completion performance. Would be amazing if you could do a quick test with some simple data to analyze the difference in performance and report back 🚀
cc @Disiok
looks pretty good to me!
@yisding could you add a unit test in tests/readers/test_json.py? If that's too tricky, no worries. At the very least, do you have a notebook snippet showing what the output looks like?
Just pushed the test cases.
So I tested, and JSON in general doesn't match up very well with the input question when it comes to embeddings. I think what will be needed is a mechanism similar to LLMONbrothers https://github.com/yisding/llmon where we have a pair of strings, one for matching the input question and the other for embedding in the prompt.
In any case, I don't think collapsing small pieces of JSON is a bad idea in itself, and it may have other applications, so I think it's still worth merging.
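A rough sketch of the pair-of-strings idea, to make it concrete. Everything here is hypothetical: `TextPair`, `de_jsonify`, and `make_pair` are illustrative names, the flattening rule is just one possible choice, and none of this is taken from the linked repository or from this PR.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class TextPair:
    match_text: str   # de-JSONified text, intended for the embeddings model
    prompt_text: str  # original JSON, intended for the LLM prompt


def de_jsonify(node: Any) -> str:
    """Flatten JSON into plain words by dropping braces, brackets, and quotes."""
    if isinstance(node, dict):
        return " ".join(f"{key} {de_jsonify(value)}" for key, value in node.items())
    if isinstance(node, list):
        return " ".join(de_jsonify(value) for value in node)
    return str(node)


def make_pair(node: Any) -> TextPair:
    """Build both strings from the same piece of JSON."""
    return TextPair(match_text=de_jsonify(node), prompt_text=json.dumps(node))


pair = make_pair({"name": "Ada", "langs": ["en", "fr"]})
print(pair.match_text)   # string to embed and match against the question
print(pair.prompt_text)  # string to place in the prompt
```

The embedding call itself is left out; the point is only that retrieval would run against `match_text` while the prompt receives `prompt_text`.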