llama_index collapse JSON portions based on length

I was thinking about this problem, and instead of special-casing the collapse of simple lists and dictionaries, why not collapse by the length of the JSON?

This avoids the need for a potentially large number of flags, and lets the user control what is probably the most important factor: the length of the input.
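
To make the idea concrete, here is a minimal sketch of length-based collapsing. It is not the code in this PR: the function name `collapse_by_length` and the parameter `collapse_length` are hypothetical, and the choice to prefix each line with its key path is an assumption made for illustration.

```python
import json
from typing import Any, Iterator, Tuple


def collapse_by_length(
    data: Any, collapse_length: int, path: Tuple[str, ...] = ()
) -> Iterator[str]:
    """Yield one text line per scalar leaf or per collapsed subtree.

    Hypothetical helper for illustration only.
    """
    if isinstance(data, (dict, list)):
        serialized = json.dumps(data)
        # Small enough: emit the whole subtree as a single JSON string.
        if len(serialized) <= collapse_length:
            yield " ".join(path + (serialized,))
            return
        # Too long: descend. Dict keys extend the path; list items
        # inherit the parent path (no index is added).
        if isinstance(data, dict):
            children = [(path + (str(key),), value) for key, value in data.items()]
        else:
            children = [(path, value) for value in data]
        for child_path, value in children:
            yield from collapse_by_length(value, collapse_length, child_path)
    else:
        # Scalar leaf: emit it together with its key path.
        yield " ".join(path + (json.dumps(data),))


data = {"name": "Ada", "tags": ["a", "b"], "address": {"city": "London", "zip": "N1"}}
for line in collapse_by_length(data, collapse_length=20):
    print(line)
```

With a single threshold, the short `tags` list stays on one line while the longer `address` object is split into one line per field, so no per-type flag is needed.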

I still have to test this. One thing I'm not 100% sure about is how well the embedding model will handle long strings of JSON. We could de-JSONify it for the embeddings (or, if we wanted to be even more sophisticated, give the de-JSONed version to the embedding model and the JSON version to GPT, which does relatively well with JSON).
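
As a rough illustration of what "de-JSONifying" could mean, the sketch below flattens parsed JSON into plain space-separated words. The helper name `dejsonify` is hypothetical, and dropping all punctuation is only one possible choice; nothing here is taken from the PR itself.

```python
from typing import Any, List


def dejsonify(data: Any) -> str:
    """Flatten parsed JSON into plain words, dropping braces, quotes, and commas.

    Hypothetical helper for illustration only.
    """
    parts: List[str] = []

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            # Keep keys as words so the text retains some context.
            for key, value in node.items():
                parts.append(str(key))
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        else:
            # Scalars use Python's str(), so None becomes "None".
            parts.append(str(node))

    walk(data)
    return " ".join(parts)


print(dejsonify({"city": "London", "tags": ["a", "b"], "pop": 9}))
```

The flattened text would go to the embedding model, while the original JSON string would still be what ends up in the prompt.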

Mar 10 '23 03:03 yisding

Seems like a good idea to me!

I still have to test this. One thing I'm not 100% sure about is how well the embedding model will handle long strings of JSON. We could de-JSONify it for the embeddings (or, if we wanted to be even more sophisticated, give the de-JSONed version to the embedding model and the JSON version to GPT, which does relatively well with JSON).

Also curious/minor concern about impact on embedding and completion performance. Would be amazing if you could do a quick test with some simple data to analyze the difference in performance and report back 🚀

Mar 13 '23 17:03 Disiok

cc @Disiok

looks pretty good to me!

@yisding could you add a unit test in tests/readers/test_json.py? If that's too tricky, no worries. At the very least, do you have a notebook snippet showing what the output looks like?

Just pushed the test cases.

Mar 15 '23 14:03 yisding

So I tested, and JSON in general doesn't match up very well with the input question when it comes to embeddings. I think what we'll need is a mechanism similar to LLMONbrothers https://github.com/yisding/llmon, where we have a pair of strings: one for matching the input question and the other for including in the prompt.

In any case, I don't think collapsing small pieces of JSON is a bad idea in itself, and it may have other applications, so I think it's still worth merging.
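
The pair-of-strings mechanism described above could look roughly like the sketch below. All names (`DualText`, `token_overlap`, `pick_context`) are hypothetical, and the token-overlap score is only a toy stand-in for real embedding similarity so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DualText:
    match_text: str   # plain text, used only for matching the question
    prompt_text: str  # JSON text, placed into the prompt


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity; a stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def pick_context(
    question: str,
    chunks: List[DualText],
    score: Callable[[str, str], float] = token_overlap,
) -> str:
    """Rank chunks by their match_text, but return the winner's prompt_text."""
    best = max(chunks, key=lambda chunk: score(question, chunk.match_text))
    return best.prompt_text


chunks = [
    DualText("address city London zip N1", '{"address": {"city": "London", "zip": "N1"}}'),
    DualText("name Ada tags a b", '{"name": "Ada", "tags": ["a", "b"]}'),
]
print(pick_context("which city is the address in", chunks))
```

The point of the split is that the retrieval step never sees JSON syntax, while the completion model still receives the structured form.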

Mar 15 '23 14:03 yisding
