private-gpt add JSON source-document support

Is your feature request related to a problem? Please describe. It's not always easy to convert JSON documents to CSV when nesting or arbitrary arrays of objects are involved, so this is not just a question of converting JSON data to CSV. JSON to Markdown might work a little better but still seems dirty. Extracting the data from the JSON model and uploading it as plain text seems a crime, since all the context and positional relationships are lost. It is actually a bit surprising that JSON isn't natively supported when CSV is. Hopefully not a huge lift.

Describe the solution you'd like: Add the JSON file format as a natively/directly supported source-document type (as documented in: https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset ).

Describe alternatives you've considered: Converting JSON to some other format like Markdown or CSV, neither of which is super clean or always possible.

May 23 '23 21:05 mikee-gwu

Maybe you're referring to a langchain agent that can read JSON?

May 28 '23 17:05 sime2408

Or maybe he is referring to https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/json.html

With it you can load content and metadata quite easily.

Jun 02 '23 13:06 borel

Hi guys, I was referring to this part of the documentation:

https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset

It supports several ways of importing data from files including CSV, PDF, HTML, MD etc. but JSON is not on the list of documents that can be ingested. It seems JSON is missing from that list given that CSV and MD are supported and JSON is somewhat adjacent to those data formats.

If it is not possible to ingest JSON data this way, perhaps update this section of the documentation (https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset ) with a note such as "* to ingest JSON data, use the langchain agent or the JSON document loader"?

Jun 02 '23 13:06 mikee-gwu

@mikee-gwu try to experiment with these, maybe it helps

from langchain.document_loaders import (
    DirectoryLoader,
    JSONLoader,
    TextLoader,
)


def load_json():
    # Option 1: read each .json file as plain text.
    # loader = DirectoryLoader("source_documents", glob="**/*.json", show_progress=True, loader_cls=TextLoader)
    # Option 2: parse each file with JSONLoader; jq_schema selects the text to index.
    loader = DirectoryLoader(
        "source_documents",
        glob="**/*.json",
        show_progress=True,
        loader_cls=JSONLoader,
        loader_kwargs={"jq_schema": ".content"},
    )

    documents = loader.load()
    print(f"document count: {len(documents)}")
    print(documents[0] if documents else None)
    return documents

Jun 02 '23 14:06 sime2408

Hey Mike, it's possible to do it:

from langchain.document_loaders import (
    CSVLoader,
    JSONLoader,
)

and

LOADER_MAPPING = {
    ".csv": (CSVLoader, {}),
    # Add more mappings for other file extensions and loaders as needed
    ".json": (JSONLoader, {"jq_schema": ".[].full_text"}),
}

You can also tune this to attach some metadata.

(Quite similar to @sime2408's solution, by the way.)
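As a rough sketch of the metadata tuning mentioned above: JSONLoader accepts a `metadata_func` callable together with `content_key`, so the loader kwargs in the `.json` mapping entry could look something like the following. The field names `id` and `created_at` are hypothetical, and this assumes the mapping's second item is passed straight to the loader as keyword arguments.

```python
from typing import Any, Dict


def metadata_func(record: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy selected fields from one JSON record into the document metadata."""
    # "id" and "created_at" are hypothetical field names; use the ones in your data.
    for key in ("id", "created_at"):
        if key in record:
            metadata[key] = record[key]
    return metadata


# Keyword arguments for JSONLoader, intended as the second item of the
# ".json" entry in LOADER_MAPPING (assumption: they are forwarded as-is).
JSON_LOADER_KWARGS = {
    "jq_schema": ".[]",          # iterate over a top-level array of objects
    "content_key": "full_text",  # field of each record that becomes page_content
    "metadata_func": metadata_func,
}
```

Note that with `content_key` set, `jq_schema` has to select whole objects (`.[]`) rather than the text field itself, so that the other fields are still available to `metadata_func`.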

Jun 02 '23 14:06 borel

thanks, I will experiment with these options as a way to import JSON data -- really appreciate the advice.

In terms of this ticket as an RFE, does the following still hold?

Describe the solution you'd like: add the JSON file format as a natively/directly supported source-document. (as documented in: https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset )

Jun 02 '23 14:06 mikee-gwu

@mikee-gwu did that work? Were you able to load the JSON data?

Jun 03 '23 06:06 Anoupz

I was not able to get the JSON loader to work with my schema, so I wrote a small util to convert JSON to CSV and uploaded the CSV file.
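For anyone taking the same route, a utility like this might look roughly as follows. This is an illustrative sketch using only the standard library, not the code actually used here; `flatten` and `json_to_csv` are hypothetical names, and it assumes the input is a JSON array of objects.

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into one dict keyed by dotted paths."""
    if isinstance(value, dict):
        items = [(str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(str(i), v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat: Dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(child, path))
    return flat


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of objects into CSV text, one row per object."""
    records = json.loads(json_text)
    rows = [flatten(record) for record in records]
    # Use the union of all columns, in first-seen order, since records may differ.
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The dotted column names (for example `b.c.0`) keep some of the positional context that a naive conversion loses, though empty objects and arrays are dropped and wide or ragged arrays produce many sparse columns.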

Jun 08 '23 13:06 Anoupz

I was able to import a JSON file.

Jun 08 '23 13:06 borel

@borel, can you please share your code snippet?

Jun 08 '23 13:06 Anoupz

@borel can you please share your code snippet?

Jul 10 '23 14:07 hakankarakaya

I got the error: ValueError: Expected page_content is string, got <class 'NoneType'> instead. Set text_content=False if the desired input for page_content is not a string
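This error generally means the `jq_schema` expression selected something that is not a string (here `null`, for instance because the key is missing in some record). One way to find a usable expression is to list the jq-style paths in a sample file along with the type of value at each one. The helper below is a standard-library sketch; `leaf_paths` is a hypothetical name, it only looks at the first element of each array, and it does not quote keys containing special characters.

```python
import json
from typing import Any, Iterator, Tuple


def leaf_paths(value: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (jq-style path, type name) for every leaf in parsed JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{path}.{key}")
    elif isinstance(value, list):
        # Only inspect the first element; arrays are assumed to be homogeneous.
        if value:
            yield from leaf_paths(value[0], f"{path or '.'}[]")
    else:
        yield (path or ".", type(value).__name__)


# Hypothetical sample data; load your own file with json.load() instead.
sample = json.loads('[{"full_text": "hello", "meta": {"views": 3, "note": null}}]')
for jq_path, type_name in leaf_paths(sample):
    print(jq_path, type_name)
```

Paths reported as `str` are candidates for `jq_schema`; for paths with other types, the error message itself suggests passing `text_content=False` to the loader.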

Aug 31 '23 04:08 edwinjosechittilappilly
