private-gpt
add JSON source-document support
Is your feature request related to a problem? Please describe. It's not always easy to convert JSON documents to CSV (when there is nesting or arbitrary arrays of objects involved), so it's not just a question of converting JSON data to CSV. JSON to Markdown might work a little better but still seems dirty. Extracting data from the JSON model to upload as plain text seems a crime, since all the context and positional relationships are lost. It is actually a bit surprising that the JSON format isn't natively supported while CSV is. Hopefully not a huge lift.
Describe the solution you'd like. Add the JSON file format as a natively/directly supported source document (as documented in: https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset ).
Describe alternatives you've considered. Converting JSON to some other format like Markdown or CSV, neither of which is particularly clean or always possible.
Maybe you're referring to a LangChain agent that can read JSON?
Or maybe he is referring to https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/json.html
You can upload content and metadata quite easily with it.
Hi guys, I was referring to this part of the documentation:
https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset
It supports several ways of importing data from files, including CSV, PDF, HTML, MD, etc., but JSON is not on the list of documents that can be ingested. JSON seems like an odd omission given that CSV and MD are supported and JSON is adjacent to those formats.
If it's not possible to ingest JSON data in this way, perhaps update that section of the documentation (https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset ) with a note along the lines of "* to ingest JSON data, use the LangChain agent or the JSON document loader"?
@mikee-gwu try experimenting with these, maybe it helps:
from langchain.document_loaders import (
    DirectoryLoader,
    TextLoader,
    JSONLoader,
)

def load_json():
    # Load every *.json file under source_documents as plain text:
    # loader = DirectoryLoader("source_documents", glob='**/*.json', show_progress=True, loader_cls=TextLoader)
    # or maybe with JSONLoader, pulling the text out of the "content" field:
    loader = DirectoryLoader(
        "source_documents", glob='**/*.json', show_progress=True, loader_cls=JSONLoader,
        loader_kwargs={'jq_schema': '.content'}
    )
    documents = loader.load()
    print(f'document count: {len(documents)}')
    print(documents[0] if len(documents) > 0 else None)
    return documents
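For what it's worth, here is a rough sketch of how the documents returned by load_json() could then be chunked before embedding. The splitter choice and the chunk_size/chunk_overlap values are just placeholders on my side; align them with whatever your local ingest.py actually uses:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_json_documents():
    # Load the JSON files with the loader defined above
    documents = load_json()
    # Split into chunks; these sizes are illustrative, match them to ingest.py
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    texts = splitter.split_documents(documents)
    print(f'chunk count: {len(texts)}')
    return texts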
Hey Mike, it's possible to do it:
from langchain.document_loaders import (
    CSVLoader,
    JSONLoader,
)
and
LOADER_MAPPING = {
    ".csv": (CSVLoader, {}),
    # Add more mappings for other file extensions and loaders as needed
    ".json": (JSONLoader, {"jq_schema": ".[].full_text"}),
}
You can also tune it to include some metadata (quite similar to @sime2408's solution, btw).
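In case it's useful, a minimal sketch of what that metadata tuning could look like via JSONLoader's metadata_func hook. The file path and the field names (full_text, author, created_at) are just placeholders for your own schema:

from langchain.document_loaders import JSONLoader

def extract_metadata(record: dict, metadata: dict) -> dict:
    # Copy selected fields from each JSON record into the document metadata
    metadata["author"] = record.get("author")
    metadata["created_at"] = record.get("created_at")
    return metadata

loader = JSONLoader(
    file_path="source_documents/example.json",
    jq_schema=".[]",                 # iterate over the top-level array of objects
    content_key="full_text",         # field to use as page_content
    metadata_func=extract_metadata,
)
documents = loader.load()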
Thanks, I will experiment with these options as a way to import JSON data -- really appreciate the advice.
In terms of this ticket as an RFE request, does the following still hold?
Describe the solution you'd like: add the JSON file format as a natively/directly supported source document (as documented in: https://github.com/imartinez/privateGPT#instructions-for-ingesting-your-own-dataset ).
@mikee-gwu did that work? Were you able to load the JSON data?
I was not able to get the JSON loader to work with my schema, so I wrote a small utility to convert JSON to CSV and uploaded the CSV file.
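In case it helps anyone going the same route, a minimal sketch of that kind of conversion, assuming the JSON file is a top-level list of records (pandas.json_normalize flattens nested objects into dotted column names):

import json
import pandas as pd

def json_to_csv(json_path: str, csv_path: str) -> None:
    # Read the JSON file (expects a top-level list of objects)
    with open(json_path) as f:
        records = json.load(f)
    # Flatten nested objects into dotted column names, e.g. "user.name"
    df = pd.json_normalize(records)
    df.to_csv(csv_path, index=False)

json_to_csv("source_documents/data.json", "source_documents/data.csv")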
I was able to import a JSON file.
@borel, can you please share your code snippet?
I got the error:
ValueError: Expected page_content is string, got <class 'NoneType'> instead. Set `text_content=False` if the desired input for `page_content` is not a string
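If anyone else hits that error, passing text_content=False is one way around it: with that flag JSONLoader converts non-string values instead of raising (a missing field should come back as an empty string). A sketch, reusing the jq_schema from earlier in the thread:

from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="source_documents/data.json",
    jq_schema=".[].full_text",
    text_content=False,  # don't raise when the extracted value isn't a string
)
documents = loader.load()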