[Feature] LocalDocs support for CSV, JSON, XML

Open tbennett6421 opened this issue 4 months ago • 5 comments

Feature Request

Please add the ability to parse CSV, JSON, and XML files to the roadmap for gpt4all LocalDocs. LLMs are prone to making things up, so I intend to use LocalDocs to supply databases of concrete items. Most such data will be in CSV, JSON, or XML format.

LocalDocs currently supports plain text files (.txt, .md, and .rst) and PDF files (.pdf).

Example use cases:

  • dumping logs into a folder, and asking questions about the data.
  • dumping databases into a folder, requesting experimental data such as (mw, mp/fp, solubility)
  • dumping financial spreadsheets, and asking questions about transcripts.
  • and more

tbennett6421 avatar Feb 29 '24 19:02 tbennett6421

Localdocs currently does not have any support for custom file parsing though this would be a nice addition.

manyoso avatar Mar 10 '24 15:03 manyoso

I concur. Right now you have to rename your .csv files to .txt.
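For anyone who wants to script that workaround, here is a minimal sketch. It copies files rather than renaming them, so the originals stay untouched. The function name `copy_as_txt`, the extension set, and the `data.csv.txt` naming scheme are all assumptions made for illustration; nothing here is part of GPT4All itself.

```python
from pathlib import Path
import shutil

# Extensions discussed in this issue (an assumption; adjust as needed).
STRUCTURED_EXTS = {".csv", ".json", ".xml"}


def copy_as_txt(src_dir, dst_dir):
    """Copy structured text files from src_dir into dst_dir with '.txt' appended.

    Hypothetical helper: LocalDocs would then be pointed at dst_dir.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() in STRUCTURED_EXTS:
            # Keep the original name so the source format stays visible:
            # data.csv -> data.csv.txt
            target = dst / (path.name + ".txt")
            shutil.copyfile(path, target)
            copied.append(target)
    return copied
```

Calling `copy_as_txt("exports", "localdocs_in")` (both folder names are placeholders) and then adding the second folder as a LocalDocs collection should behave the same as renaming by hand.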

By the way, does anyone know which models are fastest for this kind of thing? I'm currently using Nous Hermes 2 Mistral DPO on a CSV file renamed to .txt, but it is rather slow.

mishaxz avatar Mar 10 '24 22:03 mishaxz

What would it take to implement some kind of parser in LocalDocs? I'd be willing to look at doing a PR for it, either in Python or in C.

tbennett6421 avatar Mar 10 '24 23:03 tbennett6421

#1344 could help address one of those points above:

> dumping databases into a folder, requesting experimental data such as (mw, mp/fp, solubility)

specifically when gpt hallucinates or makes up empirically measured data.

tbennett6421 avatar Mar 12 '24 02:03 tbennett6421

> Localdocs currently does not have any support for custom file parsing though this would be a nice addition.

Since these are plain text formats, a minimum-effort implementation would be to just add them back to the whitelist. When I removed them, I wanted to start with a clean slate, because that list contained a lot of formats that, even if they worked, didn't seem like anything anyone would be using.

I don't think it makes sense to use the LocalDocs feature as-is to process structured input, since it breaks the input into chunks and destroys the global structure. Still, it clearly worked well enough for a few people in the past. A slightly more useful implementation would, for example, keep the header row in each snippet of CSV, and keep the outer structure for XML and JSON.
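To make the CSV half of that idea concrete, here is a rough Python sketch of header-preserving chunking. It is not LocalDocs code: the function name `chunk_csv_with_header` and the `rows_per_chunk` parameter are hypothetical, and it splits by row count, whereas a real implementation would presumably split by whatever snippet size LocalDocs uses.

```python
import csv
import io


def _render(header, rows):
    """Serialize a header plus some rows back into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def chunk_csv_with_header(text, rows_per_chunk=50):
    """Split CSV text into chunks that each begin with the header row.

    Hypothetical helper for illustration; chunk size is counted in rows.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to chunk

    chunks, rows = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        rows.append(row)
        if len(rows) == rows_per_chunk:
            chunks.append(_render(header, rows))
            rows = []
    if rows:
        chunks.append(_render(header, rows))  # trailing partial chunk
    return chunks
```

Because every chunk repeats the column names, a retrieved snippet still says what each value means even when it comes from the middle of the file. The same principle would apply to JSON and XML by repeating the enclosing keys or elements around each fragment, though that is not sketched here.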

cebtenzzre avatar Mar 12 '24 19:03 cebtenzzre