
[Feature] Process local files without localdocs

Open mkammes opened this issue 10 months ago • 2 comments

Feature Request - use documents without LocalDocs processing

One use case: extracting docx data to JSON, for cleaning data either for fine-tuning models or for LocalDocs. This feature would require access to the raw file, not the output of the LocalDocs indexing process. I find PDF and docx extraction has its limitations when using LocalDocs, so I'd like to clean the data myself and put it into a custom JSON schema.
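Until such a feature exists, the extraction step can be done outside GPT4All. A minimal sketch of the docx-to-JSON case using only the Python standard library (a .docx file is a zip archive whose text lives in `word/document.xml`; the output schema here is a hypothetical example, not anything GPT4All defines):

```python
import json
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by docx documents
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_json(path):
    """Extract paragraph text from a .docx file into a simple JSON structure.

    `path` may be a filename or any file-like object zipfile accepts.
    """
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across one or more <w:t> runs
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t"))
        if text.strip():
            paragraphs.append(text)
    return json.dumps({"source": str(path), "paragraphs": paragraphs}, indent=2)
```

The resulting JSON could then be cleaned by hand or by script before being fed to a fine-tuning pipeline or back into LocalDocs as plain text.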

mkammes avatar Mar 31 '24 19:03 mkammes

Most of the local LLMs you can currently use in GPT4All have a maximum context length of 4096 tokens - feed them any more data, and information from the beginning of the document will be lost. Are you working with fairly small documents (under a few thousand words), or do you e.g. have a lot of VRAM and intend to use a model fine-tuned on very long contexts?

cebtenzzre avatar Apr 01 '24 21:04 cebtenzzre

Mainly short documents. My use case is product manuals accompanied by PDF/text software update docs (what's new, etc.). While the manuals aren't small, the update release docs are. Local processing is done on a 4070.

mkammes avatar Apr 01 '24 22:04 mkammes