gpt4all
[Feature] Process local files without localdocs
Feature Request - use documents without localdoc processing
One use case is docx data extraction to JSON, for cleaning data either for fine-tuning models or for LocalDocs itself. This feature would require access to the raw file, rather than the output of the LocalDocs indexing process. I find PDF and docx extraction has its limitations when going through LocalDocs, so I'd like to clean the data myself and put it into a custom JSON schema.
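For reference, the kind of extraction described above can be done today outside GPT4All with a short script. This is just a sketch of my own, not anything in the GPT4All codebase: a .docx file is a ZIP archive whose body text lives in `word/document.xml`, so the standard library alone can pull the paragraphs out. The `"source"`/`"paragraphs"` schema is purely illustrative.

```python
import json
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by document.xml element tags
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_json(path: str) -> str:
    """Extract raw paragraph text from a .docx into a JSON record."""
    with zipfile.ZipFile(path) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text is the concatenation
    # of the <w:t> runs inside it.
    for p in root.iter(f"{W_NS}p"):
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t"))
        if text.strip():
            paragraphs.append(text)
    # "source" and "paragraphs" are an illustrative schema, not a standard
    return json.dumps({"source": path, "paragraphs": paragraphs}, indent=2)
```

From there the JSON can be reshaped into whatever fine-tuning or LocalDocs format is needed.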
Most of the local LLMs you can currently use in GPT4All have a maximum context length of 4096 tokens - feed them any more data, and information from the beginning of the document will be lost. Are you working with fairly small documents (under a few thousand words), or do you e.g. have a lot of VRAM and intend to use a model finetuned on very long contexts?
Mainly short documents. My use case is product manuals that come with PDF/text software update docs (what's new, etc.). While the manuals aren't small, the update release docs are. Local processing is done on a 4070.