
feat: users can add epub and txt files for RAG retrieval functions in Jan

Open lineality opened this issue 1 year ago • 1 comments

Problem Retrieval should accept more types of documents. Users can (from what I can see) only add PDF documents for Retrieval. This is not ideal for various reasons. "PDF" is very much not a standardized text storage format, and perhaps no PDF application reliably extracts text from even most PDFs. Epub is highly organized, standardized, and readable, and txt and other plain-text documents such as .md should be easy to use with RAG (LLMs have no problem with markdown that I have seen). Even raw html might work. Most of a user's documents, from personal finance to books in the Humble Bundle collection, are not in a usable PDF format. (see provided code for epub text extraction below)

Success Criteria Users should be able to perform retrieval functions on any common standardized files they have (which, oddly, may exclude PDF, which isn't standardized). txt, md, and even epub should be low-hanging fruit; maybe docx, .odf, and rtf too.

Additional context For epub files:

Hopefully code like this (see most recent version) https://github.com/lineality/epub_ingestion_python/blob/main/epub_injestion_jsonl_txt_sized_chunks_v21.py

will be useful in letting users of Jan use epub-books with their Jan Retrieval uses.

This code extracts text from epub books and exports the text into a variety of formats: txt, json, jsonl, and can chunk to specific sizes without cutting words or sentences in half, to better retain meaning.
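To illustrate the chunking idea described above, here is a minimal sketch in Python. It is not the code from the linked repository; the function names `chunk_text` and `split_words`, the `max_chars` parameter, and the naive regex-based sentence splitting are all assumptions made for illustration. It packs whole sentences into chunks up to a size limit and falls back to word boundaries only when a single sentence is longer than the limit.

```python
import re


def split_words(sentence, max_chars):
    """Split one over-long sentence into pieces on word boundaries (hypothetical helper)."""
    parts, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        # A single word longer than max_chars is kept whole rather than cut.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def chunk_text(text, max_chars=500):
    """Group sentences into chunks of at most max_chars characters (hypothetical helper)."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Only sentences that exceed the limit on their own are split further.
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    for chunk in chunk_text(sample, max_chars=30):
        print(repr(chunk))
```

The regex split will mishandle abbreviations such as "e.g." and text without terminal punctuation, so a real ingestion step would likely want a more careful sentence splitter.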

lineality avatar Feb 11 '24 20:02 lineality

Plain text files, such as markdown or config files, would be really useful, as then I could feed a README to the model

RichardoC avatar Apr 23 '24 13:04 RichardoC

Yes, this is the most commonly requested feature and we're well aware of it. Improving RAG is scheduled on our roadmap.


I'll mark this one as a duplicate.

imtuyethan avatar Jul 02 '24 17:07 imtuyethan

#739 is the issue they're tracking this under, for anyone looking

RichardoC avatar Jul 04 '24 10:07 RichardoC