jan
feat: users can add epub and txt files for RAG retrieval functions in Jan
Problem Retrieval should accept more types of documents. Users can (from what I can see) only add PDF documents for Retrieval. This is not ideal for various reasons. "PDF" is very much not a standardized text storage format, and perhaps no PDF application works broadly for text extraction on even most PDFs. Epub is highly organized, standardized, and readable, and txt and other plain-text documents such as .md should be easy to use with RAG (LLMs have no problem with markdown that I have seen). Even raw html might work. Most of a user's documents, from personal finance to books in the humble-bundle collection, are not in usable PDF format. (See the provided code for epub text extraction below.)
Success Criteria Users should be able to perform retrieval functions on any common standardized files they have (which oddly may exclude PDF, which isn't standardized). txt, md, and even epub should be low-hanging fruit; maybe docx, .odf, and rtf, too.
Additional context For epub files:
Hopefully code like this (see most recent version) https://github.com/lineality/epub_ingestion_python/blob/main/epub_injestion_jsonl_txt_sized_chunks_v21.py
will be useful in letting users of Jan use epub-books with their Jan Retrieval uses.
This code extracts text from epub books and exports the text into a variety of formats: txt, json, jsonl, and can chunk to specific sizes without cutting words or sentences in half, to better retain meaning.
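To illustrate the "chunk to a size without cutting words or sentences" idea, here is a minimal sketch in plain Python. It is not taken from the linked script or from Jan; the function names (`split_sentences`, `split_on_words`, `chunk_text`) and the `max_chars` parameter are hypothetical, and the sentence splitter is a naive regex that will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence split: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_on_words(sentence: str, max_chars: int) -> list[str]:
    """Fallback for a sentence longer than max_chars: split at word boundaries.

    A single word longer than max_chars is kept whole, so a piece may
    exceed the limit rather than cut a word in half.
    """
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = split_on_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six! Seven eight nine? Ten."
    for chunk in chunk_text(sample, max_chars=30):
        print(repr(chunk))
```

The same function would apply unchanged to text pulled from txt or md files, since it only needs a plain string as input.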
Plain text files, such as markdown or config files, would be really useful, as then I can feed a README to the model.
Yes, this is the most commonly requested feature and we're well aware of it. Improving RAG is scheduled in our roadmap.
I'll mark this one as a duplicate.
#739 is the issue they're tracking this under, for anyone looking