epic: Jan has document upload per thread (attached document per thread)
Goal
User Stories
- [ ] User story / task 1
- [ ] Clear achievable tasks (not too many)
Success Criteria
Clearly define the criteria that determine the completion of this epic.
Not in Scope
Clarify what's intentionally out of scope or to be handled separately.
Design
Attach links or references to relevant design mockups, wireframes, or UX/UI flows.
Technical Considerations
Document notable engineering decisions, trade-offs, or dependencies.
Appendix
Relevant resources, references, documents, or inspirations that influence this epic.
hey... please start as simple as possible with nomic (as embedder); it will work well, if I'm right. You can find some code here: https://github.com/nomic-ai/gpt4all. It's important that you can set the embedding size and the number of snippets; a fine start is 512 tokens and 4 snippets. Create a collection, plus an option to see my docs: how many words and how many vectors are embedded (maybe live embedding).
After the answer from the model (maybe also an option for before), show the snippets as plain text.
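The flow sketched above (chunk into ~512-token snippets, embed, retrieve the top 4, report word/vector counts) could look roughly like this. The hash-based bag-of-words "embedding" is only a stand-in for a real embedder such as nomic-embed; all names and sizes here are illustrative:

```python
# Toy sketch: chunk documents into ~512-token snippets, "embed" them,
# and retrieve the top-4 for a question. The hash-based bag-of-words
# embedding is a stand-in for a real embedder (e.g. nomic-embed).
import math
from collections import Counter

CHUNK_TOKENS = 512   # requested default chunk size
TOP_K = 4            # requested default number of snippets
DIM = 64             # toy embedding dimension

def chunk(text, size=CHUNK_TOKENS):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text, dim=DIM):
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class Collection:
    """Holds snippets plus the stats the UI should show (words, vectors)."""
    def __init__(self):
        self.snippets, self.vectors = [], []

    def add_document(self, text):
        for snip in chunk(text):
            self.snippets.append(snip)
            self.vectors.append(embed(snip))

    def stats(self):
        return {"words": sum(len(s.split()) for s in self.snippets),
                "vectors": len(self.vectors)}

    def query(self, question, k=TOP_K):
        q = embed(question)
        ranked = sorted(zip(self.vectors, self.snippets),
                        key=lambda pair: cosine(q, pair[0]), reverse=True)
        return [snip for _, snip in ranked[:k]]
```

`stats()` is what a "live embedding" view would poll to show word and vector counts, and `query()` returns the plain-text snippets that could be shown before or after the model's answer.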
Next step: you can set, for every LLM model, embedding model, and document, your own settings (system prompt, tokens, ... everything). Take a look at AnythingLLM; they create workspaces, one per set of settings.
Next step: implement more embedder models. BERT-based ones work more or less the same way, but there are others: Qwen, Jina, and some ranking embedders...
And one nice option from LM Studio (not open source), among other things: they have a User / Power User / Developer switch (a button at the bottom; the layout and the available settings change instantly).
One option for embedding mode, "chat" or "query": either more free-form chat, or force the model to keep all answers grounded in the doc!
And one option (a small button near the chat) to keep the snippets in VRAM after the first answer, or delete them.
It should be set to delete by default when I ask a new question afterwards...
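A minimal sketch of that toggle, assuming a hypothetical `SnippetCache` kept by the chat: by default a new question clears the cached snippets, and the small button flips `keep_in_vram` to reuse them instead:

```python
# Sketch of the "keep snippets after first answer" toggle.
# Default behavior: a new question deletes the cached snippets.
class SnippetCache:
    def __init__(self):
        self.keep_in_vram = False  # default: delete on a new question
        self._snippets = []

    def store(self, snippets):
        self._snippets = list(snippets)

    def on_new_question(self):
        """Called when the user asks a new question."""
        if not self.keep_in_vram:
            self._snippets.clear()   # free the (V)RAM copy
        return list(self._snippets)
```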
Maybe in the next-next step, an option to send prefixes to the embedder: https://www.youtube.com/watch?v=76EIC_RaDNw
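Prefixes here means that prefix-trained embedders (the nomic-embed family, for example, uses `search_query: ` and `search_document: `) expect queries and documents to be marked differently; this also maps onto the "chat" vs "query" option above. A small sketch, with the mode names as assumptions:

```python
# Sketch of sending task prefixes to a prefix-trained embedder.
# The prefix strings follow the nomic-embed convention; the role
# names are illustrative.
PREFIXES = {
    "document": "search_document: ",  # used when ingesting snippets
    "query": "search_query: ",        # used when embedding the question
}

def with_prefix(text, role):
    """Prepend the task prefix expected by a prefix-trained embedder."""
    return PREFIXES[role] + text
```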
And re-ranker models (but I don't know what the difference is); a collection I have here: https://huggingface.co/kalle07/embedder_collection
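The difference, roughly: an embedder (bi-encoder) encodes the query and each snippet independently and compares vectors, which is fast but coarse; a re-ranker (cross-encoder) reads the query and a candidate snippet together and scores the pair, which is slower but more accurate, so it is usually run only on the top candidates from the first pass. A toy two-stage sketch, where a word-overlap score stands in for a real cross-encoder:

```python
# Toy two-stage retrieval: a cheap first pass picks candidates, then a
# "re-ranker" rescores each (query, snippet) pair jointly. The overlap
# scores below are stand-ins for real embedder / cross-encoder models.
def first_pass(query, snippets, k=20):
    # cheap candidate selection: count query words present in the snippet
    qwords = set(query.lower().split())
    scored = sorted(snippets,
                    key=lambda s: len(qwords & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def rerank(query, candidates, k=4):
    # stand-in for a cross-encoder scoring the full (query, snippet) pair
    qwords = set(query.lower().split())
    def score(snippet):
        return len(qwords & set(snippet.lower().split())) / (len(qwords) or 1)
    return sorted(candidates, key=score, reverse=True)[:k]
```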
Thank you for sharing this, @kalle07. It's truly helpful.
@louis-menlo -- I started working on a simple document parsing implementation with a bit of RAG using llamaindex.ts and LanceDB on a separate branch based on the Tauri branch. This is mainly for enabling users to add documents, images, videos and audio while having it all in a db.lance file for storing and querying as users need it.
@dan-menlo mentioned you have specific plans for implementing this via MCP. Can I confirm with you whether you would like me to add support for the things I mentioned or remove this as it will be done in a different way?
cc @ramonpzg likely will drive this.
I don't know if this is on you... but preparing the input better makes the output better ;) A PDF simply saved as txt often gives really bad results, at least for anything with non-simple structure...
For PDF I have checked some libs: pdfplumber, fitz (PyMuPDF), camelot... there are a lot, but they're all purely code-based; I don't think that's the future, so we need a parser with a plus ;)
I think docling (open source) is a good choice: https://github.com/docling-project/docling
OK, it needs to download ~1 GB (model), I think, and maybe more for some other options, but the standard setup is quite OK. The Python code is simple... ~10 lines.
- you get well-formatted txt
- options for well-formatted tables and images... at the moment I'm trying to get txt plus JSON-formatted tables in one file (best for LLMs); getting image descriptions for images in PDFs, maybe later...
It gives a lot of options to parse a PDF to txt... in Python the hardest part left is just saving it as a txt file.
It has a lot of dependencies; with OCR the quality is great, but that's maybe a step too far...
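A hedged sketch of the ~10-line docling call mentioned above, assuming the `DocumentConverter` API from the docling project's README (the import is lazy since the first run triggers the ~1 GB model download):

```python
# Sketch of PDF -> well-formatted text with docling. Assumes the
# DocumentConverter API from the docling project; the path handling
# is illustrative and the models download on first use.
def pdf_to_markdown(pdf_path):
    from docling.document_converter import DocumentConverter  # pip install docling
    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()
```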
My approach with pdfplumber: save the txt layout and the tables in JSON format in one file... nice feature: multiprocessing.
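That pdfplumber approach could look roughly like this; `page_to_record` combines each page's layout text with its extracted tables (serialized as JSON), and the function names and output layout are illustrative:

```python
# Sketch of the pdfplumber approach: layout text plus tables (as JSON)
# per page, written into one txt file. Names here are illustrative.
import json

def page_to_record(page):
    """Combine one page's layout text and its extracted tables."""
    return {"text": page.extract_text() or "",
            "tables": page.extract_tables() or []}

def pdf_to_txt(pdf_path, out_path):
    import pdfplumber  # third-party: pip install pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        records = [page_to_record(p) for p in pdf.pages]
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(rec["text"] + "\n")
            if rec["tables"]:
                f.write(json.dumps(rec["tables"]) + "\n")
```

Since each page is parsed independently, `page_to_record` is also the natural unit to fan out with `multiprocessing` for large PDFs.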
have fun ^^
Awesome! Thanks @kalle07. cc @ramonpzg
Alternatively, if we want to go with Python, I suggest we take a look at PyMuPDF, which provides faster processing times.
thanks a lot for the sample code @kalle07 💯
;) fitz / PyMuPDF is not that good with tables... pdfplumber can at least read some layouts, though not all at once; it depends on the table...
Update:
- Adding simple RAG ingestion mechanism
- There will be an MCP server to query - agentic flow
Update:
- Finished the ingestion flow.
- Have the MCP server work with ingested data.
- Need to glue the UI with these functions.
@samhvw8 to add architecture design
Architecture Diagram
Sequence diagram
- It worked with remote providers.
- Updating to work with local embedding model.
- Lancedb works with rust part.
- We built all of the required tools within the app -> prebuilt tools.
- Create a new assistant to work with document upload better?
UPDATE:
- Got blocked on the local embedding model (Louis + Akarshan will work on it ASAP).
- @urmauur to help on the UI. Simplify the flow.
Hello! Just an idea: maybe it's possible to integrate the open-source memory layer mem0? https://github.com/mem0ai/mem0 - I'm working with it and can say that it works perfectly! Can you check it, please?
hi @t1m3c thanks for the idea. @louis-menlo @samhvw8 will take a look at it soon 👍
Hi! You're welcome! Also please check this open-source knowledge graph engine; it looks like it outperforms mem0. I'm testing it right now. https://github.com/getzep/graphiti
Hello! I apologize if I missed the documentation somewhere. How do I upload a file in this new version, 0.6.4?
Here I talked with him for a few lines... he has made a far more advanced concept for future RAG (I don't get it all), but some of the ideas are very great:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Closed in place of parents
@Yip-Jia-Qi please use this as a reference from the community as well, and link it to your active issue.