
epic: Jan has document upload per thread (attached document per thread)

Open dan-menlo opened this issue 9 months ago • 20 comments

Goal

User Stories

  • [ ] User story / task 1
  • [ ] Clear achievable tasks (not too many)

Success Criteria

Clearly define the criteria that determine the completion of this epic.

Not in Scope

Clarify what's intentionally out of scope or to be handled separately.

Design

Attach links or references to relevant design mockups, wireframes, or UX/UI flows.

Technical Considerations

Document notable engineering decisions, trade-offs, or dependencies.

Appendix

Relevant resources, references, documents, or inspirations that influence this epic.

dan-menlo avatar Mar 20 '25 05:03 dan-menlo

hey... please start as simple as possible with Nomic (as embedder) and it will work well, if I'm right. You can find some code here: https://github.com/nomic-ai/gpt4all. It's important that you can set the embedding size and the number of snippets; a fine start is 512 tokens and 4 snippets. Create a collection, plus an option to see my docs, how many words and how many vectors were embedded (maybe live embedding).
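A minimal sketch of the flow described above, assuming the nomic embedding model is loaded through sentence-transformers (the gpt4all repo linked above ships its own bindings); the 512-token chunks are approximated by word count and all names are illustrative, not Jan's implementation:

```python
# Rough sketch: chunk -> embed -> retrieve top-k snippets.
# Assumes: pip install sentence-transformers einops numpy
import numpy as np
from sentence_transformers import SentenceTransformer

CHUNK_TOKENS = 512   # snippet size suggested above (approximated by words here)
TOP_K = 4            # number of snippets suggested above

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def chunk(text: str, size: int = CHUNK_TOKENS) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_collection(doc_text: str) -> dict:
    snippets = chunk(doc_text)
    vectors = model.encode(["search_document: " + s for s in snippets])
    # "see my docs": word count and number of embedded vectors
    return {"snippets": snippets, "vectors": np.asarray(vectors),
            "words": len(doc_text.split()), "n_vectors": len(snippets)}

def retrieve(collection: dict, question: str, k: int = TOP_K) -> list[str]:
    q = model.encode("search_query: " + question)
    v = collection["vectors"]
    scores = v @ q / (np.linalg.norm(v, axis=1) * np.linalg.norm(q) + 1e-9)
    return [collection["snippets"][i] for i in np.argsort(-scores)[:k]]
```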

After the model answers (or maybe also before), an option to show the retrieved snippets as plain text.

Next step: let you configure every LLM model, embedding model and document with your own settings (system prompt, tokens, ... everything). Take a look at AnythingLLM; they create workspaces, one per set of settings.

Next step: implement more embedder models. BERT-based ones work more or less the same way, but there are others too: Qwen, Jina and some ranking embedders...

And one nice option from LM Studio (not open source): among other things they have a User / Power User / Developer switch (a button at the bottom that instantly changes the layout and exposes more settings).

kalle07 avatar Mar 23 '25 16:03 kalle07

One option for the embedding mode: "chat" or "query", i.e. more free chat versus forcing the model to take the whole answer from the document!

And one option (a small button near the chat) to keep the snippets in VRAM after the first answer, or delete them.

It should default to deleting them when I ask a new question afterwards...

kalle07 avatar Mar 28 '25 08:03 kalle07

Maybe in a later step, an option to send prefixes to the embedder: https://www.youtube.com/watch?v=76EIC_RaDNw

And re-ranker models (though I don't know what the difference is); I have one here: https://huggingface.co/kalle07/embedder_collection
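On the difference: an embedder (bi-encoder) encodes the query and each snippet independently and compares vectors, which is fast and used for the first retrieval pass; a re-ranker (cross-encoder) reads the query and each candidate snippet together and re-scores the shortlist, slower but usually more accurate. A minimal sketch of re-ranking already-retrieved snippets, assuming the sentence-transformers CrossEncoder class; the model name is only an example:

```python
# Re-rank the top snippets from the embedding search with a cross-encoder.
# Assumes: pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def rerank(question: str, snippets: list[str], keep: int = 4) -> list[str]:
    # The cross-encoder scores each (question, snippet) pair jointly.
    scores = reranker.predict([(question, s) for s in snippets])
    ranked = sorted(zip(scores, snippets), key=lambda p: p[0], reverse=True)
    return [s for _, s in ranked[:keep]]
```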

kalle07 avatar Apr 11 '25 17:04 kalle07

Thank you for sharing this, @kalle07. It's truly helpful.

louis-jan avatar Apr 21 '25 07:04 louis-jan

@louis-menlo -- I started working on a simple document parsing implementation with a bit of RAG using llamaindex.ts and LanceDB on a separate branch based on the Tauri branch. This is mainly for enabling users to add documents, images, videos and audio while having it all in a db.lance file for storing and querying as users need it.
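For context, a rough Python sketch of the LanceDB side of this idea (the branch itself uses llamaindex.ts; the table and field names here are made up for illustration, not taken from that code):

```python
# Store embedded snippets in a Lance table and query them back.
# Assumes: pip install lancedb
import lancedb

db = lancedb.connect("./jan-data")   # directory holding the .lance tables

def ingest(snippets: list[str], vectors: list[list[float]]):
    rows = [{"text": s, "vector": v, "kind": "document"}
            for s, v in zip(snippets, vectors)]
    return db.create_table("thread_docs", data=rows, mode="overwrite")

def query(table, query_vector: list[float], k: int = 4) -> list[str]:
    hits = table.search(query_vector).limit(k).to_list()
    return [h["text"] for h in hits]
```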

@dan-menlo mentioned you have specific plans for implementing this via MCP. Can I confirm with you whether you would like me to add support for the things I mentioned or remove this as it will be done in a different way?

ramonpzg avatar Apr 21 '25 14:04 ramonpzg

cc @ramonpzg likely will drive this.

louis-jan avatar Apr 28 '25 02:04 louis-jan

I don't know if this is on you... but preparing the input well gives better output ;) Taking a PDF and simply saving it as txt often gives really bad results, at least for anything with a non-trivial structure...

For PDF I have checked some libraries: pdfplumber, fitz (PyMuPDF), camelot ... there are a lot, but they are purely code-based; I don't think that is the future, so we need a parser with something extra ;)

I think docling (open source) is a good choice: https://github.com/docling-project/docling

Okay, it needs to download ~1 GB (a model), I think, and maybe more for some other options, but the standard setup is quite okay. The Python code is simple ... ~10 lines (see the sketch below the list).

  • you get well-formatted txt
  • options for well-formatted tables and images ... at the moment I'm trying to get txt and JSON-format tables in one file (best for the LLM); getting image descriptions for images in the PDF can maybe come later ...
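A minimal docling sketch along the lines described above (the model weights are downloaded on first run, which is where the ~1 GB comes from):

```python
# Convert a PDF to LLM-friendly markdown (tables become markdown tables).
# Assumes: pip install docling
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("manual.pdf")          # local path or URL
with open("manual.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
```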

kalle07 avatar May 01 '25 20:05 kalle07

There are a lot of options for parsing a PDF to txt in Python... the worst is to just save it as a plain txt file.

A lot of the dependencies involve OCR; great quality, but maybe a step too far...

My approach is pdfplumber: save the layout text and the tables in JSON format in one file ... nice feature: multiprocessing.

have fun ^^

plumber_example3_multicore.zip
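The attached zip presumably contains the full version; a minimal sketch of the approach described above (layout-preserving text plus tables as JSON in one output file, one worker per page), assuming pdfplumber:

```python
# pdfplumber: layout text + tables (as JSON lines) in one txt file, multi-core.
# Assumes: pip install pdfplumber
import json
from multiprocessing import Pool
import pdfplumber

def extract_page(args):
    path, index = args
    # pdfplumber objects are not picklable, so each worker reopens the file.
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[index]
        text = page.extract_text(layout=True) or ""
        tables = page.extract_tables()
    return text, tables

def convert(path: str, out_path: str, workers: int = 4) -> None:
    with pdfplumber.open(path) as pdf:
        n_pages = len(pdf.pages)
    with Pool(workers) as pool:
        pages = pool.map(extract_page, [(path, i) for i in range(n_pages)])
    with open(out_path, "w", encoding="utf-8") as out:
        for text, tables in pages:
            out.write(text + "\n")
            for table in tables:
                out.write(json.dumps(table) + "\n")

if __name__ == "__main__":
    convert("manual.pdf", "manual.txt")
```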

kalle07 avatar May 07 '25 16:05 kalle07

Awesome! Thanks @kalle07. cc @ramonpzg

louis-jan avatar May 07 '25 16:05 louis-jan

Alternatively, if we want to go with Python, I suggest we take a look at PyMuPDF, which provides faster processing. Thanks a lot for the sample code @kalle07 💯
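A minimal PyMuPDF sketch of the same text-extraction step (plain text only, no table handling, as noted in the reply below):

```python
# PyMuPDF: fast plain-text extraction, page by page.
# Assumes: pip install pymupdf
import fitz  # PyMuPDF

def pdf_to_text(path: str) -> str:
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

print(pdf_to_text("manual.pdf")[:500])
```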

david-menloai avatar May 07 '25 16:05 david-menloai

;) fitz / PyMuPDF is not that good with tables ... pdfplumber can at least read some layouts, though not all at once; it depends on the table ...

kalle07 avatar May 07 '25 16:05 kalle07

Update:

  • Adding a simple RAG ingestion mechanism
  • There will be an MCP server to query the ingested data (agentic flow); a rough sketch is below
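A rough sketch of the shape such an MCP tool could take, using the MCP Python SDK purely for illustration (Jan's actual server lives inside the app; the snippet store and scoring here are stand-ins, not the real implementation):

```python
# Illustrative only: an MCP tool that returns snippets from previously ingested docs.
# Assumes: pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("document-retrieval")

# Stand-in for the real ingested store (the actual flow queries a vector database).
SNIPPETS = [
    "Jan runs local models and can also call remote providers.",
    "Uploaded documents are chunked and embedded at ingestion time.",
]

@mcp.tool()
def query_documents(query: str, top_k: int = 4) -> list[str]:
    """Return up to top_k ingested snippets relevant to the query."""
    # Toy scoring by shared words; the real tool would do a vector search.
    q = set(query.lower().split())
    ranked = sorted(SNIPPETS,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    mcp.run()
```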

louis-jan avatar May 26 '25 02:05 louis-jan

Update:

  • Finished the ingestion flow.
  • Got the MCP server working with the ingested data.
  • Need to glue the UI to these functions.

louis-jan avatar May 28 '25 02:05 louis-jan

@samhvw8 to add architecture design

louis-jan avatar May 28 '25 02:05 louis-jan

Architecture diagram (image attached)

Sequence diagram (image attached)

samhvw8 avatar May 28 '25 08:05 samhvw8

  • It worked with remote providers.
  • Updating it to work with a local embedding model.
  • LanceDB works with the Rust part.
  • We built all of the required tools within the app -> prebuilt tools.
  • Create a new assistant to work better with document upload?

louis-jan avatar Jun 04 '25 02:06 louis-jan

UPDATE:

  • Got blocked on the local embedding model (Louis + Akarshan will work on it asap).
  • @urmauur to help on the UI. Simplify the flow.

louis-jan avatar Jun 16 '25 02:06 louis-jan

Hello! Just an idea: maybe it's possible to integrate the open-source memory layer mem0? https://github.com/mem0ai/mem0 - I'm working with it and can say that it works perfectly! Can you check it please?
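For reference, a minimal sketch following the shape of mem0's documented quickstart (Memory() with its default config expects an LLM/embedding backend to be configured, e.g. an API key; the identifiers here are illustrative):

```python
# mem0 quickstart shape: add memories per thread/user, then search them later.
# Assumes: pip install mem0ai, plus a configured LLM/embedding backend
from mem0 import Memory

m = Memory()

# Remember something from the current thread
m.add("The user uploaded a PDF about battery chemistry.", user_id="thread-123")

# Later: pull relevant memories to prepend to the prompt
results = m.search("what was the uploaded document about?", user_id="thread-123")
hits = results["results"] if isinstance(results, dict) else results
for hit in hits:
    print(hit.get("memory", hit))
```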

t1m3c avatar Jun 21 '25 08:06 t1m3c

hi @t1m3c thanks for the idea. @louis-menlo @samhvw8 will take a look at it soon 👍

david-menloai avatar Jun 22 '25 05:06 david-menloai

Hi! You are welcome! Also please check out this open-source knowledge graph engine; it looks like it outperforms mem0, and I'm testing it right now: https://github.com/getzep/graphiti

t1m3c avatar Jun 22 '25 05:06 t1m3c

Hello, I do apologize if I missed the documentation somewhere. How do I upload a file in this new version 0.6.4?

Hubert21 avatar Jul 13 '25 09:07 Hubert21

I talked with him for a few lines here... he has made a far more advanced concept for future RAG (I don't get all of it), but some of his ideas are very good:

https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md

kalle07 avatar Aug 14 '25 18:08 kalle07

Closed in favor of the parent issue(s).

LazyYuuki avatar Aug 15 '25 08:08 LazyYuuki

@Yip-Jia-Qi please also use this as a reference for community input, and link it to your active issue.

LazyYuuki avatar Sep 19 '25 01:09 LazyYuuki