Feature Request: Vector DB Support & Python API Enhancement

Open NDA-Github opened this issue 1 year ago • 8 comments

First of all, thank you!

I want to start by thanking you for this amazing library. wdoc takes RAG to another level with its powerful features, great documentation and overall thoughtful implementation. The way it handles document processing, querying and summarization is really impressive.

Feature requests

I have two suggestions that could make wdoc even more versatile:

1. Support for vector databases

It would be great to have the option to store embeddings in vector databases like ChromaDB or Pinecone. This would allow:

  • Better scalability for large document collections
  • Persistence of embeddings across sessions
  • Potential for distributed deployments
  • Real-time updates to the document collection
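As a rough illustration of the persistence and update points above, here is a minimal sketch using only the Python standard library. `TinyVectorStore` and its methods are hypothetical names invented for this example; they are not part of wdoc, ChromaDB or Pinecone, and a real vector database would replace the brute-force search with an index.

```python
import json
import math
import sqlite3


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical stand-in for a vector DB, backed by a SQLite file."""

    def __init__(self, path=":memory:"):
        # A file path makes the embeddings survive across sessions.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS vectors "
            "(doc_id TEXT PRIMARY KEY, embedding TEXT)"
        )

    def upsert(self, doc_id, embedding):
        # INSERT OR REPLACE lets an existing document be updated in place.
        self.conn.execute(
            "INSERT OR REPLACE INTO vectors VALUES (?, ?)",
            (doc_id, json.dumps(list(embedding))),
        )
        self.conn.commit()

    def query(self, embedding, k=1):
        # Brute-force scan: score every stored vector, return the k best ids.
        rows = self.conn.execute("SELECT doc_id, embedding FROM vectors").fetchall()
        scored = [(_cosine(embedding, json.loads(raw)), doc_id) for doc_id, raw in rows]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]
```

Opening the store a second time with the same file path returns the vectors written earlier, which is the behaviour the "persistence across sessions" bullet asks for.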

2. Python API for easier integration

While the CLI interface is great, having a proper Python API would make it easier to integrate wdoc into other applications. For example:

# Proposed interface (illustrative only)
from wdoc import WDoc

wdoc = WDoc()
db = ...  # placeholder: any vector DB client

# Embedding
embeddings = wdoc.create_embeddings(
    documents=["doc1.pdf", "doc2.pdf"],
    model="openai/text-embedding-3-small",
    db=db,
)

# Query
response = wdoc.query(
    query="What is the main topic?",
    documents=embeddings,
)

This would make it simpler to:

  • Use wdoc as a library in other Python projects
  • Chain operations programmatically
  • Customize the workflow for specific use cases
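To make "chain operations programmatically" a bit more concrete, here is a small sketch of how a pluggable `db` argument could be typed so that any backend can be swapped in. Every name here (`VectorStore`, `InMemoryStore`, `toy_embed`, `index_documents`, `ask`) is an assumption made up for the example and says nothing about how wdoc is or will be implemented; `toy_embed` just counts vowels and is not a real embedding model.

```python
from typing import Callable, List, Mapping, Protocol, Sequence


class VectorStore(Protocol):
    """Hypothetical interface that a `db` argument could satisfy."""

    def add(self, doc_id: str, embedding: Sequence[float]) -> None: ...

    def nearest(self, embedding: Sequence[float]) -> str: ...


class InMemoryStore:
    """Toy backend; a ChromaDB or Pinecone adapter could expose the same two methods."""

    def __init__(self) -> None:
        self._items = {}

    def add(self, doc_id: str, embedding: Sequence[float]) -> None:
        self._items[doc_id] = list(embedding)

    def nearest(self, embedding: Sequence[float]) -> str:
        # Smallest squared Euclidean distance wins.
        def dist(doc_id: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(self._items[doc_id], embedding))

        return min(self._items, key=dist)


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: vowel counts. A real model would go here."""
    return [float(text.lower().count(c)) for c in "aeiou"]


Embedder = Callable[[str], List[float]]


def index_documents(texts: Mapping[str, str], embed: Embedder, db: VectorStore) -> VectorStore:
    """Embed every document and store it; returns the store so calls can be chained."""
    for doc_id, text in texts.items():
        db.add(doc_id, embed(text))
    return db


def ask(question: str, embed: Embedder, db: VectorStore) -> str:
    """Return the id of the stored document closest to the question."""
    return db.nearest(embed(question))
```

Because the functions only rely on the two-method protocol, the caller decides which store to pass in, and the embedding and query steps can be composed from ordinary Python code.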

Let me know if you'd like me to elaborate on any of these suggestions. Thanks again for this great tool!

NDA-Github • Nov 26 '24 07:11

(I just want you to know that I saw your message right away but don't have the time to fully reply yet!)

thiswillbeyourgithub • Nov 27 '24 22:11

Hi!

First of all, thank you for your kind words. Can I ask where you heard about wdoc, and what you use it for?

Second, I'm finally done with a side project that updates the TODO list in the readme, so you can take a look at what I envision.

I had already noticed that the Python API was pretty bad, but fixing it requires refactoring quite a lot of entangled code. Because you asked for it, I moved it to the top of my wdoc priorities. Unfortunately, I'll still be awfully busy for quite some time. Would you be interested in contributing to the changes? I could give you pointers!

Regarding supporting other DBs, I'm totally on board with it but it comes after a few other things (notably refactoring the API, and making the openwebui pipeline).

And sorry but I have to ask :) : did you ask an LLM to use flattery to sugar coat a feature request?

thiswillbeyourgithub • Nov 29 '24 18:11

Hi!

Thank you for your feedback. I used it for extracting some financial information from financial reports.

After exploring various approaches, starting with basic RAG and experimenting with different methods, I spent time researching multiple GitHub projects before finding that yours truly stood out.

Thank you for updating the priorities - it's great to see active development and to be able to follow this project even more closely.

I'm interested in contributing to the project. I'd appreciate if we could have a discussion beforehand to ensure I fully understand the direction and requirements.

And yes, I did use an LLM to help formulate this request 😄

Let me know when would be a good time to discuss potential contributions.

Dalavidhy • Dec 01 '24 14:12

Hi @davidalhyar, sorry for the wait!

Thanks for sharing your use case with financial reports - it's great to see wdoc being used in such practical applications!

Regarding contributions, I'd love to have your help. Here's the current refactoring roadmap, with tasks that need to be completed in this specific order:

  1. ~~Write unit tests for core features - this will serve as a safety net for the refactoring~~ Done, but the tests still need to be made comprehensive.
  2. Reorganize the codebase:
    • Move query/search code to tasks/query.py
    • Extract argument validation to its own method
    • ~~Split the initialization code to create a cleaner API~~
    • Untangle the current "spaghetti code" state of the wdoc class declaration
    • Stop using the import_mode argument
  3. Verify that critical features still work properly:
    • decorator of the wdoc class, and dynamic docstring
    • ~~the --help flag works~~
    • ~~the USAGE.md file~~
    • ~~the mechanism from init and main that allow calling from cli~~
    • ~~the wdoc_parse_file mechanism~~
    • ~~the README examples~~
    • Finally update the files in scripts to make sure they respect the new api.
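As a rough sketch of what "extract argument validation to its own method" together with a safety-net test could look like, consider the toy class below. `Pipeline`, `ALLOWED_TASKS`, `_validate_args` and the `task`/`top_k` parameters are hypothetical and chosen only for illustration; this is not wdoc's actual code or argument list.

```python
class Pipeline:
    """Hypothetical class used only to illustrate the refactoring."""

    ALLOWED_TASKS = ("query", "summarize")

    def __init__(self, task: str, top_k: int = 5) -> None:
        # All checks live in one method, so __init__ only wires state together.
        self._validate_args(task=task, top_k=top_k)
        self.task = task
        self.top_k = top_k

    @classmethod
    def _validate_args(cls, task: str, top_k: int) -> None:
        """Raise ValueError early, before any expensive work starts."""
        if task not in cls.ALLOWED_TASKS:
            raise ValueError(f"task must be one of {cls.ALLOWED_TASKS}, got {task!r}")
        if not isinstance(top_k, int) or top_k < 1:
            raise ValueError(f"top_k must be a positive integer, got {top_k!r}")


def test_rejects_unknown_task():
    # Written in plain-assert style so it runs under pytest or on its own.
    try:
        Pipeline(task="translate")
    except ValueError as err:
        assert "task" in str(err)
    else:
        raise AssertionError("expected ValueError for an unknown task")


def test_accepts_valid_arguments():
    pipeline = Pipeline(task="query", top_k=3)
    assert (pipeline.task, pipeline.top_k) == ("query", 3)
```

Keeping validation in a single method means tests like these can pin down the accepted arguments before the surrounding code is reorganized, which is the "safety net" role described in step 1.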

Only after these steps are completed can we properly implement the vector DB support you suggested. Would you be interested in helping with any of these specific tasks? We could start with the unit tests, as they're relatively self-contained.

Let me know which part interests you most, and I can provide more detailed technical guidance. We could use GitHub Discussions for the technical deep-dive if you prefer.

As a medical student, my keyboard time is limited and I'm currently spread pretty thin across my other projects, so helping out is a sure way to get this much sooner than if you wait for me :)

thiswillbeyourgithub • Dec 04 '24 23:12

Addendum: something else I should do, but haven't taken the time to learn, is to create a readthedocs website for the documentation. Do you have experience with that?

thiswillbeyourgithub • Dec 05 '24 08:12

> Addendum: also something I should do but haven't taken the time to learn is to create a readthedocs website for the documentation. Have you experience with that?

Well, actually it was simpler than I thought, so never mind.

thiswillbeyourgithub • Dec 05 '24 12:12

(Update: I wrote some pytest tests, but they still need to be made comprehensive.)

thiswillbeyourgithub • Dec 05 '24 15:12

Update: one thing led to another, and I ended up substantially improving both the docs and the API. There's still work to be done, though.

You can take a look at the page walkthrough and examples to see how to use wdoc.

thiswillbeyourgithub • Feb 19 '25 20:02