markdown-file-query
Semantic QA with a markdown database: query any markdown file using vector embeddings, the Pinecone vector database and GPT (via langchain). A weaker version of privateGPT.
This project currently works best with English documents.
About This Project
This project
- utilizes the Pinecone vector database (VDB) and OpenAI's embedding model to turn text into vectors.
- works with any `.md` file, so it works well with Notion & Obsidian (though for Notion you have to export to `.md` manually first).
- is the author's practice of the Feynman technique.
- is probably a weaker duplicate of privateGPT and llama_index. If you want a beautifully crafted document query program, you should use llama_index instead of this toy.
Walkthrough of this Program
- Each markdown file in the target directory is cut into many small chunks using `langchain.text_splitter`.
- Each chunk is turned into a vector via OpenAI's embedding model (`langchain.embeddings.OpenAIEmbeddings`).
- The vectors are then uploaded to the Pinecone vector database.
- Queries are also converted to vectors using the same embedding model and sent to Pinecone.
- To retrieve search results, Pinecone compares the query vector with the vectors in the database (by cosine similarity).
- The closest 3 results are retrieved and fed into GPT-3 along with the question, and GPT-3 generates an answer in natural language.
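The walkthrough above can be illustrated with a small, dependency-free sketch. It is not the code in `main.py`: the bag-of-words `toy_embed` function is only a stand-in for OpenAI's embedding model, the in-memory list stands in for the Pinecone index, and all function names here are hypothetical. What it does show is the same shape of pipeline: chunk, embed, rank by cosine similarity, keep the top 3, and build a prompt from them.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Cut text into pieces of at most chunk_size characters, breaking on whitespace."""
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt for the LLM."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

In the real program the ranking step is done by Pinecone on the server side, so only the query vector travels over the network and only the 3 closest chunks come back.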
TODO
- [ ] add a `--help` option
- [ ] deploy to Streamlit
Getting Started
Prerequisites
- Prepare your Pinecone and OpenAI API keys.
- Export the Pinecone and OpenAI API keys to the system environment:
    export PINECONE_API_KEY="your_pinecone_api_key"
    export OPENAI_API_KEY="your_openai_api_key"
- Now, in Python, check that they have been exported to the system environment:
    import os
    os.environ["PINECONE_API_KEY"]
    os.environ["OPENAI_API_KEY"]
  If this raises a `KeyError`, restart the terminal (and your IDE if you are using one) after exporting the keys, then check again.
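The check described above can also be wrapped in a small helper that reports every missing key at once instead of stopping at the first `KeyError`. This is a minimal sketch and not part of the repository; the function name `missing_keys` is hypothetical.

```python
import os

# The two variable names used in the export step above.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
print(missing_keys({"PINECONE_API_KEY": "abc"}))  # ['OPENAI_API_KEY']
```

Calling `missing_keys()` with no argument inspects the real environment; an empty list means both keys are visible to Python.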
Installation
- Clone this repo to your local machine
    git clone https://github.com/madeyexz/markdown-file-query.git
- Install the dependencies
    pip install pinecone langchain tqdm
Usage
- Prepare the markdown file(s) and put them in a `FOLDER` (or any name you like, but you have to change the code accordingly). Note that this folder should be in the same directory as `main.py`.
- If this is your first time querying a certain document, run the `main.py` program
    python3 main.py "PATH_OF_FOLDER" "QUESTION"
- The query results and the reference GPT used to generate the answer will be saved in `answer.txt` and `contents.txt` respectively.
- If you want to query the same batch of documents again, run `query_only.py` to avoid re-embedding the documents.
    python3 query_only.py "QUESTION"
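How `main.py` actually reads its two arguments is not shown in this README. As a sketch of one way to do it, assuming two positional arguments, the standard-library `argparse` module would also provide the `--help` option from the TODO list for free. The parser below and the argument names `folder` and `question` are illustrative, not the repository's code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two positional arguments used in the commands above."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Embed the markdown files in a folder and answer a question about them.",
    )
    parser.add_argument("folder", help="path of the folder containing the .md files")
    parser.add_argument("question", help="question to ask about the documents")
    return parser


# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments so that sys.argv is used instead.
args = build_parser().parse_args(["markdown_database", "what's the strange situation"])
print(args.folder, "|", args.question)
```

With a parser like this, `python3 main.py --help` prints the description and both argument descriptions without any extra code.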
Example
- I have a folder called `markdown_database` which contains a bunch of `.md` files, and I want to query this database with the question "What's the strange situation".
    ❯ python3 main.py "markdown_database" "what's the strange situation"
    initiating pinecone index...
    digesting docs...
    uploading datas to pinecone...
     92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 60/65 [00:29<00:02, 1.87it/s]
    let's wait for 60 seconds to avoid RateLimitError... (since im not a paid user))
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00, 1.00s/it]
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65/65 [01:32<00:00, 1.42s/it]
    querying pinecone...
    querying gpt...
    writing results to answer.txt and contents.txt
    done!
    the answer to 'what's the strange situation' is: ' The Strange Situation is a standardized procedure devised by Mary Ainsworth in the 1970s to observe attachment security in children within the context of caregiver relationships. It applies to infants between the age of nine and 18 months and involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. The procedure is used to observe the quality of a young child’s attachment to his or her mother, and can also be applied to other attachment figures, such as God, through the use of Emotionally Focused Therapy (EFT) and religious beliefs, such as the saying “there are no atheists in foxholes”.'
- If I want to query the same database again, I can use `query_only.py` to avoid re-embedding the documents.
    ❯ python3 query_only.py "Who is Mary Ainsworth?"
    connecting to pinecone index...
    getting docs
    querying pinecone...
    querying gpt...
    done!
    the answer to 'Who is Mary Ainsworth?' is: ' Mary Ainsworth was a developmental psychologist who devised the Strange Situation in the 1970s to observe attachment security in children within the context of caregiver relationships. The Strange Situation involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. Ainsworth is also known for her observation that if you want to see the quality of a young child’s attachment to his or her mother, watch what the child does, not when Mother leaves, but when she returns. She is also known for her research on anxious babies and their inability to use their mothers as a secure base.'
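The first log shows the upload pausing for 60 seconds after 60 of the 65 chunks to stay under the rate limit. A sketch of that pattern is below. It is inferred from the log only, not taken from `main.py`, and the function name, the `upload` callback and the injectable `sleep` parameter are all hypothetical.

```python
import time
from typing import Callable, Iterable


def upload_with_pauses(
    items: Iterable[str],
    upload: Callable[[str], None],
    batch_size: int = 60,
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload items one by one, pausing after every full batch.

    Returns the number of pauses taken.
    """
    items = list(items)
    pauses = 0
    for index, item in enumerate(items, start=1):
        upload(item)
        # Pause after each full batch, unless nothing is left to upload.
        if index % batch_size == 0 and index < len(items):
            sleep(pause_seconds)
            pauses += 1
    return pauses
```

Passing `sleep` as a parameter keeps the function testable without actually waiting; a fixed pause is the simplest approach, while retrying with backoff on a rate-limit error would be an alternative.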
Known Limitation
- If you use Pinecone, then whenever you want to query a new document (i.e. create a new database), you should probably create a new Pinecone index (so that you don't get answers from the old document), or delete the old index. This is because Pinecone does not support updating the index (yet).
  To delete the old index:
    python3 delete_pinecone_index.py NAME_OF_INDEX
Acknowledgements
Huge shout-out to the open-source community for providing straightforward examples and comprehensive tutorials!
- openai-cookbook: using vector database for embeddings search
- Build a Personal Search Engine Web App using Open AI Text Embeddings - Avra
- This project is heavily inspired by hwchase17/notion-qa
- Langchain, a Python library for manipulating LLMs elegantly.