text-to-sql
An application that writes and runs SQL queries to answer natural language questions, using LangChain and open-source LLMs through HuggingFace.
Text-to-SQL Copilot
Text-to-SQL Copilot is a tool for users who see SQL databases as a barrier to actionable insights. It takes your natural language question as input, uses a generative text model to write a SQL statement based on your data model, runs that statement on your database, and analyses the results. It does all of this at no cost using the HuggingFace Inference API.
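The question-to-answer flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `answer_question` and `fake_generate_sql` are hypothetical names, the model call is replaced by a hard-coded stand-in, and the demo uses an in-memory SQLite database instead of a Spider database.

```python
import sqlite3
from typing import Callable


def answer_question(
    question: str,
    conn: sqlite3.Connection,
    generate_sql: Callable[[str, str], str],
) -> list:
    """Write a SQL statement for the question, run it, and return the rows."""
    # Collect the CREATE TABLE statements so the generator can see the data model.
    schema = "\n".join(
        row[0]
        for row in conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
        )
    )
    sql = generate_sql(question, schema)
    return conn.execute(sql).fetchall()


def fake_generate_sql(question: str, schema: str) -> str:
    # Hypothetical stand-in for the LLM call; a real model would use both arguments.
    return "SELECT COUNT(*) FROM singer"


# Tiny demo database (made-up data, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO singer (name) VALUES (?)", [("Ana",), ("Ben",)])

rows = answer_question("How many singers are there?", conn, fake_generate_sql)
print(rows)
```

Passing the generator in as a function keeps the database logic testable without a network call; in the real application that slot would be filled by the model accessed through HuggingFace.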
Setup
Dataset
This was built specifically on the Spider dataset. Follow these steps to recreate the setup:
- Download the data from this Google Drive
- Unzip the file
- Save the root 'spider' folder under the src/data/raw/ directory
Setup Process
This application pulls the schema information from the SQLite database files and uses a locally stored Chroma vector database to identify which schema to use to answer a question. Run the following commands to compile the database info and build the vector database:
pip3 install -r requirements.txt
python3 setup.py
This takes about 10-15 minutes to run.
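For a sense of what "pulling the schema information" can involve, here is a minimal sketch that uses only Python's standard `sqlite3` module. It is not the contents of setup.py: the helper names `read_schema` and `collect_schemas` are hypothetical, the `*.sqlite` file pattern is an assumption about how the database files are named, and the step that loads the result into Chroma is left out.

```python
import sqlite3
import tempfile
from pathlib import Path


def read_schema(db_path):
    """Return {table_name: [column names]} for one SQLite file."""
    conn = sqlite3.connect(str(db_path))
    try:
        # sqlite_master lists every table; skip SQLite's internal tables.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        conn.close()


def collect_schemas(root):
    """Map each database file name (without extension) to its schema."""
    return {
        path.stem: read_schema(path)
        for path in sorted(Path(root).rglob("*.sqlite"))
    }


# Demo on a throwaway database so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "concert_singer.sqlite"
    conn = sqlite3.connect(str(db_file))
    conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()
    conn.close()
    schemas = collect_schemas(tmp)

print(schemas)
```

To try it on the downloaded data, point `collect_schemas` at the folder where you saved the dataset, for example `collect_schemas("src/data/raw/spider")`.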
HuggingFace API Token
Currently, this project relies on the Google flan-t5-xxl language model, accessed for free through the HuggingFace Inference API. To use it, create an API token and save it in a .env file in the root of the repo:
touch .env
Open the .env file and enter your HuggingFace API token:
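How the token is read depends on the application code. The sketch below shows one way a KEY=VALUE line in .env can be loaded into the environment. The variable name `HUGGINGFACEHUB_API_TOKEN` is an assumption (it is the name LangChain's HuggingFace Hub integration looks for; the name this project expects may differ), and `load_env_file` is a hypothetical helper written with the standard library only.

```python
import os
from pathlib import Path

# Assumed variable name; check the application code for the one it actually reads.
TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Values already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


load_env_file()
token = os.environ.get(TOKEN_VAR)
print("token found" if token else f"{TOKEN_VAR} is not set")
```

Keep the .env file out of version control so the token is not published with the repo.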
Using SQL Copilot
Navigate to the src/app directory and start the program with the following command:
python3 main.py
Then input your question - happy SQL-ing!
Citation
Chase, H. (2022). LangChain [Computer software]. https://github.com/hwchase17/langchain
Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., ... & Radev, D. (2018). Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887.