pypi-scout
Find Python Packages on PyPI with the help of vector embeddings
✨ Try it out at pypiscout.com ✨
What does this do?
Finding the right Python package on PyPI can be a bit difficult, since PyPI isn't really designed for discovering packages easily. For example, you can search for the word "plot" and get a list of hundreds of packages that contain the word "plot" in seemingly random order.
Inspired by this blog post about finding arXiv articles using vector embeddings, I decided to build a small application that helps you find Python packages with a similar approach. For example, you can ask it "I want to make nice plots and visualizations", and it will provide you with a short list of packages that can help you with that.
How does this work?
The project works by collecting project summaries and descriptions for all packages on PyPI with more than 100 weekly downloads. These are then converted into vector representations using Sentence Transformers. When the user enters a query, it is converted into a vector representation as well, and the most similar package descriptions are fetched from the vector database. The number of weekly downloads is then given additional weight before the results are presented to the user in a dashboard.
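The ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the three-dimensional vectors stand in for real Sentence Transformers embeddings, the package names are made up, and the scoring formula (a weighted blend of cosine similarity and log-scaled weekly downloads, controlled by a hypothetical `download_weight` parameter) is an assumption about how the download weighting might look.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_packages(query_vec, packages, download_weight=0.2, top_k=3):
    """Rank packages by similarity to the query, nudged by popularity.

    ASSUMPTION: the blend below is illustrative only; the real
    weighting used by the project is not described in this README.
    """
    # Log-scale downloads so very popular packages do not dominate.
    max_log = max(math.log10(p["weekly_downloads"]) for p in packages)
    scored = []
    for p in packages:
        similarity = cosine_similarity(query_vec, p["vector"])
        popularity = math.log10(p["weekly_downloads"]) / max_log
        score = (1 - download_weight) * similarity + download_weight * popularity
        scored.append((score, p["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy data: hypothetical packages with hand-written stand-in vectors.
packages = [
    {"name": "plotting-lib", "vector": [0.9, 0.1, 0.0], "weekly_downloads": 1_000_000},
    {"name": "niche-plotting-lib", "vector": [0.9, 0.1, 0.0], "weekly_downloads": 500},
    {"name": "http-client", "vector": [0.0, 0.2, 0.9], "weekly_downloads": 5_000_000},
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for an embedded "make nice plots" query

ranked = rank_packages(query_vec, packages)
print(ranked)
```

In this toy example the two plotting packages have identical descriptions, so the more widely downloaded one ranks first, while the unrelated but very popular package still ranks last.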
Stack
The project uses the following technologies:
- FastAPI for the API backend
- NextJS and TailwindCSS for the frontend
- Sentence Transformers for vector embeddings
Getting Started
Build and Setup
1. (Optional) Create a .env file
By default, all data is stored on your local machine. It is also possible to store the data for the API on Azure Blob Storage and have the API read from there. To do so, create a .env file:
cp .env.template .env
and fill in the required fields.
2. Run the Setup Script
The setup script will:
- Download and process the PyPI dataset and store the results in the data directory.
- Create vector embeddings for the PyPI dataset.
- If the STORAGE_BACKEND environment variable is set to BLOB: Upload the datasets to blob storage.
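As a rough illustration of the conditional upload step, the sketch below decides which steps to run based on the STORAGE_BACKEND environment variable. The function name `plan_setup_steps`, the step labels, and the fallback value `LOCAL` are hypothetical and used only for this example; the real setup script may be organized differently.

```python
import os


def plan_setup_steps(env=None):
    """Return the setup steps to run, in order.

    ASSUMPTION: treating any value other than "BLOB" (including an
    unset variable, shown here as "LOCAL") as local storage is a
    guess, not something this README specifies.
    """
    env = os.environ if env is None else env
    steps = ["download_and_process_dataset", "create_vector_embeddings"]
    if env.get("STORAGE_BACKEND", "LOCAL") == "BLOB":
        steps.append("upload_to_blob_storage")
    return steps


local_steps = plan_setup_steps({})
blob_steps = plan_setup_steps({"STORAGE_BACKEND": "BLOB"})
print(local_steps)
print(blob_steps)
```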
There are three ways to run the setup script, depending on whether you have an NVIDIA GPU and the NVIDIA Container Toolkit installed. Please run the setup script using the method that applies to you:
- Option 1: Using Poetry
- Option 2: Using Docker with NVIDIA GPU and NVIDIA Container Toolkit
- Option 3: Using Docker without NVIDIA GPU and NVIDIA Container Toolkit
[!NOTE] The dataset contains approximately 100,000 packages on PyPI with more than 100 weekly downloads. To speed up local development, you can reduce the number of packages that are processed locally by lowering the value of FRAC_DATA_TO_INCLUDE in pypi_scout/config.py.
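To give a feel for what lowering this setting does, here is a small sketch that keeps a random fraction of a dataset. It assumes FRAC_DATA_TO_INCLUDE is a fraction between 0 and 1; the helper `sample_fraction`, the fixed seed, and the generated package names are hypothetical, and the project may subsample its data in a different way.

```python
import random

# ASSUMPTION: the setting is a fraction in (0, 1]; 0.1 keeps about 10%.
FRAC_DATA_TO_INCLUDE = 0.1


def sample_fraction(rows, frac, seed=0):
    """Return a reproducible random subset containing `frac` of `rows`."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in the interval (0, 1]")
    k = max(1, round(len(rows) * frac))
    return random.Random(seed).sample(rows, k)


rows = [f"package-{i}" for i in range(1000)]  # made-up package names
subset = sample_fraction(rows, FRAC_DATA_TO_INCLUDE)
print(len(subset))
```

With 1,000 rows and a fraction of 0.1, the subset holds 100 packages, so the embedding step has roughly a tenth of the work to do.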
3. Run the Application
Start the application using Docker Compose:
docker-compose up
After a short while, your application will be live at http://localhost:3000.
Data
The dataset for this project is created using the PyPI dataset on Google BigQuery. The SQL query used can be found in pypi_bigquery.sql. The resulting dataset is available as a CSV file on Google Drive.