leaflyer
🍀 Scrapes data from Leafly and collects amazing weed-based meta-data.
Leaflyer: Cannabis Data Scraper
This repository holds all the code needed to run the automation robot that extracts strain-related information from Leafly.
Update: Leafly now uses advanced mechanisms for detecting web scraping (Cloudflare v2 and reCAPTCHA). This project is therefore no longer supported, as continuing would require bypassing their detection with external services.
If you are interested in the most recent Leafly data dump (20th September 2022), please contact [email protected].
Package Guidelines
Installation
Install the required dependencies using:
pip install -r requirements.txt
(Optional) Download the Data
We have already dumped all of Leafly's data and made it available in both .json and .csv formats. Note that some values may be missing, as Leafly's database is incomplete for lesser-known strains.
The dataset and its additional information are available at Kaggle.
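Since the dump may contain missing values, it can help to count them per column before using the data. The following is a minimal sketch using only the standard library; the column names and sample rows are made up for illustration and do not reflect the actual schema of the dump.

```python
import csv
import io
from collections import Counter


def count_missing(rows, missing_markers=("", "null", "None")):
    """Count, per column, how many rows hold an empty or placeholder value."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus fields under the key None; skip them.
                continue
            if value is None or value.strip() in missing_markers:
                missing[column] += 1
    return dict(missing)


# Stand-in for the real dump: these column names and values are assumptions.
SAMPLE_CSV = """name,type,thc
Strain A,hybrid,18
Strain B,,
Strain C,indica,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(count_missing(rows))
```

To run this against the downloaded .csv file, replace the io.StringIO stand-in with a regular open() call on the file's path.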
Usage
Scrape the List of Strains
First, scrape the list of strains (as URLs); the meta-data extraction depends on it. To do so, use the following script:
python scrap_strains_list.py -h
Note that -h prints the script's help message, which describes the available parameters.
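For readers unfamiliar with how a -h helper typically works in Python, here is a small sketch based on argparse, which adds the -h/--help option automatically. This README does not confirm that the scripts use argparse, and the argument names below (output, --pages) are hypothetical, not the scripts' real parameters.

```python
import argparse


def build_parser():
    # Hypothetical parser: these argument names are illustrative only and
    # are NOT the actual parameters of scrap_strains_list.py.
    parser = argparse.ArgumentParser(
        description="Dump a list of strain URLs (illustrative sketch)."
    )
    parser.add_argument("output", help="file to write the URLs to (hypothetical)")
    parser.add_argument(
        "--pages",
        type=int,
        default=1,
        help="number of listing pages to visit (hypothetical)",
    )
    return parser


parser = build_parser()

# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = parser.parse_args(["strains.txt", "--pages", "3"])
print(args.output, args.pages)

# argparse registers -h/--help on its own; this is the text that -h would print.
print(parser.format_help())
```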
Scrape Strains Meta-Data
With the list of strains in hand, it is now possible to extract JSON-like information from every URL. To do so, use the following script:
python scrap_strains_data.py -h
Bash Script
Instead of invoking each script by hand, it is also possible to use the provided shell script as follows:
./pipeline.sh
This script runs every step of the automation process in order. You can also change any of the input arguments defined in the script.
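The same idea, running the steps in order and stopping at the first failure, can be sketched in Python. The contents of pipeline.sh are not reproduced here, so the commands below are placeholders; in practice each step would be one of the two scripts above with whatever arguments its helper reports.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each command in order and stop at the first one that fails."""
    completed = []
    for command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {' '.join(command)}")
        completed.append(command)
    return completed


# Placeholder steps (assumptions): they only print a message so the sketch
# runs anywhere. Swap in the real script invocations and their arguments.
demo_steps = [
    [sys.executable, "-c", "print('step 1: dump the list of strains')"],
    [sys.executable, "-c", "print('step 2: extract the meta-data')"],
]

done = run_pipeline(demo_steps)
print(f"{len(done)} steps completed")
```

Raising on a non-zero exit code prevents the meta-data step from running against a missing or partial list of strains.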
Support
We do our best, but mistakes are inevitable. If you ever need to report a bug, raise a problem, or just talk to us, please do so! We will do our best to respond through this repository or at [email protected].