tagalog-dictionary-scraper
Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com
Tagalog Dictionary Scraper :ledger:
Let us enrich our dictionary! ("Ating pag-ibayuhin ang ating talahuluganan!")
Collects Tagalog words from tagalog.pinoydictionary.com, a database of Tagalog words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.
42,723 words (as of Feb 19, 2023)
See the word list at tagalog_dict.txt
API Resource
Served through GitHub Pages, the scraped words are accessible as REST resources.
Host
https://raymelon.github.io/tagalog-dictionary-scraper/
Method
GET
Resources Available
Resource | Display | Endpoint |
---|---|---|
csv | default | /tagalog_dict.csv |
csv | with lines | /tagalog_dict_lines.csv |
json | default | /tagalog_dict.json |
json | with lines | /tagalog_dict_lines.json |
txt | default | /tagalog_dict.txt |
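As a rough illustration of consuming one of the endpoints above, the sketch below builds a resource URL from the host and fetches the `txt` resource with the Python standard library. The helper names (`resource_url`, `parse_word_list`, `fetch_words`) are made up for this example, and the one-word-per-line layout of `tagalog_dict.txt` is an assumption rather than something documented here.

```python
from typing import List
from urllib.request import urlopen

# Host as listed above; everything else in this sketch is illustrative.
HOST = "https://raymelon.github.io/tagalog-dictionary-scraper/"


def resource_url(endpoint: str) -> str:
    """Join the host with an endpoint such as '/tagalog_dict.txt'."""
    return HOST.rstrip("/") + "/" + endpoint.lstrip("/")


def parse_word_list(text: str) -> List[str]:
    """Split the response body into words.

    Assumption: the txt resource holds one word per line.
    Blank lines and surrounding whitespace are dropped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_words(endpoint: str = "/tagalog_dict.txt") -> List[str]:
    """GET the resource and parse it (needs network access)."""
    with urlopen(resource_url(endpoint)) as response:
        return parse_word_list(response.read().decode("utf-8"))
```

Calling `fetch_words()` would issue a single GET request against the host. The `csv` and `json` endpoints can be fetched the same way, but their exact layout is not described here, so they would need their own parsing.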
How is it done? :muscle:
Each web page is loaded and parsed, and the words enclosed in `<h2 class='word-entry'>` tags are extracted.
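The extraction step can be sketched as follows. To stay dependency-free, this example swaps in the standard library's `html.parser` instead of beautifulsoup4; with beautifulsoup4 the equivalent lookup would be roughly `soup.find_all("h2", class_="word-entry")`. The class name `WordEntryParser`, the function `extract_words`, and the sample markup are all made up for illustration and are not taken from `collect_tagalog.py` or from the site.

```python
from html.parser import HTMLParser
from typing import List


class WordEntryParser(HTMLParser):
    """Collects the text inside every <h2 class="word-entry"> element."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished entries
        self._buffer = None   # text pieces of the <h2> being read, else None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "word-entry" in classes:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text while inside a matching <h2>.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            word = "".join(self._buffer).strip()
            if word:
                self.words.append(word)
            self._buffer = None


def extract_words(html: str) -> List[str]:
    """Return the words found in all word-entry headings, in page order."""
    parser = WordEntryParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Made-up sample markup (an assumption about the page shape).
SAMPLE = """
<div><h2 class="word-entry"><a href="/word/aba/">aba</a></h2>
<p>definition</p>
<h2 class='word-entry'>abaka</h2>
<h2 class="other">skip me</h2></div>
"""
words = extract_words(SAMPLE)
```

Text nested inside child tags (such as a link within the heading) is still collected, because the parser buffers all character data until the closing `</h2>`.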
Included is an `html` snippet from tagalog.pinoydictionary.com containing the source of http://tagalog.pinoydictionary.com/list/a/, to serve as a point of reference for how dictionary words are extracted from the page.
Disclaimer:
I do not own the `html` code cited above; it is owned by tagalog.pinoydictionary.com.
How did the project start? :thought_balloon:
The main purpose of this project is to build a Tagalog dictionary database for Scrabble ®, though it may serve other uses as well.
Tools :pencil2:
- Python3 v3.5+ :snake:
- beautifulsoup4 v4.5.1 :ramen: :package: for parsing HTML pages: `python -m pip install -U pip beautifulsoup4`
- requests-futures v1.0.0 :zap: for request concurrency: `python -m pip install -U pip requests-futures`
Notes :pushpin:
- Run the scraper script `collect_tagalog.py`.
- See the output of collected words at `tagalog_dict.txt`.
- Match the `max_workers` value with the CPU and network capacity of the environment. See the comment for estimated values and expected download rates.
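To give a feel for what `max_workers` controls, here is a minimal sketch using the standard library's `ThreadPoolExecutor`, the kind of thread pool that requests-futures uses by default. It is not the code from `collect_tagalog.py`: `fetch_all`, `fake_fetch`, and the page URLs other than `/list/a/` are hypothetical, and the fetch function is a stand-in so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              max_workers: int = 8) -> Dict[str, str]:
    """Run fetch(url) for every URL with at most max_workers in flight."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP GET so the sketch runs without a network."""
    return "<html>" + url + "</html>"


# Hypothetical page URLs, following the /list/a/ pattern cited earlier.
PAGES = ["http://tagalog.pinoydictionary.com/list/%s/" % letter
         for letter in "abk"]
results = fetch_all(PAGES, fake_fetch, max_workers=2)
```

A larger `max_workers` lets more requests overlap, which helps only while the network and the remote server can keep up. Beyond that point extra workers mostly add contention, which is why the value is best tuned per environment.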