open-tamil
Solvanam word corpus
Extract the Solvanam word corpus from the 2019 database dump. Reportedly, Wikipedia and tamilpulavar.org also contain 1 crore (10 million) words.
How do you plan to extract this? Through web scraping?
Extracted the Solvanam words and added them here: https://github.com/KaniyamFoundation/all_tamil_words
Extract the tar.bz2 files.
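A minimal sketch of that extraction step, assuming the repository has been cloned locally and the archives sit somewhere under the clone. The function name `extract_archives` and the directory arguments are illustrative and not part of either repository.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir, dest_dir):
    """Extract every *.tar.bz2 found under src_dir into dest_dir.

    Returns the list of archive paths that were extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Newer Python versions support extraction filters; use the safer
    # "data" filter when it is available, otherwise fall back to the default.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    extracted = []
    for archive in sorted(Path(src_dir).rglob("*.tar.bz2")):
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(dest, **extra)
        extracted.append(archive)
    return extracted
```

For example, `extract_archives("all_tamil_words", "all_tamil_words_extracted")` would unpack every archive in a local clone into a sibling folder; both folder names here are assumptions.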
Does that mean this issue is closed?
No, it's not closed; as in the other issue, we still need to add this word list in the parent folder of https://github.com/Ezhil-Language-Foundation/open-tamil/blob/main/solthiruthi/data/tamilvu_dictionary_words.txt
Manual inspection of the file is also encouraged, to ensure that incorrect letters and words are not added, that the words are unique, etc.
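A rough automated pre-check could narrow down what needs manual inspection. The sketch below assumes the word list is UTF-8 text with one word per line, and treats any character outside the Unicode Tamil block (U+0B80 to U+0BFF) as suspicious. That is a coarse test: it still accepts Tamil digits, symbols, and unassigned code points in the block. The function names are illustrative only.

```python
import unicodedata

# Unicode Tamil block; a coarse filter, not a full validity check.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def is_tamil_word(word):
    """True if the word is non-empty and every character is in the Tamil block."""
    return bool(word) and all(TAMIL_START <= ord(ch) <= TAMIL_END for ch in word)


def clean_wordlist(lines):
    """Split lines into (unique Tamil words, rejected entries), keeping input order."""
    seen = set()
    kept, rejected = [], []
    for line in lines:
        # NFC normalization makes differently encoded forms of the same
        # word compare equal before the uniqueness check.
        word = unicodedata.normalize("NFC", line.strip())
        if not word:
            continue  # skip blank lines
        if not is_tamil_word(word):
            rejected.append(word)
        elif word not in seen:
            seen.add(word)
            kept.append(word)
    return kept, rejected
```

The `rejected` list is what a reviewer would then look at by hand; words that pass this check can still be misspelled, so it does not replace manual inspection.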