
Solvanam word corpus

Open arcturusannamalai opened this issue 4 years ago • 4 comments

Extract the Solvanam word corpus from the 2019 database dump. Reportedly, Wikipedia and tamilpulavar.org also contain about 1 crore (10 million) words.

arcturusannamalai avatar May 17 '20 06:05 arcturusannamalai

How do you plan to extract this? Through web scraping?

Parathantl avatar May 11 '21 16:05 Parathantl

I extracted the Solvanam words and added them here: https://github.com/KaniyamFoundation/all_tamil_words

Extract the tar.bz2 files to get them.

tshrinivasan avatar May 11 '21 16:05 tshrinivasan
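For readers who want to script the unpacking step, here is a minimal sketch using Python's standard `tarfile` module. The function name `extract_archives` and the directory arguments are hypothetical; the layout of the archives inside the `all_tamil_words` repository is not described in this thread, so the sketch simply unpacks every `*.tar.bz2` file it finds in a given folder.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir, dest_dir):
    """Extract every *.tar.bz2 archive in src_dir into dest_dir.

    Returns the member names that were extracted.
    Hypothetical helper; the paths are assumptions, not the repository layout.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(src_dir).glob("*.tar.bz2")):
        # "r:bz2" opens the archive with bzip2 decompression.
        with tarfile.open(archive, "r:bz2") as tar:
            extracted.extend(tar.getnames())
            tar.extractall(dest)
    return extracted
```

Only run this on archives you trust, since `extractall` writes whatever paths the archive contains.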

> Extracted solvanam and added here https://github.com/KaniyamFoundation/all_tamil_words
>
> extract the tar.bz2 files.

Does this mean the issue is closed?

VpkPrasanna avatar Oct 23 '22 14:10 VpkPrasanna

No, it's not closed; as in the other issue, we need to add this word list to the parent folder of https://github.com/Ezhil-Language-Foundation/open-tamil/blob/main/solthiruthi/data/tamilvu_dictionary_words.txt

Manual inspection of the file is also encouraged, to ensure that incorrect letters and words are not added, that words are unique, and so on.

arcturusannamalai avatar Oct 30 '22 00:10 arcturusannamalai
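The checks suggested in the last comment (no incorrect letters, unique words) can be partly automated before the manual pass. The following is a rough sketch, not part of open-tamil's API: the names `is_plausible_tamil_word` and `clean_wordlist` are hypothetical, and the rule used here (every character lies in the Tamil Unicode block U+0B80 to U+0BFF, and the word does not start with a combining mark) is a deliberately coarse assumption about what counts as an "incorrect" word.

```python
import unicodedata

# Tamil Unicode block boundaries.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def is_plausible_tamil_word(word):
    """Coarse check: only Tamil-block characters, no leading combining mark."""
    if not word:
        return False
    if any(not (TAMIL_START <= ord(ch) <= TAMIL_END) for ch in word):
        return False
    # Vowel signs and the pulli are combining marks (categories Mc/Mn)
    # and cannot begin a well-formed word.
    return not unicodedata.category(word[0]).startswith("M")


def clean_wordlist(lines):
    """Split lines into (kept, rejected), dropping blanks and duplicates.

    Words are NFC-normalized first so that equivalent encodings of the
    same word are treated as duplicates. Order of first appearance is kept.
    """
    seen = set()
    kept, rejected = [], []
    for line in lines:
        word = unicodedata.normalize("NFC", line.strip())
        if not word or word in seen:
            continue
        seen.add(word)
        if is_plausible_tamil_word(word):
            kept.append(word)
        else:
            rejected.append(word)
    return kept, rejected
```

The `rejected` list is meant as a starting point for manual inspection rather than a verdict; a check like this cannot detect misspellings made entirely of valid Tamil letters.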