wikokit
Machine-readable Wiktionary
Language is a city to the building of which every human being brought a stone.
Ralph Waldo Emerson
Wikokit - Machine-readable Wiktionary
Stone I. Parser wikokit. This program parses Wiktionaries, then builds and populates machine-readable Wiktionary databases.
Stone II. PHP API (piwidict project) to work with machine-readable Wiktionary.
Stone III. Dictionary kiwidict. A visual interface to the parsed English Wiktionary and Russian Wiktionary databases.
The goal of this project is to extract semi-structured information from Wiktionary and construct a machine-readable dictionary (database + API + GUI).
Download new Wiktionary parsed databases from this page.
Stone III: Dictionary kiwidict - Android applications
- kiwidict: offline multilingual dictionary and thesaurus based on the English Wiktionary.
- kiwidict-ru: offline multilingual dictionary and thesaurus based on the Russian Wiktionary.
- magnetowordik: word game based on data extracted from the English Wiktionary.
The graphical user interface (kiwidict and kiwidict-ru) supports (see release_notes.txt):
- word filtering by language code (e.g. de, fr);
- wildcard characters: the percent sign (%) matches zero or more characters, and the underscore (_) matches a single character;
- todo: list only words that have meanings and/or semantic relations (use checkboxes).
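The two wildcards are the same ones used by the SQL LIKE operator, so their behaviour can be tried out directly in SQLite. The sketch below is only an illustration: the `entry` table, its `word` and `lang_code` columns, and the `find_words` helper are hypothetical and do not reflect the actual kiwidict schema or code.

```python
import sqlite3


def find_words(conn, pattern, lang_code=None):
    """Return words matching an SQL LIKE pattern, optionally filtered by language code."""
    sql = "SELECT word FROM entry WHERE word LIKE ?"
    params = [pattern]
    if lang_code is not None:
        sql += " AND lang_code = ?"
        params.append(lang_code)
    sql += " ORDER BY word"
    return [row[0] for row in conn.execute(sql, params)]


# Hypothetical toy table for the demonstration; the real database schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (word TEXT, lang_code TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [("cat", "en"), ("cut", "en"), ("coat", "en"), ("chat", "fr"), ("Katze", "de")],
)

print(find_words(conn, "c_t"))        # exactly one character between "c" and "t"
print(find_words(conn, "c%t"))        # zero or more characters between "c" and "t"
print(find_words(conn, "c%t", "fr"))  # same pattern, restricted to language code "fr"
```

With this toy data, `c_t` matches only the three-letter words, while `c%t` also matches "coat" and "chat".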
After installation you can find the parsed Wiktionary database in SQLite format on your phone in the folder SD card/kiwidict/.
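Because the database is a plain SQLite file, it can be copied from the phone and inspected with any SQLite client. As a minimal sketch (the `list_tables` helper is hypothetical and makes no assumption about table names), the following lists the tables of such a file by querying SQLite's built-in `sqlite_master` catalog:

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # Open read-only via a file: URI so that a wrong path raises an error
    # instead of silently creating a new, empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in rows]
    finally:
        conn.close()
```

Calling `list_tables` with the path of the copied file returns the table names, which is a convenient starting point before reading the schema documentation.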
Stone I: Parser and dictionary description
I) The ultimate goal (in the distant future) is to extract all information (i.e. all sections of an entry) from all Wiktionaries and convert the data to a machine-readable format.
II) Current result. The machine-readable Wiktionary now contains the following information extracted from the Russian Wiktionary and the English Wiktionary:
- word's language and part of speech;
- meanings / definitions;
- semantic relations;
- translations;
- (^) context labels (from definitions);
- (^) quotations (text + bibliographic data).
(^) Context labels and quotations were extracted only from the Russian Wiktionary.
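To make the list above concrete, here is a rough sketch of how one parsed entry could be modelled in memory. All class and field names are hypothetical, the grouping of relations and translations under a meaning is an assumption, and the sample values are invented for illustration; the actual relational schema is described in the paper and schema page referenced in the next section.

```python
from dataclasses import dataclass, field


@dataclass
class Quotation:
    text: str          # quoted sentence
    author: str = ""   # bibliographic data (only extracted from the Russian Wiktionary)
    title: str = ""
    year: str = ""


@dataclass
class Meaning:
    definition: str
    labels: list[str] = field(default_factory=list)                  # context labels
    quotations: list[Quotation] = field(default_factory=list)
    relations: dict[str, list[str]] = field(default_factory=dict)    # relation type -> words
    translations: dict[str, list[str]] = field(default_factory=dict) # language code -> words


@dataclass
class Entry:
    word: str
    lang_code: str       # language of the word, e.g. "en"
    part_of_speech: str
    meanings: list[Meaning] = field(default_factory=list)


# Invented sample values, for illustration only.
entry = Entry(
    word="cat",
    lang_code="en",
    part_of_speech="noun",
    meanings=[
        Meaning(
            definition="a small domesticated feline",
            relations={"hypernyms": ["animal"]},
            translations={"fr": ["chat"], "de": ["Katze"]},
        )
    ],
)
print(entry.meanings[0].translations["fr"])
```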
Machine-readable Wiktionary framework:
I would like all two hundred Wiktionaries to be parsed by this parser, but I know only Russian and English :)
If you are a developer interested in adding modules to parse "your Wiktionary", then
- start with the paper describing the database (tables and relations) of the machine-readable Wiktionary: Transformation of Wiktionary entry structure into tables and relations in a relational database schema. 2010. Note that there are new tables (absent from the publication) related to quotations and context labels; see Machine-readable database schema;
- GettingStartedWiktionaryParser: install the parser and try to parse the English Wiktionary and the Russian Wiktionary;
- play with a parsed English or Russian Wiktionary SQL dump (download Wiktionary parsed databases);
- OneMoreWiktionary: extend the parser in order to extract invaluable information from your Wiktionary.
Statistics
The machine-readable dictionary database statistics:
- English Wiktionary: total, semantic relations, translations, part of speech
- Russian Wiktionary: total, semantic relations, translations, part of speech, context labels, quote (languages & sources, authors with clusters, other authors, years)
Project structure
Wiki tool kit (wikokit) contains several wiki-related projects:
./common_wiki: common (low-level) functions for handling Wikipedia and Wiktionary data in a MySQL database.
./common_wiki_jdbc: functions for handling Wiktionary data in MySQL and SQLite databases (JDBC, Java SE) (depends on common_wiki.jar).
./android/common_wiki_alink: Eclipse copy (source link) of ./common_wiki (!NetBeans).
./android/common_wiki_android: functions for accessing Wiktionary in the Android SQLite version of the database (depends on common_wiki.jar).
./android/magnetowordik: Android word game (Wiktionary thesaurus).
./hits_wiki: API for accessing Wikipedia in a MySQL database, and algorithms for searching for synonyms in Wikipedia (depends on jcfd.jar, common_wiki.jar).
./TGWikiBrowser: visual browser for searching for synonyms in a local or remote Wikipedia (depends on hits_wiki.jar and common_wiki.jar).
./wikidf: Wiki Index Database (list of lemmas and links to the wiki pages that contain these lemmas).
./wikt_parser: the Wiktionary parser creates a MySQL database (like WordNet) from a Wiktionary MySQL dump file. The project goal is to convert Wiktionary articles to a machine-readable format (depends on common_wiki, common_wiki_jdbc).
./wiwordik: visualization of the parsed Wiktionary database (wiki + word = wiwordik).
Code from the previous project, Synarcher, is used in wikokit.
Further reading
In English
- A. Krizhanovsky, A. Smirnov. An approach to automated construction of a general-purpose lexical ontology based on Wiktionary // Journal of Computer and Systems Sciences International, 2013, Vol. 52, No. 2, pp. 215–225.
- A. Smirnov, T. Levashova, A. Karpov, I. Kipyatkova, A. Ronzhin, A. Krizhanovsky, N. Krizhanovsky. Analysis of the quotation corpus of the Russian Wiktionary // Research in Computing Science, Vol. 56, pp. 101-112, 2012.
- A. Krizhanovsky. A quantitative analysis of the English lexicon in Wiktionaries and WordNet // International Journal of Intelligent Information Technologies (IJIIT), October-December 2012, Vol. 8, No. 4, pp. 13-22.
- F. Lin, A. Krizhanovsky. Multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint // In: Proceedings of the 13th Russian Conference on Digital Libraries RCDL’2011. October 19-22, Voronezh, Russia. – pp. 19-26.
- A. A. Krizhanovsky. Transformation of Wiktionary entry structure into tables and relations in a relational database schema. Preprint. 2010.
- A. A. Krizhanovsky. The comparison of Wiktionary thesauri transformed into the machine-readable format. Preprint. 2010.
- A. A. Krizhanovsky, F. Lin. Related terms search based on WordNet / Wiktionary and its application in Ontology Matching // In: Proceedings of the 11th Russian Conference on Digital Libraries RCDL’2009. September 17-21, Petrozavodsk, Russia. – pp. 363-369.
In Russian
- Крижановский А.А., Смирнов А.В., Круглов В.М., Крижановская Н.Б., Кипяткова И.С. Автоматическое извлечение словарных помет из Русского Викисловаря // Труды СПИИРАН. 2014. Вып. 2(33). С. 164-185.
- Крижановский А.А., Смирнов А.В. Подход к автоматизированному построению общецелевой лексической онтологии на основе данных викисловаря // Известия РАН. Теория и системы управления. N2, 2013, С. 53-63.
- Крижановский А. А., Луговая Н. Б., Круглов В. М. Извлечение и анализ дат произведений в корпусе цитат онлайн-словаря // Информационные технологии и письменное наследие: материалы VI междунар. науч. конф. El'Manuscript-12 (Петрозаводск, 3-8 сентября 2012) / отв. ред. В.А.Баранов, А.Г.Варфоломеев. – Петрозаводск; Ижевск, 2012. – 328 с. – C. 137—142. ISBN 978-5-8021-1402-5. (PDF)
- Смирнов А.В., Круглов В.М., Крижановский А.А., Луговая Н.Б., Карпов А.А., Кипяткова И.С. Количественный анализ лексики русского WordNet и викисловарей // Труды СПИИРАН. 2012. Вып. 23. С. 231–253.
- Крижановский А. Количественный анализ лексики английского языка в викисловарях и Wordnet // Труды СПИИРАН. 2011. Вып. 19. С. 87–101.
- Крижановский А. Оценка использования корпусов и электронных библиотек в Русском Викисловаре // Труды международной конференции «Корпусная лингвистика–2011». – СПб.: С.-Петербургский гос. университет, Филологический факультет, 2011, 348 с. – C. 217—222. ISBN 978-5-8465-0005-5.
- Крижановский А. Преобразование структуры словарной статьи Викисловаря в таблицы и отношения реляционной базы данных. Препринт. 2010.
- Крижановский А. Сравнение тезаурусов Русского и Английского Викисловарей, преобразованных в машиночитаемый формат. Препринт. 2010.
- Крижановский А. Машинная обработка Русского Викисловаря // Вики-конференция 2009. 24—25 октября, Санкт-Петербург.
See also
- Java Wiktionary Library (JWKTL)
- perl-wiktionary-parser // github.com, Perl module
- wiktionary_parser // github.com, Perl module
- Dbnary
- YARN (open WordNet-like thesaurus for Russian)
License
This program is multi-licensed and may be used under the terms of any of the following licenses:
- EPL, Eclipse Public License V1.0 or later, http://www.eclipse.org/legal
- LGPL, GNU Lesser General Public License V3.0 or later, http://www.gnu.org/licenses/lgpl.html
- GPL, GNU General Public License V3.0 or later, http://www.gnu.org/licenses/gpl.html
- AL, Apache License, V2.0 or later, http://www.apache.org/licenses
- BSD, New BSD License, http://www.opensource.org/licenses/bsd-license
Links
See documentation.