wiktextract [IMPROVEMENT] Can we have etymology present and also group parts of speech together in an etymology?

Open meetDeveloper opened this issue 2 years ago • 28 comments

For the word "set", the image shows that parts of speech are nested under an etymology.

Basically, different origins of a word have different parts of speech.

Some words have just one origin, and in that case Wiktionary shows them as:

I think we should group the parts of speech into an array, where each element corresponds to a different origin.

So for the word "set" it would be:

[
  {
    word: "set",
    etymology: "etymology1",
    meanings: [
       // all the parts of speech that are present in this etymology go here.
    ]
  },
  {
    word: "set",
    etymology: "etymology2",
    meanings: [
       // all the parts of speech that are present in this etymology go here.
    ]
  }
]

Aug 24 '21 03:08 meetDeveloper

I'm planning to add etymology extraction and parsing some of the more common etymology relations. I expect to have it within a month or so. I think annotating each part-of-speech block with the etymology also makes sense.

Aug 26 '21 11:08 tatuylonen

I decided to implement this right away and just committed the changes. Each part-of-speech entry in the returned data now contains the "etymology-text" and "etymology-templates" fields (if an etymology section was present). Etymology-text contains the cleaned text of the full etymology section (i.e., templates and Lua modules are expanded, and HTML tags are stripped, with various other cleanup). The "etymology-templates" field contains a list of templates used in the Etymology section, each entry being a dictionary with the fields "name" for template name, "args" for (cleaned) template arguments, and "expansion" the cleaned text the template expands to. This also contains any nested template expansions, but some common templates not useful for determining etymological relations have been removed.
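
A minimal sketch of how a downstream project might read these fields, assuming each extracted entry is available as a JSON object. The sample entry, its values, and the shape of "args" (a dictionary keyed by argument position) are placeholders invented for illustration, and `etymology_relations` is a hypothetical helper name, not part of wiktextract.

```python
import json

# Hypothetical sample entry: the field names follow the comment above,
# but every value is a placeholder made up for this illustration.
SAMPLE_LINE = json.dumps({
    "word": "foo",
    "etymology-text": "From Bar baz.",
    "etymology-templates": [
        {"name": "der", "args": {"1": "xx", "2": "yy", "3": "baz"},
         "expansion": "Bar baz"},
    ],
})


def etymology_relations(entry):
    """Return (template name, args, expansion) tuples for one entry."""
    relations = []
    # The field is only present when an etymology section existed,
    # so fall back to an empty list.
    for template in entry.get("etymology-templates", []):
        relations.append((
            template["name"],
            template.get("args", {}),
            template.get("expansion", ""),
        ))
    return relations


entry = json.loads(SAMPLE_LINE)
for name, args, expansion in etymology_relations(entry):
    print(name, args, expansion)
```

A consumer interested in only some relations could filter on the template name before interpreting the arguments.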

Does this meet your needs? Let me know if a different output format would be more useful. (I think trying to process and encode all possible etymological relations is a bit beyond the scope of converting Wiktionary into structured data: there are a number that seem language-specific, and more will probably be defined in the future. The current format should make it very easy for downstream projects to parse the etymology information from the output.)

The extracted etymology sections should be included in the extracted raw data at https://kaikki.org/dictionary/rawdata.html starting tomorrow. You can also search for any word and they should be reflected in the JSON for each word (they are not yet rendered in the main text of the web pages, but when you click "Show JSON" at the bottom, you should see them included in the data.)

Aug 26 '21 13:08 tatuylonen

Regarding grouping parts-of-speech under an etymology, I think it is better to keep the current format. If you want to group them by etymology, you can group everything with the same "word" and "etymology-text" together. However, changing the data format so drastically at this point would force every downstream project already using this to change their code, and overall parsing it would become more complicated. I think a better tradeoff is for those few wanting to group them under an etymology to add the 10-20 lines needed to group by "word" and "etymology-text".
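
The grouping described above could be sketched roughly as follows. This is an illustration under assumptions, not code from wiktextract: `group_by_etymology` is a hypothetical helper, the "pos" field and the sample values are placeholders, and the etymology field name is passed as a parameter in case it differs in the data at hand.

```python
from collections import defaultdict


def group_by_etymology(entries, text_key="etymology-text"):
    """Group part-of-speech entries that share a word and etymology text."""
    groups = defaultdict(list)
    for entry in entries:
        # Entries without an etymology section are keyed by the empty string.
        key = (entry["word"], entry.get(text_key, ""))
        groups[key].append(entry)
    return dict(groups)


# Placeholder entries; "pos" and the etymology strings are assumptions.
entries = [
    {"word": "set", "pos": "verb", "etymology-text": "Origin A (placeholder)."},
    {"word": "set", "pos": "noun", "etymology-text": "Origin A (placeholder)."},
    {"word": "set", "pos": "noun", "etymology-text": "Origin B (placeholder)."},
]

groups = group_by_etymology(entries)
for (word, text), members in groups.items():
    print(word, repr(text), [m["pos"] for m in members])
```

Because the key is the etymology text itself, all entries of a word that lack etymology text end up in the same group.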

Aug 26 '21 15:08 tatuylonen

@tatuylonen Hey, sorry for the late reply. I was able to group using etymology-text. When can I expect the main page to show the etymology text? I tried checking "Show JSON" for the word set, but the values are empty strings. Also, could you by any chance parse the audio links as well? Thanks for your work, it is really helping the community.

Sep 04 '21 02:09 meetDeveloper

I had a bug in etymology extraction, where the etymology information got stored in a page-global data structure (and copied to all etymologies on the page) instead of being stored in the data for a particular etymology. This should now be fixed and should be reflected in the next update of the website on https://kaikki.org.

Note that on "set", the last etymology-text actually should be empty, as it is {{rfe}} in the source, and I expand rf* templates into empty, as they would generate dialog boxes and basically notes to editors rather than information about the word. The other four English etymologies on the page ("set") now seem to be correctly extracted. Etymologies for other languages also now look correct; however Etymology 2 for Polish has an empty etymology section, and it gets extracted as empty.

Please let me know if you notice additional problems.

Sep 05 '21 22:09 tatuylonen

@tatuylonen Thanks a lot :) I will look into this and let you know if I see any other problems. Also, a request: could you also parse the audio URLs?

Sep 10 '21 10:09 meetDeveloper

I think I just found out how to compute the URLs (I haven't been able to find this information before). I'll probably implement the URL calculation next week. (The URLs are not present in the expanded code, but are instead calculated in Javascript code, so it is not possible to just parse them. However, according to the information I found the URLs can be calculated from the file name.)
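
For illustration, one way to derive a URL from a file name is the hashed upload directory layout that MediaWiki uses for uploaded files, where the path contains the first one and two hex digits of the MD5 of the normalized file name. The sketch below assumes that layout; the thread does not say which method wiktextract uses, `commons_file_url` is a hypothetical helper, and the file name in the usage line is only a placeholder.

```python
import hashlib
from urllib.parse import quote

COMMONS_BASE = "https://upload.wikimedia.org/wikipedia/commons"


def commons_file_url(file_name):
    """Build an upload URL from a Commons file name (assumed hashed layout)."""
    # MediaWiki normalizes titles: spaces become underscores and the
    # first letter is upper-cased.
    name = file_name.strip().replace(" ", "_")
    name = name[:1].upper() + name[1:]
    # The directory components are prefixes of the MD5 of the normalized name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{COMMONS_BASE}/{digest[0]}/{digest[:2]}/{quote(name)}"


# Placeholder file name used only to show the call.
print(commons_file_url("en-us-set.ogg"))
```

A URL built this way only resolves if a file is actually stored under exactly that name.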

Sep 10 '21 21:09 tatuylonen

I just committed a change that includes "ogg_url" and "mp3_url" fields for audio files. This should be reflected on https://kaikki.org tomorrow (unless something goes wrong with the overnight update). It is possible there may still be minor changes in the URLs in the coming days as I get more data from actually downloading (currently about one in 200-500 downloads fails, but most of these seem to be incorrect information in Wiktionary or transcoded files that seem to be truly missing).

I'm also planning to make the full set of downloaded audio files available in bulk at https://kaikki.org/dictionary/rawdata.html (this may take a few days as I'm currently downloading the files and there are quite a few).

Sep 14 '21 12:09 tatuylonen

It looks like many of the URLs not found (<0.5% of all URLs are not found) have redirects in Wikimedia commons and thus the true URL cannot be calculated from the file name only. The options seem to be parsing redirects from a Commons dump or using the API to process them. I'm leaving this as a TODO entry, but I'm not making it a high priority at this time (both approaches have complications).

Sep 14 '21 13:09 tatuylonen

Okay, thanks for all you have done. 🙏 I have a question: are all languages supported by this extractor? And do all of the supported ones have the same level of support?

Sep 17 '21 01:09 meetDeveloper

Unless something goes wrong with my update run today, about 99.5% of all audio files from Wiktionary should be available for bulk download at https://kaikki.org/dictionary/rawdata.html tomorrow (Monday). It's a 22GB .tar file, containing about 942,000 sound files; most files are in both .ogg and .mp3 format. Files that have redirects on Wikimedia Commons are currently missing (their URLs cannot be determined without parsing Commons dumps or calling Commons APIs); I'm planning to implement fetching them at a later time.

Sep 19 '21 15:09 tatuylonen

Thanks a lot. Could you also show, along with the media file name, the Wiktionary URL from which you downloaded it? It would be great if that URL were also present in the JSON.

Also, one more question: is the level of support the same for other languages?

Sep 23 '21 12:09 meetDeveloper

Hey, I just noticed that the URL is present in the JSON; I must have missed it before. Thanks a lot for your work. It will really help me a lot. Do we have the same level of support for languages other than English?

Sep 25 '21 00:09 meetDeveloper

@tatuylonen I see that entries for other languages are parsed not from that language's own Wiktionary edition but from the English Wiktionary itself. For example, the entry for the Spanish word hola is parsed from this Wiktionary entry instead of this one. Why is that, and do we have support for the other one?

Sep 25 '21 05:09 meetDeveloper

First, using the English dictionary gives the glosses in English for all languages. As far as I know, glosses, subtitles, annotation and template names in other Wiktionary editions are in other languages (different for each edition).

The second and more practical reason though is that all editions use a completely different set of subtitles and templates. While templates are expanded by wiktextract, it would still need to deal with the different subtitles, linguistic markers, category names, etc. I may add this configurability later, but it is a major effort and I'm postponing it for future versions (i.e., not yet in 2.0). Besides just the programming effort, it needs someone who speaks the language in question well to assist in interpreting the data.

Sep 27 '21 12:09 tatuylonen

The URLs for pronunciation audio files are also present for non-English language entries in the English Wiktionary. (Note that the language of the edition primarily means the language in which the glosses and other data are written; the English Wiktionary contains extensive data for hundreds of languages and some data for thousands of languages.)

Sep 27 '21 12:09 tatuylonen

I decided to make a change in the etymology formats: the field names are now etymology_text and etymology_templates. I decided to make the change for consistency with other field names in the data. Sorry about the extra work to adjust the names in your code.
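
Downstream code that has to read data produced both before and after this rename could use a small compatibility helper along these lines. This is only a sketch; `etymology_field` is a hypothetical name, and the sample entries are placeholders.

```python
def etymology_field(entry, suffix, default=None):
    """Read an etymology field under its new or old name.

    suffix is "text" or "templates"; the underscore form is tried first,
    then the older hyphenated form.
    """
    for key in (f"etymology_{suffix}", f"etymology-{suffix}"):
        if key in entry:
            return entry[key]
    return default


# Placeholder entries using the old and the new field names.
old_style = {"word": "foo", "etymology-text": "Placeholder origin."}
new_style = {"word": "foo", "etymology_text": "Placeholder origin."}

print(etymology_field(old_style, "text"))
print(etymology_field(new_style, "text"))
print(etymology_field(new_style, "templates", default=[]))
```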

Oct 06 '21 12:10 tatuylonen

The change should be reflected on https://kaikki.org tomorrow.

Oct 06 '21 12:10 tatuylonen

@tatuylonen No problem, I will make the changes on my side. Also, I was looking at the examples section in the post-processed JSON. I found that examples and quotations are treated alike, for example:

Here "Hello? How may I help you?" is an example, but "Hello. This is Marsha. - Yes, Marsha." is a quotation. Is there any reason we have not separated them?

Oct 08 '21 03:10 meetDeveloper

I just added a "type" field in examples. It will now contain "example" or "quotation" if type could be determined. (Sometimes usage examples are just written out as text in Wiktionary, and in those cases the type cannot be determined. Though I suppose if it has a "ref" field it probably is a quotation.)

I also made various fixes to usage example processing today. The type field and the fixes should be reflected on https://kaikki.org tomorrow unless something goes wrong with my overnight update.
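
A consumer could combine the "type" field with the "ref" heuristic mentioned above roughly as follows. This is a sketch under assumptions: `example_kind` is a hypothetical helper, the "text" key and the "ref" value in the sample data are placeholders, and treating any entry with a "ref" as a quotation is only the guess suggested in the comment, not a guarantee.

```python
def example_kind(example):
    """Classify a usage example as "example", "quotation", or "unknown"."""
    kind = example.get("type")
    if kind in ("example", "quotation"):
        return kind
    # Heuristic: an entry that carries a reference is probably a quotation.
    if "ref" in example:
        return "quotation"
    return "unknown"


# Sample data; the "text" key and the "ref" value are assumptions.
samples = [
    {"text": "Hello? How may I help you?", "type": "example"},
    {"text": "Hello. This is Marsha. - Yes, Marsha.", "ref": "(placeholder reference)"},
    {"text": "Plain text without markers."},
]

for sample in samples:
    print(example_kind(sample), "|", sample["text"])
```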

Nov 19 '21 19:11 tatuylonen

@tatuylonen Thanks for all the work that you have put into this extractor. I was integrating this into my API and found that audio URLs were missing for the word set and other words.

Dec 18 '21 02:12 meetDeveloper

@tatuylonen Any updates on above? Btw Happy new year :)

Jan 02 '22 03:01 meetDeveloper

@tatuylonen It seems that the etymology_templates field is an empty array for every word as of today. Any idea what's going on?

Jan 03 '22 00:01 acornellier

This was caused by a change just before Christmas in clean_node() in page.py, which caused the template_fn argument not to be passed forward correctly. Unfortunately my tests didn't catch it (I'll need to add more tests). This should now be fixed in the code. I just started regenerating the site, and the fix should be reflected on https://kaikki.org in about 15 hours (unless something goes wrong with the run).

Audio URLs and etymology templates were missing for the same reason and this should fix both.

This was one of those classical "make a change just before your vacation" situations... I'm sorry about that.

Jan 09 '22 22:01 tatuylonen

@tatuylonen No worry, happens with all of us :)

I was looking at https://kaikki.org/dictionary/rawdata.html and there I noticed that it is written:

For post-processed data, please look at the download links at the end of the main page for each language (or the page for all languages combined) under https://kaikki.org/dictionary/.

Also I noticed on this page https://kaikki.org/dictionary/ at the end it is written:

To download the full raw data that was extracted from Wiktionary using wiktextract, please see the raw data download page. To download the post-processed data used on this site for the kaikki.org machine-readable dictionary, please look for the download links near the end of the main page for each language (or the all languages combined page) and various subpages.

Could you explain the difference between the raw data and the post-processed data? What is changed? As far as I understand, the data is extracted from the Wiktionary dump using wiktextract; what else is done as post-processing?

Also, suppose I download the post-processed data for English from https://kaikki.org/dictionary/English/kaikki.org-dictionary-English.json, how frequently does it get updated?

Jan 15 '22 08:01 meetDeveloper

@tatuylonen Any updates on above query?

Jan 28 '22 10:01 meetDeveloper

@tatuylonen I had a query regarding the approach suggested by you earlier.

Regarding grouping parts-of-speech under an etymology, I think it is better to keep the current format. If you want to group them by etymology, you can group everything with the same "word" and "etymology-text" together. However, changing the data format so drastically at this point would force every downstream project already using this to change their code, and overall parsing it would become more complicated. I think a better tradeoff is for those few wanting to group them under an etymology to add the 10-20 lines needed to group by "word" and "etymology-text".

Some words have more than one etymology, and some of their etymology sections are empty. In that case one cannot group correctly by etymology text, and one cannot fall back on the word to group either, because it is often the same across different etymologies.

Also, I was going through the entry layout page on Wiktionary, and the structure it describes is:

===Etymology 1===
====Pronunciation====
====Noun====
===Etymology 2===
====Pronunciation====
====Noun====
====Verb====

Would it be possible for a v2 of this API to provide the JSON response in this form? I looked at Oxford and other APIs, and they follow the same template. Following it would make it easier for developers to use this API in their projects as a replacement for any closed-source API they were using earlier.

Mar 04 '22 22:03 meetDeveloper

Some words have more than one etymology, and some of their etymology sections are empty. In that case one cannot group correctly by etymology text, and one cannot fall back on the word to group either, because it is often the same across different etymologies.

I've run into this as well. One potential solution that would not break data format compatibility for existing projects would be to add a key etymology_number. If the term is under a section of the form ===Etymology n===, the value would be n. If the term is under a section of the form ===Etymology===, or not under any etymology, the key would be omitted (or the value could be 0).

Aug 17 '22 16:08 jmviz

@tatuylonen @kristian-clausal I think the suggestion by @jmviz looks great; it would help a lot to add this information to the response. It would make it super easy to group by etymology. Would it be possible to add this?

Sep 24 '22 12:09 meetDeveloper

I just implemented this - basically it adds "etymology_number" key to all entries that come from an Etymology section with a number in the title. This turned out to be only three lines of code added.

This will probably be reflected on https://kaikki.org by Monday.
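
With this key, grouping into the nested shape requested at the top of the thread might look roughly like the sketch below. It is an illustration only: `group_by_etymology_number` is a hypothetical helper, the "pos" field and the sample entries are placeholders, and using 0 for entries without the key is a choice made here, not something wiktextract emits.

```python
from collections import defaultdict


def group_by_etymology_number(entries):
    """Nest part-of-speech entries under (word, etymology number)."""
    grouped = defaultdict(list)
    for entry in entries:
        # Entries from an unnumbered Etymology section (or none) lack the
        # key; 0 is used here as the fallback group.
        number = entry.get("etymology_number", 0)
        grouped[(entry["word"], number)].append(entry)
    return [
        {"word": word, "etymology_number": number, "meanings": grouped[(word, number)]}
        for (word, number) in sorted(grouped)
    ]


# Placeholder entries; "pos" is an assumed field name.
entries = [
    {"word": "set", "pos": "verb", "etymology_number": 1},
    {"word": "set", "pos": "noun", "etymology_number": 1},
    {"word": "set", "pos": "noun", "etymology_number": 2},
    {"word": "hello", "pos": "intj"},
]

result = group_by_etymology_number(entries)
for group in result:
    print(group["word"], group["etymology_number"], [m["pos"] for m in group["meanings"]])
```

When the data mixes several languages, the grouping key would presumably also need to include the entry's language so that identically spelled words are kept apart.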

Oct 06 '22 21:10 tatuylonen
