[Feature Request] Add audio files from Wiktionaries into .zim files

Open kelson42 opened this issue 1 year ago • 9 comments

From overview created by ghost: kiwix/overview#60

@kelson42 Some Wiktionaries contain a huge number of pronunciations in .ogg format by native speakers. As an example, the German Wiktionary contains around 700,000 pronunciations.

The current .zim files of Wiktionaries usually contain only plain text and do not include the audio files.

Language learners in developing countries with no internet connection cannot access online websites to listen to pronunciations. Wiktionaries include IPA transcriptions, but that is not enough. The actual pronunciation of a native speaker is very helpful.

Would it be possible to include pronunciations from Wiktionaries in the .zim files offered by Kiwix?

EDIT: One way to reduce the size of the .ogg audio files would be to convert them to .opus format. That conversion can reduce the size of the audio by 60-70%.

Here is a screenshot of a German Wiktionary .zim file used in GoldenDict:

image

image

PS.

  1. I use the German Wiktionary in .zim format with GoldenDict. I live in a remote village in South America. Thank you very much, really! The Kiwix project has saved me because I almost never have an internet connection.

  2. The German Wiktionary currently weighs 1.4 GiB. If pronunciations in .opus format are added, it would weigh around 5-6 GiB.

  3. The Wiktionaries with the most audio files are English, French, and German. If the audio files are compressed, the tradeoff would be reasonable. At most 3-4 GiB would be added to Wiktionaries in the main languages. Other languages such as Spanish and Italian have fewer pronunciations, and less than 1 GiB would be added to their .zim files.
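
For anyone who wants to sanity-check these size figures, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this post (1.4 GiB today, roughly 700,000 German pronunciations, 3-4 GiB of added audio); the function name is made up for illustration, and the result is only as good as those estimates.

```python
# Back-of-the-envelope check using only figures quoted in this thread.
GIB = 1024 ** 3

current_zim_gib = 1.4          # German Wiktionary ZIM today (from the post)
added_audio_gib = (3.0, 4.0)   # estimated compressed audio to add (from the post)
pronunciations = 700_000       # approximate German clip count (from the post)


def implied_bytes_per_clip(added_gib: float, clips: int) -> float:
    """Average size each clip may have for the added-size estimate to hold."""
    return added_gib * GIB / clips


low, high = (implied_bytes_per_clip(g, pronunciations) for g in added_audio_gib)
total_low, total_high = (current_zim_gib + g for g in added_audio_gib)

print(f"Implied average clip size: {low / 1024:.1f}-{high / 1024:.1f} KiB")
print(f"Projected ZIM size: {total_low:.1f}-{total_high:.1f} GiB")
```

Adding 3-4 GiB to 1.4 GiB gives 4.4-5.4 GiB, so the "around 5-6 GiB" figure above sits at the upper end of what these numbers imply.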

kelson42 avatar Aug 13 '22 11:08 kelson42

Some Wiktionaries (e.g. French) might have >10 pronunciations per word. For example the word "maison" has 13 audio files.

As far as I know, only the French Wiktionary has so many audio files, because it has a separate Pronunciation section and audio is added in bulk using @lingua-libre [ www.lingualibre.org ].

Suggestions to reduce the size of .zim files with audio:

  • Limit the number of .ogg files per entry (e.g. max. 3)
  • Convert .ogg files to another format for compression (e.g. .opus)
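
The two suggestions above could be sketched roughly as follows. This is not how mwoffliner works; it is a standalone Python illustration. The helper names, the 24k bitrate, and the file names are all made up, and the command assumes an ffmpeg build with the libopus encoder available. The sketch only builds the command line and does not run it.

```python
from pathlib import PurePosixPath

MAX_AUDIO_PER_ENTRY = 3  # hypothetical cap, taken from the "max. 3" suggestion


def cap_audio(files: list[str], limit: int = MAX_AUDIO_PER_ENTRY) -> list[str]:
    """Keep only the first `limit` audio files listed for an entry."""
    return files[:limit]


def opus_command(src: str, bitrate: str = "24k") -> list[str]:
    """Build (but do not run) an ffmpeg command that re-encodes `src` to Opus."""
    dst = str(PurePosixPath(src).with_suffix(".opus"))
    return ["ffmpeg", "-i", src, "-c:a", "libopus", "-b:a", bitrate, dst]


# 13 made-up file names standing in for the audio attached to "maison".
entry_audio = [f"Fr-maison-{i}.ogg" for i in range(1, 14)]
kept = cap_audio(entry_audio)
for name in kept:
    print(" ".join(opus_command(name)))
```

Keeping the first three files is the simplest possible rule; choosing which recordings to keep (for example by speaker or region) would need a smarter selection step.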

image

kelson42 avatar Aug 13 '22 11:08 kelson42

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 avatar Aug 13 '22 11:08 kelson42

dafuq is wrong with these people. I'm francophone and can maybe pick three differences within these twelve. We need to parse and pick up one sound, otherwise we'll end up with tons of junk like the above.

kelson42 avatar Aug 13 '22 11:08 kelson42

Some automatic tools can be used to evaluate the pronunciation.

kelson42 avatar Aug 13 '22 11:08 kelson42

We can do so, mwoffliner can do it... but this is a bit subtle. For example, we want the .ogg audio files but not the video files. We should look in detail at how to do it.
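
To illustrate the subtlety, here is a minimal Python sketch of an "audio yes, video no" filter. It is not mwoffliner code: the function name and the extension lists are assumptions, and the optional `mime` argument stands for whatever media type the scraper happens to know about the file. An .ogg container can hold video as well as audio, so the extension alone is only a fallback.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed extension lists, for illustration only.
AUDIO_EXTENSIONS = {".ogg", ".oga", ".opus", ".wav", ".mp3", ".flac"}
VIDEO_EXTENSIONS = {".ogv", ".webm", ".mp4"}


def wants_file(name: str, mime: Optional[str] = None) -> bool:
    """Return True for files that look like audio, False for video or unknown."""
    if mime:
        # Trust an explicit audio/video media type when one is available.
        if mime.startswith("video/"):
            return False
        if mime.startswith("audio/"):
            return True
    # Otherwise fall back to the file extension.
    ext = PurePosixPath(name).suffix.lower()
    if ext in VIDEO_EXTENSIONS:
        return False
    return ext in AUDIO_EXTENSIONS


for candidate in ["De-Haus.ogg", "clip.ogv", "En-us-house.wav", "photo.jpg"]:
    print(candidate, wants_file(candidate))
```

A generic type such as application/ogg falls through to the extension check, which is exactly the case where a closer look at the file would be needed.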

kelson42 avatar Aug 13 '22 11:08 kelson42

This makes a lot of sense, at least in a Wiktionary context. I'm not super convinced that adding yet another flavour to the existing three (mini/nopic/maxi) would do us any good. The question at this stage would therefore be: could we tweak mwoffliner so that it produces Wiktionary files that include the .ogg files in maxi?

kelson42 avatar Aug 13 '22 11:08 kelson42

@kelson42 Does Wiktionary actually have lots of video files? My impression is that video is very rare, if used at all. So in that case, you could just try an 'all'/full scrape (not maxi) on a smaller dictionary like the Spanish one to test the file size increase, assuming that mwoffliner can pick up .ogg audio when it's in full mode. Wiktionaries are not very big ZIM files anyway, compared to Wikipedia...

Jaifroid avatar Sep 10 '22 11:09 Jaifroid

@kelson42 Does Wiktionary actually have lots of video files?

Yes, but a few audio files. Actually, for some reason it seems we already have a few audio files. I need to assess the situation.

kelson42 avatar Sep 10 '22 13:09 kelson42

There are 3 Wiktionaries with plenty of audio files:

  • English
  • German
  • French

The German version has approximately 800,000 audio files. Around 98% of them are in .ogg format, but there are also .wav files.

Videos are so rare that they can be ignored. However, pronunciations from Wiktionary are extremely useful for language learners with no internet connection.

Just imagine, how can someone learn to pronounce English words without being able to listen to them?

This issue is quite important for young students learning languages, and the size of the audio files is not so big in comparison with Wikipedia.

Immunize2 avatar Oct 04 '22 15:10 Immunize2

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar May 26 '23 17:05 stale[bot]