WiktionaryParser
Supporting German as base language
I want to use this project, but I would like to use the German Wiktionary. I intend to fork this project and make the required adaptations. Is there any interest in merging the result back via a PR? It would require some structural changes, but it might make adding more languages later easier.
Sure, a PR to support German would be great! You can fetch results in your local language from the English Wiktionary, though.
I'm aware I can get definitions in English of words in other languages. The problem is that the English version of Wiktionary has far fewer German words than the German version. I also think there is value in using the language you're learning FOR learning, once you reach that level of maturity, which is why I think being able to use different language versions is nice.
What I'm learning is that German Wiktionary structures its content very differently from English Wiktionary, so I think I will need to reinvent the wheel. Will make a PR when I'm done!
Yeah I'd wrongly assumed that the page structures for different wikis would be somewhat similar. Good luck with the PR! Let me know if I can help in any way.
Any update on this? I would be very interested in that code as well. Would not want to code it if someone else already did :D
Started but got distracted. Don't have much. Code away!
@DieseKartoffel I haven't started working on this, feel free to go ahead!
Just noticed this recently... I've been doing some work on a local fork to support other languages, though I'm only interested in definitions. The original code assumes there's a Table of Contents (TOC) and parses the page data from that. Pages for German words don't always have one, so I build one manually: I walk the nested headers in the page, look for the one that matches the language code, and then for the part-of-speech headers beneath it.
https://github.com/rroessler1/WiktionaryParser/commit/ae8fb901e87caea947db718999bf8146c050a0fd
Though I think it'd be cleaner to have a base parsing class and then override certain methods for different languages, but for now I'm taking the lazy approach.
I'm happy to clean it up a bit and submit a PR, but I think it would need a slightly larger design discussion of the best way to support multiple languages going forward, which to my knowledge hasn't happened yet.
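For anyone reading along, here is a minimal sketch of the "build the TOC from nested headers" idea. It is not the code from the linked commit. It uses only the standard library, and `HeadingCollector`, `build_outline`, and the sample HTML are hypothetical names and made-up markup chosen for illustration; real Wiktionary pages are messier.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for every heading, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Normalise whitespace inside the heading text.
            text = " ".join("".join(self._chunks).split())
            self.headings.append((self._level, text))
            self._level = None


def build_outline(html, language, parts_of_speech):
    """Return the part-of-speech headings nested under the language heading."""
    collector = HeadingCollector()
    collector.feed(html)
    outline = []
    language_level = None
    for level, text in collector.headings:
        if language_level is None:
            if language in text:
                language_level = level
            continue
        if level <= language_level:
            break  # a sibling or parent heading ends the language section
        if any(pos in text for pos in parts_of_speech):
            outline.append((level, text))
    return outline


# Made-up, simplified markup (an assumption, not copied from a real page).
SAMPLE = """
<h2>essen (Deutsch)</h2>
<h3>Verb</h3>
<h3>Substantiv, n</h3>
<h2>essen (Englisch)</h2>
<h3>Verb</h3>
"""

outline = build_outline(SAMPLE, "Deutsch", ["Verb", "Substantiv"])
```

With the sample above, `outline` only contains the two headings under the first `h2`, because the second `h2` closes the German section.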
@rroessler1 I initially made this project for a Telegram dictionary bot (which ended up using a different dictionary service) and didn't think of supporting other languages. It certainly needs design changes, and I think those should handle different types of pages rather than specific languages. For example, there could be words in languages other than German that don't have a ToC. I was thinking of handling parsing in stages, where the first stage tries to figure out the structure of the page from a ToC, or by checking the nested headers if a ToC isn't found. For now we could add your changes to this stage and incrementally support more languages. What do you think?
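A rough sketch of what that first stage could look like, using only the standard library. Everything here is an assumption made for illustration: `StructureScanner` and `detect_structure` are hypothetical names, and the sketch assumes ToC entries are marked with a `toctext` class on a `span`, which may not hold for every wiki or skin.

```python
from html.parser import HTMLParser


class StructureScanner(HTMLParser):
    """One pass over the page that records ToC entries and heading texts."""

    HEADING_TAGS = {"h2", "h3", "h4", "h5"}

    def __init__(self):
        super().__init__()
        self.toc_entries = []
        self.headings = []
        self._target = None       # list that receives the text being read
        self._closing_tag = None  # tag whose end finishes the current text
        self._chunks = []

    def _open(self, tag, target):
        self._target = target
        self._closing_tag = tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._target is not None:
            return  # already inside an element we are collecting
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "toctext" in classes:  # assumed ToC marker
            self._open(tag, self.toc_entries)
        elif tag in self.HEADING_TAGS:
            self._open(tag, self.headings)

    def handle_data(self, data):
        if self._target is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._target is not None and tag == self._closing_tag:
            self._target.append(" ".join("".join(self._chunks).split()))
            self._target = None


def detect_structure(html):
    """Stage one: prefer the ToC, fall back to headings when none is found."""
    scanner = StructureScanner()
    scanner.feed(html)
    if scanner.toc_entries:
        return "toc", scanner.toc_entries
    return "headings", scanner.headings


# Made-up, simplified pages (assumptions, not real Wiktionary markup).
WITH_TOC = (
    '<div id="toc"><ul>'
    '<li><span class="toctext">English</span></li>'
    '<li><span class="toctext">Noun</span></li>'
    "</ul></div><h2>English</h2><h3>Noun</h3>"
)
WITHOUT_TOC = "<h2>essen (Deutsch)</h2><h3>Verb</h3>"

toc_result = detect_structure(WITH_TOC)
fallback_result = detect_structure(WITHOUT_TOC)
```

Later stages would then receive the same kind of list either way, so they wouldn't need to know which source the structure came from.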
Agreed, I like the idea of handling it in stages and keeping it language-agnostic if possible.
But I think eventually there will have to be language-specific code as well. For example in German the meaning, origins, synonyms are all listed under one header under <p> tags, so I think the parser would just have to search for and extract the text under "Bedeutungen:", which would be a German-specific bit of code. Can't think of any way around this at the moment, unless you pass the responsibility off to the client. (example: essen)
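To make the language-specific part concrete, here is a minimal sketch with a shared base class and a German subclass that only supplies the label to look for. The class names and the sample markup are hypothetical. The sketch assumes each label sits in its own `<p>` and ends with a colon, that entries follow in `<p>` or `<dd>` elements, and that "Herkunft:" is the label of the next block; the real pages may differ.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects the text of each <p> and <dd> element, in document order."""

    BLOCK_TAGS = {"p", "dd"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._open = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._open = True
            self._chunks = []

    def handle_data(self, data):
        if self._open:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._open:
            self.blocks.append(" ".join("".join(self._chunks).split()))
            self._open = False


class BaseDefinitionParser:
    """Language-agnostic walk; subclasses only name the definitions label."""

    section_label = None

    def definitions(self, html):
        collector = BlockCollector()
        collector.feed(html)
        found = []
        inside = False
        for text in collector.blocks:
            if text.endswith(":"):
                # Assumption: any block ending in a colon is a section label.
                inside = text == self.section_label
            elif inside:
                found.append(text)
        return found


class GermanDefinitionParser(BaseDefinitionParser):
    section_label = "Bedeutungen:"


# Made-up markup with placeholder entries (not real dictionary content).
SAMPLE = """
<p>Bedeutungen:</p>
<dl><dd>[1] placeholder meaning one</dd><dd>[2] placeholder meaning two</dd></dl>
<p>Herkunft:</p>
<dl><dd>placeholder etymology</dd></dl>
"""

meanings = GermanDefinitionParser().definitions(SAMPLE)
```

Supporting another language with a similar layout would then mean adding a subclass with a different `section_label`, while pages with a different layout would need to override `definitions` itself.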
Are there any updates on this? Maybe a side branch or something?
@johnnybigoode I haven't been working on supporting other languages but maybe @rroessler1 has a fork that works?
I have a fork that supports Spanish, French, and German.
https://github.com/rroessler1/WiktionaryParser
It definitely works, but I haven't looked at it in a year so I'm not sure if it's missing new updates.
I would be happy to try and get it merged into here but definitely don't have time until August at the earliest.