corpuscrawler
Add (Modern Standard) Arabic language
Is there any work being done regarding any Arabic dialects?
We can start with http://www.dw.com/ar/, which is Modern Standard Arabic. I think MSA is a good start, and we can add regional dialects later.
Please list here any source you think we should add, for MSA or regional dialects.
Are these Modern Standard Arabic, too? Adding them would be a matter of 1 line each. http://www.bbc.com/arabic https://arabic.sputniknews.com/
Seeds for crawling a language corpus in Moroccan Arabic (BCP47 language code ary):
- http://archive.cawalisse.com/sitemap.xml
- http://ahdath.info/ (no sitemap, but it uses /category URLs, so util.find_wordpress_urls() might work)
- https://www.aljamaa.net/ar/ (couldn’t find the sitemap, but it also uses /category URLs)
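For the seeds that do publish a sitemap, collecting the page URLs could look roughly like the sketch below. It uses only the Python standard library. The helper name extract_sitemap_urls and the sample XML are made up for illustration and are not part of corpuscrawler.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def extract_sitemap_urls(sitemap_xml):
    """Return all <loc> URLs found in a sitemap or sitemap index.

    Hypothetical helper for illustration; not a corpuscrawler function.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(SITEMAP_NS + 'loc'):
        # Skip empty <loc> elements and trim stray whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sample; a real crawler would fetch sitemap.xml over HTTP first.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/a</loc></url>
  <url><loc> http://example.org/b </loc></url>
</urlset>"""

print(extract_sitemap_urls(SAMPLE))
```

Sites without a sitemap would still need a different discovery strategy, such as walking their /category pages.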
For Algerian Arabic (BCP47 language code arq), see http://www.onlinenewspapers.com/algeria.htm, but I wouldn’t know if any of these are in Standard Arabic.
Yes, @brawer. These two are definitely both Standard Arabic: http://www.bbc.com/arabic https://arabic.sputniknews.com/
About the country-specific news services, I can't tell whether they are in local dialects of Standard Arabic or in the regional Arabic. So I think we need to ask for some help reviewing them one by one.
Actually, ar/ara is the macrolanguage, and we had better use arb for Standard Arabic. That's not what websites do, but I think it's safe to assume that the content here is arb. What do you think?
So far, I’ve tried to follow the BCP47 language tags as per Unicode conventions. There, macrolanguage codes stand for the individual language that “everyone” (a typical webmaster or programmer who isn’t deeply rooted in the internationalization scene) means when they see that code. For example, according to Unicode/ICU/CLDR, the code for Estonian is et instead of ekk; the code for Modern Standard Arabic is ar instead of arb; the code for Uzbek is uz instead of uzn; the code for Mandarin is zh; etc. For the full list, see the languageAlias data in CLDR.
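To make the convention concrete, here is a minimal sketch of normalizing codes before they are used as corpus names. The table is a hand-copied subset limited to the examples given above, and the helper name normalize_language_code is hypothetical. The authoritative mapping remains the languageAlias data in CLDR.

```python
# Hand-copied subset of the aliases mentioned in this thread.
# The full, authoritative list is the languageAlias data in CLDR.
PREFERRED_CODE = {
    'ekk': 'et',  # Estonian
    'arb': 'ar',  # Modern Standard Arabic
    'uzn': 'uz',  # Uzbek
}


def normalize_language_code(code):
    """Map an individual-language code to the tag preferred by CLDR.

    Hypothetical helper; codes without a known alias (for example the
    dialect codes ary and arq) are returned unchanged, in lowercase.
    """
    code = code.lower()
    return PREFERRED_CODE.get(code, code)


for code in ['arb', 'ary', 'ekk']:
    print(code, '->', normalize_language_code(code))
```

With this approach, Modern Standard Arabic content is stored under ar, while dialect corpora keep their own codes such as ary and arq.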
Cool! Yeah, that's what I thought was happening here, but I wasn't sure.
About the other links, hopefully regional Arabic sources, I'll send an update as soon as I get more info.
Regarding ary: According to an Arabic speaker, https://www.hespress.com/ (sitemap) might be a source for building a language corpus in Moroccan Arabic. My contact said that the Moroccan newspapers listed earlier on this bug are in Modern Standard Arabic, whereas some (but not all) comments on these sites are in Moroccan dialect.
hespress.com is definitely MSA (including most comments on the random articles I checked). Actually, you are unlikely to find any newspapers in local dialects; your best bet would be forums and the like, which are considered less “formal”.