corpuscrawler

Add (Modern Standard) Arabic language

Open behnam opened this issue 7 years ago • 9 comments

Is there any work being done regarding any Arabic dialects?

We can start with http://www.dw.com/ar/, which is Modern Standard Arabic. I think MSA is a good start, and we can add regional dialects later.

Please list here any source you think we should add, for MSA or regional dialects.

behnam avatar Oct 23 '17 23:10 behnam

Are these Modern Standard Arabic, too? Adding them would be a matter of 1 line each. http://www.bbc.com/arabic https://arabic.sputniknews.com/

brawer avatar Oct 24 '17 10:10 brawer

Seeds for crawling a language corpus in Moroccan Arabic (BCP47 language code ary):

  • http://archive.cawalisse.com/sitemap.xml
  • http://ahdath.info/ (no sitemap, but uses /category URLs, so util.find_wordpress_urls() might work)
  • https://www.aljamaa.net/ar/ (couldn’t find the sitemap, but it also uses /category URLs)
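
For seeds that do publish a sitemap, collecting the article URLs could look roughly like the sketch below. It uses only the Python standard library and is not the crawler's own helper code; the function name extract_sitemap_urls is hypothetical, and it assumes the sitemap follows the standard sitemaps.org schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org documents (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(sitemap_xml):
    """Return the URLs found in all <loc> elements of a sitemap.

    Hypothetical helper for illustration. It works for both <urlset>
    documents and <sitemapindex> documents, because both wrap their
    URLs in <loc> elements.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        # Skip empty <loc/> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls
```

Sites without a sitemap, such as the two WordPress-style ones above, would still need a different approach, for example walking the /category pages.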

brawer avatar Oct 24 '17 10:10 brawer

For Algerian Arabic (BCP47 language code arq), see http://www.onlinenewspapers.com/algeria.htm but I wouldn’t know if any of these are in Standard Arabic

brawer avatar Oct 24 '17 10:10 brawer

Yes, @brawer. These two are definitely both Standard Arabic: http://www.bbc.com/arabic https://arabic.sputniknews.com/

About the country-specific news services, I can't tell whether they are in local varieties of Standard Arabic or in the regional Arabic dialects. So I think we need to ask for some help reviewing them one by one.

behnam avatar Oct 24 '17 16:10 behnam

Actually, ar/ara is the macrolanguage, and we'd better use arb for Standard Arabic. That's not what websites do, but I think it's safe to assume that the content here is arb. What do you think?

behnam avatar Oct 24 '17 21:10 behnam

So far, I’ve tried to follow the BCP47 language tags as per Unicode conventions. There, macrolanguage codes stand for the individual language that “everyone” (a typical webmaster or programmer who isn’t deeply rooted in the internationalization scene) means when they see that code. For example, according to Unicode/ICU/CLDR, the code for Estonian is et instead of ekk; the code for Modern Standard Arabic is ar instead of arb; the code for Uzbek is uz instead of uzn; the code for Mandarin is zh; etc. For the full list, see the languageAlias data in CLDR.
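
As a minimal sketch of that convention, the mapping could be written as a small lookup table. Only the handful of pairs named above are included; the cmn entry for Mandarin is added from the CLDR languageAlias data rather than from this thread, and the names LANGUAGE_ALIASES and canonical_language_code are hypothetical, not part of any existing code.

```python
# Illustrative subset of CLDR languageAlias entries: ISO 639-3
# individual-language code -> the code preferred in BCP47 tags.
# This is not the full list; see the CLDR data for that.
LANGUAGE_ALIASES = {
    "ekk": "et",  # Estonian
    "arb": "ar",  # Modern Standard Arabic
    "uzn": "uz",  # Uzbek
    "cmn": "zh",  # Mandarin (pair taken from CLDR, not from this thread)
}


def canonical_language_code(code):
    """Map an aliased code to its preferred form; pass others through."""
    return LANGUAGE_ALIASES.get(code, code)
```

With this table, canonical_language_code("arb") gives "ar", while dialect codes such as ary and arq are returned unchanged because they have no alias.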

brawer avatar Oct 25 '17 06:10 brawer

Cool! Yeah, that's what I thought was happening here, but I wasn't sure.

About the other links, hopefully regional Arabic sources, I'll send an update as soon as I get more info.

behnam avatar Oct 25 '17 06:10 behnam

Regarding ary: According to an Arabic speaker, https://www.hespress.com/ (sitemap) might be a source for building a language corpus in Moroccan Arabic. My contact said that the Moroccan newspapers listed earlier on this bug are in Modern Standard Arabic, whereas some (but not all) comments on these sites are in Moroccan dialect.

brawer avatar Oct 25 '17 13:10 brawer

hespress.com is definitely MSA (including most comments on the random articles I checked). Actually, you are unlikely to find any newspapers in local dialects; your best bet would be forums and the like, which are considered less “formal”.

khaledhosny avatar Oct 28 '17 02:10 khaledhosny