The Danish Gigaword project
It’s hard to develop good tools for Danish NLP when no large, wide-coverage corpus is readily available. To address this, we're building a gigaword corpus: over a billion (10^9) words of Danish text. This is the homepage for the project. The overriding goals are to create a dataset that is (1) representative; (2) accessible; (3) a suitable “fixed point” for Danish NLP.
Licensing
To make the corpus accessible, all parts of it must be openly licensed for free distribution. Examples of suitable licenses are the Creative Commons public-domain dedication (CC0) and CC-BY.
Working paper
Details on the corpus are maintained at arXiv:2005.03521.
Breadth
Danish Gigaword should cover variation along a variety of dimensions, including:
- Time of authorship;
- Speech situation;
- Modality;
- Domain;
- Register;
- Age of utterer;
- Dialect of utterer;
- Socioeconomic status of utterer.
This is an intentionally strong departure from early editions of English Gigaword, which focused on newswire. Criterion (1) of the corpus, representativity, requires going beyond newswire; this is necessary if the corpus is to cover enough words and language uses to be general-purpose.
Timeline
We anticipate an initial release of the corpus in early 2021.
Contact
For info about joining the project, contact Leon Strømberg-Derczynski - [email protected]
Members:
- Leon Strømberg-Derczynski, ITU (lead)
- Rebekah Baglini, AU
- Morten H. Christiansen, Cornell / AU
- Manuel Ciosici, ITU / ISI, University of Southern California
- Jacob Aarup Dalsgaard, AU
- Riccardo Fusaroli, AU
- Peter Juel Henrichsen, Dansk Sprognævn
- Rasmus Hvingelby, Fraunhofer Institute
- Andreas Kirkedal, ITU / Interactions LLC
- Alex Speed Kjeldsen, KU
- Claus Ladefoged, TV2 Regionerne
- Finn Årup Nielsen, DTU
- Amalie Brogaard Pauli, Alexandra Instituttet
- Malte Lau Petersen, AU
- Jonathan Hvithamar Rystrøm, AU
- Daniel Varab, ITU / Novo Nordisk
Supporters:
- Jørg Asmussen, Det Danske Sprog- og Litteraturselskab
- Jens Dahl Møllerhøj, BotXO
- Bolette Sandford Pedersen, KU / Center for Sprogteknologi