bnl_ground_truth_newspapers_before_1878
A URL for this dataset
https://data.bnl.lu/data/historical-newspapers/
Dataset description
33,000 transcribed text lines from historical newspapers (before 1878) along with the cropped images of the original scans
- Text line based OCR
- 19,000 text lines in Antiqua
- 14,000 text lines in Fraktur
- Transcribed using double-keying (99.95% accuracy)
- Public Domain, CC0 (See copyright notice)
- Best for training an OCR engine
The newspapers used are:
- Le Gratis luxembourgeois (1857-1858)
- Luxemburger Volks-Freund (1869-1876)
- L'Arlequin (1848-1848)
- Courrier du Grand-Duché de Luxembourg (1844-1868)
- L'Avenir (1868-1871)
- Der Wächter an der Sauer (1849-1869)
- Luxemburger Zeitung (1844-1845)
- Luxemburger Zeitung = Journal de Luxembourg (1858-1859)
- Der Volksfreund (1848-1849)
- Cäcilia (1862-1871)
- Kirchlicher Anzeiger für die Diözese Luxemburg (1871-1878)
- L'Indépendance luxembourgeoise (1871-1878)
- Luxemburger Anzeiger (1856)
- L'Union (1860-1871)
- Diekircher Wochenblatt (1837-1848)
- Das Vaterland (1869-1870)
- D'Wäschfra (1868-1878)
- Luxemburger Bauernzeitung (1857)
- Luxemburger Wort (1848-1878)
Dataset modality
Mixed
Dataset licence
Creative Commons Public Domain Dedication and Certification
Other licence
No response
How can you access this data
As a download from a repository/website
Size of dataset
500MB-2GB
Confirm the dataset has an open licence
- [X] To the best of my knowledge, this dataset is accessible via an open licence
Contact details for data custodian
I transformed the original dataset slightly into jsonl and zipped the images
https://huggingface.co/ymaurer/bnl_ground_truth_newspapers_before_1878
Moved the dataset to the biglam organisation biglam/bnl_ground_truth_newspapers_before_1878
I think this was created as a model repository by mistake, so I've moved it to a dataset repository. It would also be good to write a loading script so the data is easier to load with the datasets library. I'll hopefully have some time to help with that later this week.
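Until a loading script exists, here is a minimal sketch of how the jsonl file and the zipped images could be read together using only the Python standard library. The layout is an assumption, not something confirmed in this thread: the field names `filename` and `text` and the paths in the usage note are hypothetical and should be adjusted to whatever the repository actually contains.

```python
import json
import zipfile


def iter_text_lines(jsonl_path, images_zip_path,
                    image_key="filename", text_key="text"):
    """Yield (image_bytes, transcription) pairs.

    Assumes one JSON object per line in the jsonl file, and that the
    value under `image_key` is the name of a member of the zip archive.
    Both key names are hypothetical defaults.
    """
    with zipfile.ZipFile(images_zip_path) as archive, \
            open(jsonl_path, encoding="utf-8") as records:
        for raw in records:
            raw = raw.strip()
            if not raw:
                continue  # tolerate blank lines
            record = json.loads(raw)
            # Read the cropped line image as raw bytes; decoding it
            # (e.g. with an imaging library) is left to the caller.
            yield archive.read(record[image_key]), record[text_key]
```

Usage would look something like `for image_bytes, text in iter_text_lines("lines.jsonl", "images.zip"): ...`, with both paths standing in for the real file names.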