
Create dataset british_library_hertiage_made_digital_newspapers

Open • albertvillanova opened this issue 3 years ago • 16 comments

  • uid: british_library_hertiage_made_digital_newspapers
  • type: primary
  • description:
    • name: British Library Heritage Made Digital Newspapers
    • description: This is a collection of copyright-cleared 19th Century newspapers held by the British Library.
    • homepage: https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3?locale=en
    • validated: True
  • languages:
    • language_names:
      • English
    • language_comments:
    • language_locations:
      • Northern Europe
      • United Kingdom
    • validated: False
  • custodian:
    • name: British Library Board
    • in_catalogue:
    • type: A library, museum, or archival institute
    • location: United Kingdom
    • contact_name: Daniel van Strien
    • contact_email: [email protected]
    • contact_submitter: False
    • additional: https://www.bl.uk/
    • validated: False
  • availability:
    • procurement:
      • for_download: Yes - it has a direct download link or links
      • download_url: https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3?locale=en
      • download_email:
    • licensing:
      • has_licenses: Yes
      • license_text: No Copyright - Other Known Legal Restrictions

        Use of this Item is not restricted by copyright and/or related rights. In one or more jurisdictions, laws other than copyright are known to impose restrictions on the use of this Item. Please refer to the organization that has made the Item available for more information.

        Notices:

        Unless expressly stated otherwise, the organization that has made this Item available makes no warranties about the Item and cannot guarantee the accuracy of this Rights Statement. You are responsible for your own use.
        You may find additional information about the copyright status of the Item on the website of the organization that has made the Item available.
        You may need to obtain other permissions for your intended use. For example, other rights such as publicity, privacy or moral rights may limit how you may use the material.
        

        DISCLAIMER: The purpose of this statement is to help the public understand how this Item may be used. When there is a (non-standard) License or contract that governs re-use of the associated Item, this statement only summarizes the effects of some of its terms. It is not a License, and should not be used to license your Work. To license your own Work, use a License offered at https://creativecommons.org/

      • license_properties:
        • public domain
      • license_list:
    • pii:
      • has_pii: Yes
      • generic_pii_likely: somewhat likely
      • generic_pii_list:
        • names
        • physical addresses
        • dates (birth, death, etc.)
      • numeric_pii_likely: none
      • numeric_pii_list:
      • sensitive_pii_likely: somewhat likely
      • sensitive_pii_list:
        • political opinions
        • trade-union membership
        • religious or philosophical beliefs
        • racial or ethnic origin
        • health-related data
        • data concerning a person's sex life or sexual orientation
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • source_category:
    • category_type: collection
    • category_web:
    • category_media: news articles
    • validated: False
  • media:
    • category:
      • text
    • text_format:
      • other
      • ALTO XML
    • audiovisual_format:
    • image_format:
      • .TIFF
    • database_format:
    • text_is_transcribed: Yes - image
    • instance_type: A year of publications for a newspaper title
    • instance_count: 100<n<1K
    • instance_size: 100<n<10,000
    • validated: False
  • fname: british_library_hertiage_made_digital_newspapers.json

albertvillanova commented on Nov 23 '21

#self-assign

cakiki commented on Nov 29 '21

@cakiki give me a shout if you want any help with this. I'm quite familiar with this dataset :)

davanstrien commented on Dec 01 '21

@davanstrien You've already helped a lot with your script, which I used to download all the data. I'm currently uploading all the .zip files to the Hub, which will probably take a while.

(For the record the download script is the following: https://github.com/Living-with-machines/hmd_newspaper_dl)
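For anyone following along, here is a rough sketch of how that upload step can be scripted with huggingface_hub; the local directory name is a placeholder, not the exact path used:

from pathlib import Path
from huggingface_hub import HfApi

api = HfApi()
repo_id = "bigscience-catalogue-data/british_library_heritage_made_digital_newspapers"

# Upload each local .zip archive into the dataset repo, one file at a time.
for zip_path in sorted(Path("hmd_newspapers").glob("*.zip")):  # placeholder directory
    api.upload_file(
        path_or_fileobj=str(zip_path),
        path_in_repo=zip_path.name,
        repo_id=repo_id,
        repo_type="dataset",
    )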

cakiki commented on Dec 01 '21

https://huggingface.co/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers

albertvillanova commented on Dec 03 '21

Done

cakiki commented on Dec 03 '21

Thanks a lot @cakiki!!!

I just left a comment to address this issue later:

This dataset takes too long to load because of data format inference. This is caused by the ZIP compression and would be fixed if the files were compressed with gzip instead.

from datasets import load_dataset

ds_name = "bigscience-catalogue-data/british_library_heritage_made_digital_newspapers"
ds = load_dataset(ds_name, split="train", streaming=True, use_auth_token=True)

@lhoestq, maybe we should warn about this in the docs?

albertvillanova commented on Dec 03 '21

The dataset came zipped. Should I convert everything to gzip?

Side question: what compression level would you recommend?
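For reference, a minimal sketch of what such a conversion could look like using only the standard library; the paths are placeholders, and the compression level of 6 is just a middle ground between speed and size, not a recommendation from this thread:

import gzip
import zipfile
from pathlib import Path

def zip_to_gzip(zip_path: str, out_dir: str, compresslevel: int = 6) -> None:
    """Re-compress every member of a ZIP archive as an individual .gz file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            if member.endswith("/"):  # skip directory entries
                continue
            target = out / (Path(member).name + ".gz")
            # Reads each member fully into memory; fine for modest files,
            # but switch to chunked copying for very large ones.
            with zf.open(member) as src, gzip.open(target, "wb", compresslevel=compresslevel) as dst:
                dst.write(src.read())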

cakiki commented on Dec 04 '21

The dataset looks fine as ZIP; maybe we could optimize the data format inference so that it doesn't have to iterate over every single zip file. We could decide on a maximum number of files (possibly inside archives) to check, for example? WDYT @albertvillanova?
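Illustratively, the idea would be something like the following; this is a hypothetical sketch of capping extension-based inference, not the actual datasets implementation:

from collections import Counter
from pathlib import Path

MAX_FILES_TO_CHECK = 50  # hypothetical cap; the real value would be a design decision

def infer_format(data_files):
    """Guess the data format from at most MAX_FILES_TO_CHECK file extensions."""
    extensions = Counter(
        Path(f).suffix.lstrip(".").lower() for f in data_files[:MAX_FILES_TO_CHECK]
    )
    return extensions.most_common(1)[0][0] if extensions else None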

lhoestq commented on Dec 06 '21

PR to fix the issue of taking too long to iterate over all data files:

  • huggingface/datasets#3407

albertvillanova commented on Dec 08 '21

Need support for ZIP:

  • huggingface/datasets#3375

from datasets import load_dataset

ds = load_dataset("bigscience-catalogue-data/british_library_heritage_made_digital_newspapers", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))

albertvillanova commented on Dec 13 '21

ERROR:

FileNotFoundError: Couldn't find a dataset script at huggingface/datasets/bigscience-catalogue-data/british_library_heritage_made_digital_newspapers/british_library_heritage_made_digital_newspapers.py or any data file in the same directory. Couldn't find 'bigscience-catalogue-data/british_library_heritage_made_digital_newspapers' on the Hugging Face Hub either: FileNotFoundError: No data files or dataset script found in bigscience-catalogue-data/british_library_heritage_made_digital_newspapers

albertvillanova commented on Dec 16 '21

I think the loading script should parse the XML files.

CC: @davanstrien

albertvillanova commented on Jan 24 '22

I think the loading script should parse the XML files.

CC: @davanstrien

I have a WIP script for this. If it's helpful, I can share it. I am also working with some colleagues to get a plain-text version of this dataset onto the BL repository, but that will take a bit longer to get ready.

davanstrien commented on Jan 24 '22

Great @davanstrien !

You can do as you prefer... Maybe the fastest would be to share the script (so the data is available internally for the BigScience project). Eventually, you could make the script publicly available, either as a community dataset (in your org) or as a canonical dataset (by opening a Pull Request in the library)...

albertvillanova commented on Jan 24 '22

Great - I will try to get the script finished today for use in BigScience. I might then hold off on a public script until we have the plain-text version of the data available, since that will be quicker to parse.

davanstrien commented on Jan 24 '22

@albertvillanova, sorry this took a bit longer. I did write a loading script, but because the XML processing is relatively slow for this data, the loading script was very slow and I think it would cause issues. I therefore pre-processed the data to extract the plain text and some minimal metadata. This is currently pushed to my HF Hub (https://huggingface.co/datasets/davanstrien/hmd_newspapers)
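
To give a sense of the extraction step, here is a rough sketch of pulling plain text out of a single ALTO XML page; the loose namespace matching is deliberate, since the exact ALTO schema version of the BL files is an assumption here:

import xml.etree.ElementTree as ET

def alto_to_text(path: str) -> str:
    """Join the CONTENT attributes of ALTO <String> elements into plain text."""
    root = ET.parse(path).getroot()
    # Match <String> elements regardless of the ALTO namespace version.
    words = [
        el.attrib.get("CONTENT", "")
        for el in root.iter()
        if isinstance(el.tag, str) and el.tag.split("}")[-1] == "String"
    ]
    return " ".join(w for w in words if w)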

Currently, each row represents one article in the newspaper. Since articles are detected from the digitised image by an imperfect OCR segmentation tool, they are not always semantically meaningful; in particular, this can lead to very short or very long articles. That could be dealt with fairly easily later on, but I could also push a page-level version of the data if that would be more efficient for training (the text of each example would then be much longer).

If you are happy with either of these approaches, I can transfer the dataset from my Hub to the BigScience space.

davanstrien commented on Jan 27 '22