Add dataset: old_book_illustrations

Open giganttheo opened this issue 3 years ago • 14 comments

A URL for this dataset

https://www.oldbookillustrations.com/

Dataset description

The Old Book Illustrations website contains a dataset of illustrations scanned from old books. Each illustration page also contains information about the illustrator, the illustration, and the book it is taken from, as well as a title, a description, and a few keywords. As of today, the website contains 3150 images.

I already wrote a script to scrape all the content, since the API does not give access to all the information (for instance, the image is not in the best resolution).

Is it a dataset that is relevant for this project?

About the license, the website reads:

  • Text content (descriptions, translations, etc.) is published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • Although we do our best to offer only Illustrations that are considered public domain in most countries, copyright laws vary from one jurisdiction to another, and you agree that you are solely responsible for abiding by all laws and regulations that may be applicable to using the Illustrations.

More info on the terms of use page.

Dataset modality

Image

Dataset licence

Creative Commons Public Domain Dedication and Certification

Other licence

No response

How can you access this data

Other

Confirm the dataset has an open licence

  • [X] To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

giganttheo avatar Jul 22 '22 07:07 giganttheo

@giganttheo, thanks for suggesting this. I think it's a super interesting dataset, but I have a few questions about how we could access this dataset. On the terms of use they say:

You are welcome to download as many pictures as you wish, with no restriction in time or quantity; but we do not approve of the use of offline browsing software, or website downloaders, such as HTTRack, WebReaper, etc, due to the heavy load they put on the server. Please don’t use them.

I think a scraping script would likely fall under this category. I suggest that it might be worth reaching out to the website creators to ask if they would be keen to contribute a dataset derived from the site. It may also be possible to get some additional metadata about the items. In particular, it would be helpful to have a citation to the source for each image included in the dataset, so it's possible to confirm the copyright status of those items if needed.

WDYT?

davanstrien avatar Jul 25 '22 10:07 davanstrien

Yeah, you are right. I just sent an email to the contact address from the website, to ask for a mirror or a special authorization to use a scraping tool. I'll update this if I have a response.

giganttheo avatar Jul 26 '22 14:07 giganttheo

Thanks! It would be great to have this available, so hopefully they are keen :)

davanstrien avatar Jul 26 '22 14:07 davanstrien

Update, I received a response from the webmaster:

Hi Théo, Thank you for your interest in oldbookillustrations.com. We do indeed restrict bulk downloads, simply because we're on fairly cheap hosting and are concerned the server might not withstand the strain of full-on pounding that's often involved in site-scraping. Having said that, I have no objection to allowing you a one-time access to get the data needed for your project, provided we can agree on "gentle" settings for your scraper. I would imagine that allowing around 3 to 5 seconds between the download of each image file, and a pause of about 10 seconds every ten downloads would be safe. I'm not sure what else you would need and how many hits that would generate, but the same amount of caution would be in order. Even better if the bulk of the activity can take place between 3 am and 7 am GMT. The total weight of the available image files is around 8 Go. Hope this helps.
With best regards,
Harvey Livet,
Webmaster of oldbookillustrations.com

I will use a scraping tool that respects the agreed restrictions, and then upload the dataset to the hub!
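
For illustration only, here is a minimal Python sketch of what the "gentle" settings described in the email could look like. It is not the script actually used for this dataset. The names `polite_delay`, `in_quiet_window` and `download_all` are hypothetical, `fetch` is whatever download function the caller supplies, and reading "a pause of about 10 seconds every ten downloads" as an extra 10 seconds on top of the usual wait is an assumption.

```python
import random
import time
from datetime import datetime, timezone


def polite_delay(index, rng=random):
    """Seconds to wait after the download at 1-based position `index`."""
    delay = rng.uniform(3.0, 5.0)  # 3 to 5 seconds between image downloads
    if index % 10 == 0:
        delay += 10.0  # assumed: extra pause of about 10 s every ten downloads
    return delay


def in_quiet_window(now=None):
    """True if `now` falls between 03:00 and 07:00 GMT (the preferred window)."""
    now = now or datetime.now(timezone.utc)
    return 3 <= now.astimezone(timezone.utc).hour < 7


def download_all(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping politely between requests."""
    results = []
    for i, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if i < len(urls):  # no need to wait after the last download
            sleep(polite_delay(i))
    return results
```

Passing `sleep` and `fetch` as arguments keeps the pacing logic separate from the HTTP code, so the schedule can be checked without hitting the server. A caller could also check `in_quiet_window()` before starting a run.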

giganttheo avatar Jul 27 '22 08:07 giganttheo

Awesome, the only other thing I would check is that when you download the images we can get sufficient metadata for each image to verify the licence/copyright. What information is downloaded at the moment?

davanstrien avatar Jul 27 '22 08:07 davanstrien

We have access to: the artist name (of the illustration), the engravers, the book and author (of the book), as well as the source of the illustration, and the Open Library record. For instance, you can check this page: https://www.oldbookillustrations.com/illustrations/pula-temple-augustus/

All the information that is shown on this page can be scraped. (I prefer scraping the page over downloading the "json record", which lacks some image sizes and the keywords, for instance.)

giganttheo avatar Jul 27 '22 08:07 giganttheo

The source reads:

The New York Public Library believes that this item is in the public domain under the laws of the United States, but did not make a determination as to its copyright status under the copyright laws of other countries.

giganttheo avatar Jul 27 '22 08:07 giganttheo

Great, that looks good. I think if we can include the source information/URL that would be great. My own preference would also be to include as much information as possible for each image, since it may be useful for someone working with the data. WDYT?

davanstrien avatar Jul 27 '22 08:07 davanstrien

Last night, I scraped the pages from the website, by following the restrictions agreed upon. This is the resulting dataset, stored on the hub: https://huggingface.co/datasets/gigant/oldbookillustrations_2

Do you think any other information might be interesting? It contains most of the data from the pages, including the URLs and the sources.

If that's ok for you, I can add it to the BigLAM org, and create a comprehensive dataset card.

giganttheo avatar Jul 28 '22 08:07 giganttheo

This looks amazing! Happy for you to move to the BigLAM org -- let me know if you want any help with the dataset card.

davanstrien avatar Jul 28 '22 15:07 davanstrien

Update:

Here is the dataset on the hub, with a comprehensive dataset card: https://huggingface.co/datasets/biglam/oldbookillustrations

Let me know what you think can be improved.

giganttheo avatar Aug 18 '22 08:08 giganttheo

Thanks so much for this. Having given this a bit more thought, I think it probably makes sense to try and filter out the items which may have copyright issues. I think the best way to do this would be to filter out based on the artist_date field and set a reasonably conservative threshold. My suggestion would be to push this to a new dataset and keep the other one private. This means we can use the current version to update the cleaned version in the future.

@giganttheo WDYT?

davanstrien avatar Aug 18 '22 08:08 davanstrien

Suggested approach in this notebook: https://gist.github.com/davanstrien/e34e239cbf792057f79e2e2162d1e4b1
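
The notebook itself is not reproduced here. As an independent, minimal sketch of the idea (filter on the artist_date field with a conservative threshold), something like the following could work. It assumes each record is a dict whose `artist_date` value is a string containing a four-digit year; the cutoff of 1850 and the helper names `first_year` and `keep_record` are placeholders for illustration, not values taken from the thread, and the cutoff is not a legal determination.

```python
import re

# Matches the first four-digit year between 1000 and 2099 in a string.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def first_year(text):
    """Return the first four-digit year found in `text`, or None."""
    match = YEAR_RE.search(text or "")
    return int(match.group(1)) if match else None


def keep_record(record, cutoff_year=1850):
    """Keep a record only if its artist_date year is known and not after the cutoff.

    Records with a missing or unparseable date are dropped, which is the
    conservative choice when the goal is to avoid copyright problems.
    """
    year = first_year(record.get("artist_date"))
    return year is not None and year <= cutoff_year


# Made-up sample records, for illustration only.
records = [
    {"title": "a", "artist_date": "1805"},
    {"title": "b", "artist_date": "1901"},
    {"title": "c", "artist_date": None},
]
kept = [r for r in records if keep_record(r)]
```

The same predicate could be passed to the `filter` method of a Hugging Face `datasets.Dataset`, with the filtered result pushed as the new public dataset.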

davanstrien avatar Aug 18 '22 09:08 davanstrien

I agree that it would be nice to make sure the version we share does not have any copyright infringement issues. However, from my understanding, checking whether a work is in the public domain might not be as straightforward as the filter you set up, since it depends on whether the work was published or not, as well as on the country of origin. According to the "public domain" Wikipedia page:

Determination of whether a copyright has expired depends on an examination of the copyright in its source country.

I think adding a filter to know if an artwork is public domain is a good idea, but it will require a lot more work: the dataset would need to include some information I left out (I only included the artist's birth date, for instance, but really it's the death date that matters more in that case), and we need some basics of copyright law for the source countries of the artworks.

In my opinion, we could keep a complete version available, with a warning about copyright issues at first, and then when a new version is ready we could add another one with the public domain artworks only. What do you think?

I will investigate this when I have some time.

giganttheo avatar Aug 22 '22 09:08 giganttheo