Add dataset: old_book_illustrations
A URL for this dataset
https://www.oldbookillustrations.com/
Dataset description
The Old Book Illustrations website contains a collection of illustrations scanned from old books. Each illustration page also contains information about the illustrator, the illustration, and the book it was taken from, as well as a title, a description, and a few keywords. As of today, the website contains 3150 images.
I already wrote a script to scrape all the content, since the API does not give access to all the information (for instance, the image is not in the best resolution).
Is it a dataset that is relevant for this project?
About the license, the website reads:
- Text content (descriptions, translations, etc.) is published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- Although we do our best to offer only Illustrations that are considered public domain in most countries, copyright laws vary from one jurisdiction to another, and you agree that you are solely responsible for abiding by all laws and regulations that may be applicable to using the Illustrations.
More info on the terms of use page.
Dataset modality
Image
Dataset licence
Creative Commons Public Domain Dedication and Certification
Other licence
No response
How can you access this data
Other
Confirm the dataset has an open licence
- [X] To the best of my knowledge, this dataset is accessible via an open licence
Contact details for data custodian
No response
@giganttheo, thanks for suggesting this. I think it's a super interesting dataset, but I have a few questions about how we could access this dataset. On the terms of use they say:
You are welcome to download as many pictures as you wish, with no restriction in time or quantity; but we do not approve of the use of offline browsing software, or website downloaders, such as HTTRack, WebReaper, etc, due to the heavy load they put on the server. Please don’t use them.
I think a scraping script would likely fall under this category. I suggest that it might be worth reaching out to the website creators to ask if they would be keen to contribute a dataset derived from the site. It may also be possible to get some additional metadata about the items. In particular, it would be helpful to have a citation to the source for each image included in the dataset, so it's possible to confirm the copyright status of those items if needed.
WDYT?
Yeah, you are right. I just sent an email to the contact address from the website, to ask for a mirror or a special authorization to use a scraping tool. I'll update this if I get a response.
Thanks! It would be great to have this available, so hopefully they are keen :)
Update, I received a response from the webmaster:
Hi Théo, Thank you for your interest in oldbookillustrations.com. We do indeed restrict bulk downloads, simply because we're on fairly cheap hosting and are concerned the server might not withstand the strain of full-on pounding that's often involved in site-scraping. Having said that, I have no objection to allowing you a one-time access to get the data needed for your project, provided we can agree on "gentle" settings for your scraper. I would imagine that allowing around 3 to 5 seconds between the download of each image file, and a pause of about 10 seconds every ten downloads would be safe. I'm not sure what else you would need and how many hits that would generate, but the same amount of caution would be in order. Even better if the bulk of the activity can take place between 3 am and 7 am GMT. The total weight of the available image files is around 8 Go. Hope this helps.
With best regards,
Harvey Livet,
Webmaster of oldbookillustrations.com
I will use a scraping tool that respects the restrictions agreed upon, and then upload the dataset to the hub!
Awesome, the only other thing I would check is that when you download the images we can get sufficient metadata for each image to verify the licence/copyright. What information is downloaded at the moment?
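For reference, the "gentle" settings from the webmaster's email could be sketched as below. This is only an illustration, not the script that was actually run: `polite_download`, `delay_after`, `in_quiet_window`, and the `fetch` callable are hypothetical names, the 10-second pause is added on top of the regular delay (a slightly more conservative reading of the email), and GMT is treated as UTC.

```python
import random
import time
from datetime import datetime, timezone


def in_quiet_window(now, start_hour=3, end_hour=7):
    """Return True if `now` (timezone-aware) falls between 3 am and 7 am UTC."""
    return start_hour <= now.astimezone(timezone.utc).hour < end_hour


def delay_after(download_count, rng=random):
    """Seconds to wait after the `download_count`-th download (1-based).

    3 to 5 seconds between files, plus an extra 10 seconds every ten downloads.
    """
    delay = rng.uniform(3.0, 5.0)
    if download_count % 10 == 0:
        delay += 10.0
    return delay


def polite_download(urls, fetch, sleep=time.sleep):
    """Call the hypothetical `fetch(url)` for each URL, pausing between calls."""
    results = []
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if count < len(urls):  # no need to wait after the last file
            sleep(delay_after(count))
    return results
```

A caller could check `in_quiet_window(datetime.now(timezone.utc))` before starting, and pass whatever download function it uses as `fetch`.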
We have access to: the artist name (of the illustration), the engravers, the book and author (of the book), as well as the source of the illustration, and the Open Library record. For instance, you can check this page: https://www.oldbookillustrations.com/illustrations/pula-temple-augustus/
All the information that is shown on this page can be scraped. (I prefer scraping the page to downloading the "json record", which lacks some image sizes and the keywords, for instance.)
The source reads:
The New York Public Library believes that this item is in the public domain under the laws of the United States, but did not make a determination as to its copyright status under the copyright laws of other countries.
Great, that looks good. I think if we can include the source information/URL that would be great. My own preference would also be to include as much information as possible for each image, since it may be useful for someone working with the data. WDYT?
Last night, I scraped the pages from the website, following the restrictions agreed upon. This is the resulting dataset, stored on the hub: https://huggingface.co/datasets/gigant/oldbookillustrations_2
Do you think any other information might be interesting? It contains most of the data from the pages, including the URLs and the sources.
If that's ok for you, I can add it to the BigLAM org, and create a comprehensive dataset card.
This looks amazing! Happy for you to move to the BigLAM org -- let me know if you want any help with the dataset card.
Update:
Here is the dataset on the hub, with a comprehensive dataset card: https://huggingface.co/datasets/biglam/oldbookillustrations
Let me know what you think can be improved.
Thanks so much for this. Having given this a bit more thought, I think it probably makes sense to try and filter out the items which may have copyright issues. I think the best way to do this would be to filter out based on the artist_date field and set a reasonably conservative threshold. My suggestion would be to push this to a new dataset and keep the other one private. This means we can use the current version to update the cleaned version in the future.
@giganttheo WDYT?
Suggested approach in this notebook: https://gist.github.com/davanstrien/e34e239cbf792057f79e2e2162d1e4b1
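As a rough illustration of the kind of filter described above (not a copy of the notebook), a conservative threshold on `artist_date` might look like the sketch below. It assumes that `artist_date` is a free-text string containing one or more four-digit years, which may not match the real field; the helper names and the `cutoff_year` default are placeholders.

```python
import re

# Matches four-digit years from 1000 to 2099 (assumed format of `artist_date`).
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def latest_year(artist_date):
    """Return the latest year mentioned in `artist_date`, or None if there is none."""
    if not artist_date:
        return None
    years = [int(y) for y in YEAR_PATTERN.findall(artist_date)]
    return max(years) if years else None


def is_probably_safe(record, cutoff_year=1850):
    """Keep a record only if its latest known year is at or before the cutoff.

    Records with a missing or unparsable date are dropped, to stay conservative.
    The default cutoff is a placeholder, not a recommendation.
    """
    year = latest_year(record.get("artist_date"))
    return year is not None and year <= cutoff_year
```

With the `datasets` library, a predicate like this could be passed to `Dataset.filter` to build the cleaned version while the full version stays private.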
I agree that it would be nice to make sure the version we share does not have any copyright infringement issues. However, from my understanding, checking whether a work is in the public domain might not be as straightforward as the filter you set up, since it depends on whether the work was published or not, as well as on the country of origin. According to the "public domain" Wikipedia page:
Determination of whether a copyright has expired depends on an examination of the copyright in its source country.
I think adding a filter to know if an artwork is in the public domain is a good idea, but it will require a lot more work: the dataset should include some information I left out (I only included the artist's birth date, for instance, but it is really the death date that matters in this case), and we need some basics of copyright law for the source countries of the artworks.
In my opinion, we could keep a complete version available at first, with a warning about copyright issues, and then, when a new version is ready, add another one with the public domain artworks only. What do you think?
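To make the idea concrete, a check based on the artist's death year and a per-country term could be sketched as follows. Everything here is an assumption: the function name is hypothetical, the country codes and term lengths in the table are placeholders (not real legal values), and the rule "protection ends at the end of the calendar year, `term` years after death" is a simplification that ignores publication status and other exceptions.

```python
from datetime import date

# Placeholder table: country code -> years of protection after the artist's death.
# These values are illustrative only and would need to be filled in and verified.
TERM_YEARS_AFTER_DEATH = {"AA": 70, "BB": 50}


def may_be_public_domain(death_year, country, today_year=None,
                         terms=TERM_YEARS_AFTER_DEATH):
    """Return True/False when the rule can be applied, or None when data is missing."""
    if death_year is None or country not in terms:
        return None  # unknown: not enough information to decide
    if today_year is None:
        today_year = date.today().year
    # Simplified rule: public domain from 1 January of the year after the term ends.
    return today_year > death_year + terms[country]
```

Returning `None` for unknown cases keeps them distinguishable from a definite "no", so they can be reviewed by hand or excluded from a public-domain-only version.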
I will investigate this when I have some time.