public backup?
What worries me a bit at the moment is that I am the only person who has a full backup of all the data. I do not expect to die anytime soon, but let's face it: you never know when you might get run over by a bus.
I want to use this ticket to brainstorm how we could solve that.
Are there any public services that help open source projects with their backup?
I think the easiest solution would be to create a full dump of the database (excluding any personal information such as email addresses, usernames, and hashed passwords) together with an archive containing all the images.
Maybe archive.org can help here? Or do they only archive data they can collect via their own crawler?
There is indeed the possibility to upload custom data to archive.org.
I guess we could use the ia command-line client to do that (see: http://internetarchive.readthedocs.io/en/latest/cli.html#upload)
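To make the upload idea a bit more concrete, here is a minimal sketch that only assembles the `ia upload` command line without running it. The item identifier, the file name, and the metadata values are placeholders I made up for illustration; the `ia upload <identifier> <file>... --metadata="key:value"` shape follows the CLI documentation linked above.

```python
import shlex


def build_ia_upload_command(identifier, files, metadata):
    """Return the argv list for an `ia upload` call (nothing is executed)."""
    command = ["ia", "upload", identifier, *files]
    for key, value in metadata.items():
        # The ia client takes one --metadata="key:value" option per field.
        command.append(f"--metadata={key}:{value}")
    return command


if __name__ == "__main__":
    argv = build_ia_upload_command(
        "example-dataset-backup",      # hypothetical item identifier
        ["backup-2020-01-02.zip"],     # hypothetical archive file
        {"mediatype": "data"},         # assumed metadata for a data dump
    )
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(argv))
```

Once the client is installed and set up with `ia configure`, the returned list could be handed to `subprocess.run`; keeping the command construction separate makes it easy to check what would be uploaded first.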
Short update: this one is still pretty important to me.
I can't help it, but it just feels wrong to me to tell people "let's create a public, open source image dataset together" and then host all the data on the server of some guy that nobody knows. ;) What if the server goes down or the guy disappears?
Just to be clear: I am not planning to take the service down, it's just that I don't want to be the only person who has a backup of the data. Over time, I've invested a lot of time in other people's projects (contributed something, wrote reviews, added stuff, ...). Some of them are still alive, others are gone. Most of the time, when a project disappeared, all the data was gone as well.
If people are contributing their valuable time to this project, I think it's my obligation to ensure that there exists a proper backup of the data.
If there are no objections, I would propose the following:
The data will be periodically zipped and uploaded to archive.org. The archive will contain the images and the database dump. (I guess we can do that as long as the archive has a reasonable size; after that it's probably better to switch to incremental backups or something different, maybe a mirror server?)
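A rough sketch of the zipping step, assuming the images live in one directory tree and the database dump is a single file that has already been sanitized. The function name, the `backup-YYYY-MM-DD.zip` naming scheme, and the `images/` prefix inside the archive are my own assumptions, not something that exists in the project.

```python
import datetime
import zipfile
from pathlib import Path


def create_backup_archive(image_dir, dump_file, output_dir, today=None):
    """Zip all images plus the database dump into one dated archive."""
    today = today or datetime.date.today()
    archive_path = Path(output_dir) / f"backup-{today.isoformat()}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the database dump at the archive root.
        archive.write(dump_file, arcname=Path(dump_file).name)
        # Store every image under images/, keeping its relative path.
        for path in sorted(Path(image_dir).rglob("*")):
            if path.is_file():
                relative = Path("images") / path.relative_to(image_dir)
                archive.write(path, arcname=relative.as_posix())
    return archive_path
```

The output directory should sit outside the image directory, otherwise the archive would try to include itself. A cron job or similar scheduler could call this function and then upload the resulting file.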
The archive does not include:
- email addresses
- hashed passwords
- API tokens
- access tokens
- still locked donations
- donations that were put in quarantine
- trending labels that aren't productive yet
Furthermore, in every database dump, all usernames will be replaced with random names (to preserve users' privacy, in case someone uses their real name as a username).
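To illustrate the exclusion and renaming rules, here is a small sketch that works on rows represented as plain dictionaries. The column names (`email`, `hashed_password`, `api_token`, `access_token`, `username`) are assumptions for illustration only; the real schema may look different, and the filtering of locked or quarantined donations is not covered here.

```python
import secrets

# Hypothetical column names; adjust to the actual database schema.
SENSITIVE_COLUMNS = {"email", "hashed_password", "api_token", "access_token"}


def pseudonymize_users(rows):
    """Drop sensitive columns and replace each username with a random name."""
    pseudonyms = {}  # original username -> random replacement
    sanitized = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k not in SENSITIVE_COLUMNS}
        name = row["username"]
        if name not in pseudonyms:
            # A fresh random name is generated on every run of the export.
            pseudonyms[name] = "user-" + secrets.token_hex(8)
        clean["username"] = pseudonyms[name]
        sanitized.append(clean)
    return sanitized
```

Within one dump, the same user always maps to the same random name, so contributions by one person stay grouped together without revealing who that person is.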
Would that be OK with you, @dobkeratops?
public backup should be fine. cc0