How to perform delta updates
Hey - great documentation resource about using the Open Library data dumps, but just curious how to perform updates on the database once new dumps are available.
Hi
This is a good question, and sorry for the late reply!
To be honest, I think the most practical way of doing updates for now would just be to regenerate the entire database and swap it in for the old one. Costly, though, given the size of the database.
However, if the data dumps include an updated datetime for each record, the database could presumably store the date of the last import, and a process could then run over the dump files to pick out only the records updated since that date and either insert or update them (not sure about deletions!).
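To make that concrete, here is a rough Python sketch of the insert-or-update (upsert) step. It assumes the usual dump layout as I understand it (tab-separated columns: type, key, revision, last_modified, then the JSON record) and a hypothetical SQLite table called `records`; the file name, cutoff date, and schema are all placeholders, not anything the dumps actually ship with.

```python
# Sketch only: upsert dump records whose last_modified is newer than the
# previous import. Assumes tab-separated dump lines of the form
#   type <TAB> key <TAB> revision <TAB> last_modified <TAB> JSON
# and a hypothetical SQLite table; adjust to your real schema.
import gzip
import sqlite3
from datetime import datetime

DUMP_PATH = "ol_dump_latest.txt.gz"       # placeholder file name
LAST_IMPORT = datetime(2024, 1, 1)        # datetime of the previous import

conn = sqlite3.connect("openlibrary.db")  # placeholder database
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        key TEXT PRIMARY KEY,
        type TEXT,
        revision INTEGER,
        last_modified TEXT,
        data TEXT
    )
""")

with gzip.open(DUMP_PATH, "rt", encoding="utf-8") as dump:
    for line in dump:
        rec_type, key, revision, last_modified, data = line.rstrip("\n").split("\t", 4)
        # Skip anything not touched since the last import.
        if datetime.fromisoformat(last_modified) <= LAST_IMPORT:
            continue
        # Insert new records, overwrite existing ones (upsert).
        conn.execute(
            "INSERT INTO records (key, type, revision, last_modified, data) "
            "VALUES (?, ?, ?, ?, ?) "
            "ON CONFLICT(key) DO UPDATE SET "
            "type = excluded.type, revision = excluded.revision, "
            "last_modified = excluded.last_modified, data = excluded.data",
            (key, rec_type, int(revision), last_modified, data),
        )

conn.commit()
conn.close()
```

The `ON CONFLICT ... DO UPDATE` form needs SQLite 3.24+; Postgres and MySQL have their own upsert syntax, but the idea is the same.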
I'll look into it. If anyone has thoughts, feel free to chip in though!
My $0.02
With a bit of legwork, the files can be transformed into valid JSON by dropping the first few columns (the dump lines are tab-separated, with the JSON record in the last column). From there, jq can probably be used to identify keys whose last_modified falls after some datetime. Alternatively, regex or sed could be used to pick out the relevant lines and process them programmatically to work out which objects have been updated.
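In the same spirit, here is a small Python sketch of that filter step (roughly what a cut + jq pipeline would do). The file names and cutoff date are placeholders, and the tab-separated layout is an assumption about the dump format.

```python
# Sketch only: strip the leading columns and keep just the JSON for records
# modified after a cutoff date - roughly what a cut + jq pipeline would do.
# File names and the cutoff date are placeholders.
import gzip
import json
from datetime import datetime

CUTOFF = datetime(2024, 1, 1)

with gzip.open("ol_dump_latest.txt.gz", "rt", encoding="utf-8") as dump, \
        open("changed_records.jsonl", "w", encoding="utf-8") as out:
    for line in dump:
        # Columns: type, key, revision, last_modified, JSON record.
        _type, _key, _rev, last_modified, data = line.rstrip("\n").split("\t", 4)
        if datetime.fromisoformat(last_modified) > CUTOFF:
            # The last column is already valid JSON; re-emit one object per line.
            out.write(json.dumps(json.loads(data)) + "\n")
```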
Catching deletions is a bit trickier - you effectively have to ask, for every object in the database, "does it still exist in the dump?". If there were a way to get this data from upstream (e.g. a changelog) it could make it a little easier.
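For completeness, a rough way to express that check in Python, using the same hypothetical table and dump file as the earlier sketches. Note it holds every dump key in memory, which is fine as a sketch but heavy for a full dump; a real run might load the keys into a temporary table and diff there instead.

```python
# Sketch only: deletion detection by comparing keys in the database against
# keys present in the latest full dump. Anything in the database but absent
# from the dump has presumably been deleted upstream. Table and file names
# are placeholders matching the earlier sketches.
import gzip
import sqlite3

dump_keys = set()
with gzip.open("ol_dump_latest.txt.gz", "rt", encoding="utf-8") as dump:
    for line in dump:
        # The key is the second tab-separated column.
        dump_keys.add(line.split("\t", 2)[1])

conn = sqlite3.connect("openlibrary.db")
db_keys = {row[0] for row in conn.execute("SELECT key FROM records")}

deleted = db_keys - dump_keys
for key in deleted:
    conn.execute("DELETE FROM records WHERE key = ?", (key,))
conn.commit()
conn.close()

print(f"Removed {len(deleted)} records no longer present in the dump")
```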