Possibly track full status of files in sqlite log

Open · ianmcorvidae opened this issue · 3 comments

Currently, the sqlite log file essentially stores two states: unprocessed (the file is not in the DB) and processed (the file is in the DB), plus an optional reason the file was marked processed, which roughly amounts to a third 'failed' state.

This is sufficient for the log, and for keeping track that a particular file failed due to an extractor error (and thus should be retried, though at present it will only be retried if manually deleted from the database). However, it might be useful to keep track of a more complete state, possibly along the lines of the following (a rough code sketch appears after the list):

  • check for the file in the DB; if it is marked currently processing (perhaps with a PID and/or timestamp to check against for validity), or marked completed/failed (perhaps including the essentia build sha and possibly a timestamp, to check for things that should be retried), go to the next file
  • if not, mark the file as processing (by this PID, or at this timestamp); perhaps also store a hash of the file so that later changes to it can be detected
  • check for an MBID; if the file has none, record it in a "failed, no MBID" state
  • otherwise process the file with essentia; when done, mark it completed but not submitted (maybe recording the temporary filename the features are in), or mark it failed if applicable
  • submit to the server, delete the temporary file, mark the file completed
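
To make this concrete, here is a minimal sketch of what such a table and its transitions could look like. The table name, column names, state names, and helpers are all hypothetical, made up for illustration; nothing here reflects the client's current schema.

```python
import sqlite3
import time

# Illustrative state names for the proposal above; not the client's actual states.
STATES = ("processing", "failed-no-mbid", "failed-extractor",
          "pending-submission", "completed")

def open_log(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS filelog (
            path          TEXT PRIMARY KEY,
            state         TEXT NOT NULL,
            pid           INTEGER,
            started_at    REAL,
            file_hash     TEXT,
            extractor_sha TEXT,
            features_path TEXT
        )""")
    return conn

def claim_file(conn, path, pid, file_hash=None):
    """Mark a file as 'processing' by this pid.

    Returns False if the file is already being processed, or was already
    completed/failed, so the caller should skip it and move on."""
    row = conn.execute("SELECT state FROM filelog WHERE path = ?",
                       (path,)).fetchone()
    if row is not None:
        return False
    conn.execute(
        "INSERT INTO filelog (path, state, pid, started_at, file_hash) "
        "VALUES (?, 'processing', ?, ?, ?)",
        (path, pid, time.time(), file_hash))
    conn.commit()
    return True

def mark(conn, path, state, extractor_sha=None, features_path=None):
    """Advance a file to a later state ('pending-submission', 'completed', ...)."""
    assert state in STATES
    conn.execute(
        "UPDATE filelog SET state = ?, extractor_sha = ?, features_path = ? "
        "WHERE path = ?",
        (state, extractor_sha, features_path, path))
    conn.commit()
```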

This would let multiple processes work on the same set of files, for example, since a file marked currently processing would be skipped by other workers. Storing things like timestamps, PIDs, essentia build hashes, and file hashes could let us do more automatically, such as retrying files that failed due to extractor issues once a new extractor is in use, or files that are retagged but not renamed (or renamed but otherwise unchanged).
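
Building on the hypothetical filelog sketch above, the automatic-retry part could be as simple as selecting rows whose recorded extractor build no longer matches the current one (and, similarly, re-hashing files on disk to catch retagged ones):

```python
# Hypothetical query against the filelog sketch above: find files that failed
# in the extractor under an older essentia build and are worth retrying.
def files_to_retry(conn, current_extractor_sha):
    rows = conn.execute(
        "SELECT path FROM filelog "
        "WHERE state = 'failed-extractor' AND extractor_sha != ?",
        (current_extractor_sha,))
    return [path for (path,) in rows]
```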

Overkill, useful, somewhere in between?

ianmcorvidae · Oct 15 '14

What is this going to solve in the long term? If we mark additional failures, we need a way of dealing with them as well. I could probably be convinced to do this, but here are some counters/comments to your specific points:

  • Your complex pid/sha/timestamp system only seems to fix a potential problem with the current approach of running find/parallel/timeout. You shouldn't be getting two processes working on the same file, because parallel distributes the work for you. If you want to quickly process a whole directory tree, wait for multiprocessing support.
  • No MBID: yes, this should probably be added.
  • Extractor issues: once we get the bugs fixed, hopefully this won't happen again. We already mark 'failed essentia' files. Storing the filename that the features are in only seems to catch the "we fail to submit" case. I don't think we need to add so much complexity here. Why don't we just retry the submission x times and, if that still fails (network disconnected?), just quit (see the sketch after this list)?
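
A sketch of that last point, where submit_features stands in for whatever function actually posts the features to the server (both names are made up for illustration, not anything in the client today):

```python
import time

def submit_with_retries(submit_features, features_path,
                        max_attempts=3, delay_seconds=5):
    """Try to submit up to max_attempts times, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            submit_features(features_path)
            return True
        except OSError as exc:  # e.g. network disconnected
            if attempt == max_attempts:
                print("giving up on %s after %d attempts: %s"
                      % (features_path, attempt, exc))
                return False
            time.sleep(delay_seconds)
```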

I agree we need to think a bit about how to submit when people get a new extractor build. Do we store the old extractor version? Do we tell people to just delete the history file?

alastair · Oct 16 '14

I think I agree on the pid/timestamp system with a 'processing' state, and about the 'pending submission' state. It just seemed like I should sketch out a really complete/complex system so that we can pull out the useful bits.

One thing that storing a hash of the file would solve, and which isn't handled otherwise, is retagging with newer data from MusicBrainz that doesn't change the file's location. Perhaps most notably, this can sometimes change the recording MBID, though it should usually be a change from a redirected MBID to a non-redirected one (and at present neither the client nor the server does anything about redirected MBIDs). Maybe we don't care about this, though, and just expect anyone using the data to look things up by recording MBID rather than using the tags section?
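
For what it's worth, the hash itself is cheap to sketch. The helper below is illustrative only (the client does not currently store hashes); because it hashes the whole file, a retag changes the hash while a plain move does not:

```python
import hashlib

def compute_file_hash(path, chunk_size=1 << 20):
    """SHA-1 of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def file_changed_since_processing(path, stored_hash):
    """True if the file on disk no longer matches the hash we recorded."""
    return compute_file_hash(path) != stored_hash
```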

ianmcorvidae · Oct 16 '14

> Extractor issues: Once we get bugs fixed, hopefully this won't happen again.

I disagree with this point: fixing bugs in the extractor is a must, but the client should be able to gracefully handle unexpected failures anyway. Future versions of the extractor will come with their own bugs ;)

+1 for the file hash; it would help to track files that were moved but otherwise unchanged, and files at the same location whose metadata has changed.

zas · Oct 17 '14