dejavu
dejavu copied to clipboard
significant performance gain when fingerprinting
So I did some poking around while getting this set up using a Postgres RDS instance and I noticed that the biggest bottleneck when fingerprinting an entire directory was inserting the fingerprints into the database. The current implementation using cur.executemany
performs individual insert operations for each fingerprint, but a bulk insert would be way more performant. I'm using postgres so I just patched my local version to use psycopg2.extras.execute_values
to test this and haven't looked at the mysql implementation, but if it's doing the same thing, then the same gain could be had there.
The results from an empty database, to a completed directory fingerprint which contained two albums and resulted in just over 2.8 million records in the fingerprint table were
________________________________________________________
Executed in 327.66 secs fish external
usr time 219.85 secs 0.00 micros 219.85 secs
sys time 104.46 secs 1315.00 micros 104.46 secs
I didn't get a benchmark on the current implementation because it was taking hours to run the inserts. The small change that I made in my code was in common_database.py
around line 170.
# I was using 1000 for batch_size, but depending on your infrastructure you might want to fine tune it. Maybe 10000 would even be ok, or maybe only 100
with self.cursor() as cur:
execute_values(cur, self.INSERT_FINGERPRINT_V2, values, "(%s, decode(%s, 'hex'), %s)", batch_size)
# for index in range(0, len(hashes), batch_size):
# cur.executemany(self.INSERT_FINGERPRINT, values[index: index + batch_size])
INSERT_FINGERPRINT_V2 looks like:
INSERT_FINGERPRINT_V2 = f"""
INSERT INTO "{FINGERPRINTS_TABLENAME}" (
"{FIELD_SONG_ID}"
, "{FIELD_HASH}"
, "{FIELD_OFFSET}")
VALUES %s ON CONFLICT DO NOTHING;
If I've got time, I'll try to polish up that solution a bit and make a PR, but I don't know how soon I'll have a chance to do that, so if someone else wants to go for it, have at 'er