
How to optimize the config for the audio tracks of a huge movie collection?

Open • VigibotDev opened this issue 6 months ago • 1 comment

I rewrote the Ruby layer so it only calls the C code directly, and I use the default (stock) config. The following PHP CLI script stores the cached acoustic fingerprints in a custom directory (only the exec() lines should interest you):

(The temporary files are kept on a tmpfs RAM disk.)

    foreach ($audiocodecs as $i => $audiocodec) {
        // One fingerprint file per audio stream, keyed by name, size and stream index.
        $acoustic = "$ACOUSTICDIR/$videobase.$filesize.$i.csv.gz";
        if (file_exists($acoustic))
            continue;

        // Decode audio stream $i to mono 16 kHz 32-bit float PCM.
        echo "$videofile\n";
        exec("ffmpeg -loglevel quiet -i \"$videofile\" -map 0:a:$i -ac 1 -ar 16000 -f f32le -acodec pcm_f32le \"$TMPRAW\"");

        // Print the fingerprints, compress them, then publish atomically via mv.
        echo "$acoustic\n";
        exec("Olaf/bin/olaf_c print \"$TMPRAW\" \"$videofile\" | gzip > \"$TMPGZ\" && mv \"$TMPGZ\" \"$acoustic\"");

        unlink($TMPRAW);
    }
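The gzip-then-mv step above can be sketched in isolation: because the rename is atomic on the same filesystem, a reader never sees a half-written .gz file. All paths below are illustrative, with /dev/shm standing in for the tmpfs RAM disk:

```shell
# Sketch of the compress-then-publish-atomically pattern.
TMPDIR_RAM=$(mktemp -d /dev/shm/olaf.XXXXXX 2>/dev/null || mktemp -d)
TMPGZ="$TMPDIR_RAM/fingerprints.csv.gz.part"
FINAL="$TMPDIR_RAM/fingerprints.csv.gz"

printf 'hash1,t1\nhash2,t2\n' | gzip > "$TMPGZ"    # write under a temp name first
mv "$TMPGZ" "$FINAL"                               # rename is atomic on the same fs

gunzip -c "$FINAL"    # round-trips the two original lines
```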

And to build the B+ tree, I loop over all the audio tracks (movies average about 1h40 to 2h):

    foreach ($audiocodecs as $i => $audiocodec) {
        $acoustic = "$ACOUSTICDIR/$videobase.$filesize.$i.csv.gz";
        if (!file_exists($acoustic))
            continue;

        echo "$videofile\n";
        // exec() appends to $output, so reset it on every iteration
        // or the CSV grows with lines from all previous tracks.
        $output = [];
        exec("gunzip -c \"$acoustic\"", $output);

        // Prefix every fingerprint line with an identifier for this track.
        $content = "";
        foreach ($output as $line)
            $content .= "1/1,$videobase.$i,$line\n";
        file_put_contents($TMPCSV, $content);

        echo "$acoustic\n";
        exec("Olaf/bin/olaf_c store_cached \"$TMPCSV\"");
    }
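The gunzip-into-PHP round trip above can also be done as one stream, which avoids buffering a whole track into a PHP array. A minimal sketch with made-up paths and track id:

```shell
# Prefix every fingerprint line with "1/1,<track id>," in a single pipeline.
# The id, paths and line contents are illustrative.
id="mymovie.0"
printf 'h1,t1\nh2,t2\n' | gzip > /tmp/acoustic.csv.gz

gunzip -c /tmp/acoustic.csv.gz | sed "s|^|1/1,$id,|" > /tmp/olaf_import.csv

cat /tmp/olaf_import.csv
# 1/1,mymovie.0,h1,t1
# 1/1,mymovie.0,h2,t2
```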

I can generate all my audio-track fingerprints from the movies, and the .gz files are lightweight, but the B+ tree integration step is too slow and the database becomes too large, even on my i9 / 64GB DDR5 / PCIe SSD machine :(

How can you store 340 days of audio (around 800GB of mp3s) inside a 15GB database? Mine grows much faster than that with your default config https://github.com/JorenSix/Olaf/blob/master/src/olaf_config.c .... and I have about 4000 days (10 years!) of sound to index!!!!! Game over lol. Extrapolating from your result, it should fit in roughly a 150GB database, which would be fine for me, but that is not the case: my database rapidly grows to multiple terabytes, and rebuilding the B+ tree from all the lightweight fingerprints.csv.gz files is slow as a snail.
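As a back-of-envelope check on that extrapolation: the rates below (~30 fingerprints per second, ~16 bytes per entry) are my own guesses, back-fitted to the reported "340 days in 15GB" data point, not numbers from Olaf itself:

```shell
# Rough database-size estimate: days * 86400 s/day * fp_per_s * bytes_per_fp.
# 30 fp/s and 16 B/entry are assumptions fitted to the reported 340 d -> 15 GB.
estimate_gb() {
    days=$1
    awk -v d="$days" 'BEGIN { printf "%.1f\n", d * 86400 * 30 * 16 / 1e9 }'
}

estimate_gb 340    # ~14.1 GB, close to the reported 15 GB
estimate_gb 4000   # ~165.9 GB, in line with the ~150 GB extrapolation
```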

VigibotDev • Feb 15 '24 18:02

Interesting to hear about your experience. I have indeed tested Olaf with 800GB of music, but have not experimented with larger datasets or with non-music (less information-rich) signals.

Some insights/pointers/possible optimisations:

  • Depending on how similar the reference and query audio are, and on how long a query is allowed to be, the configuration can be optimised: if you want to support short (1s), noisy queries you need a lot of fingerprints; if you can deal with long (20s+), clear queries, the system can be configured to use far fewer fingerprints.
  • To speed up db creation you might want to look at LMDB's bulk import, which expects sorted keys. The idea is to extract all fingerprints, sort them externally, and do a single bulk import. Olaf does not support this out of the box, but it is a potential performance improvement.
  • There will be a lot of duplicate fingerprints in your dataset. I suspect the slowdown and size increase are related to these hash collisions: a B+ tree essentially degrades into an inefficient list when many keys collide.
  • To further reduce the number of fingerprints you might want to look at a silence threshold. Currently Olaf also extracts fingerprints from quiet parts, which could perhaps be skipped for your use case.
  • Olaf can be distributed: perhaps you can maintain 10 separate instances and put an API in front of them to distribute queries and merge the results.
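The external-sort idea, and a quick way to measure how many duplicate hashes a dataset contains, can be sketched with standard tools. The "hash,track,time" CSV layout here is a made-up stand-in, not Olaf's actual on-disk format:

```shell
# Sketch: sort extracted fingerprints by hash key (as a bulk import with
# sorted keys would require), then count duplicates per hash.
cat > /tmp/fp.csv <<'EOF'
42,movieB,10.0
17,movieA,1.5
42,movieA,3.2
17,movieC,7.7
EOF

sort -t, -k1,1n /tmp/fp.csv > /tmp/fp.sorted.csv   # keys in order for bulk import

# How often does each hash occur? Heavy hitters explain B+ tree blow-up.
cut -d, -f1 /tmp/fp.sorted.csv | uniq -c | sort -rn
```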
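For the distribution idea, a minimal sharding rule might look like the following; the choice of 10 instances comes from the suggestion above, and cksum is just one cheap, deterministic hash:

```shell
# Sketch: route each track to one of 10 Olaf instances by hashing its id.
# Queries would then be fanned out to all instances and the results merged.
shard_for() {
    printf '%s' "$1" | cksum | awk '{ print $1 % 10 }'
}

shard_for "mymovie.0"   # same id always maps to the same shard
```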

Good luck with your project!

JorenSix • Mar 12 '24 13:03