TMSU icon indicating copy to clipboard operation
TMSU copied to clipboard

tmsu files [--count] $ONETAG is surprisingly slow

Open 0ion9 opened this issue 10 years ago • 7 comments

Possibly a minor issue, but I just spotted this interesting discrepancy:

time sqlite3 .tmsu/db 'select count(*) from (select 1 from file_tag where tag_id = (select id from tag where name = "tagname") group by file_id)'`

420825
sqlite3 .tmsu/db   0.68s user 0.03s system 99% cpu 0.712 total
time tmsu -D .tmsu/db files --count tagname
420825
tmsu -D .tmsu/db files --count tagname  14.22s user 1.15s system 100% cpu 15.365 total

Now, unless I misunderstand what TMSU is supposed to be doing, my query above should always achieve the same result. And it's 22 times faster, which leads me to think that something TMSU is doing is rather inefficient for this case.

More seriously, the analogous non --count command is 11 times faster than tmsu files:

time tmsu -D .tmsu/db files tagname > /dev/null
tmsu -D .tmsu/db files tagname > /dev/null  16.10s user 1.45s system 100% cpu 17.547 total
time sqlite3 -separator / .tmsu/db 'select F.directory,F.name from file_tag as FT join file as F on F.id=FT.file_id where FT.tag_id = (select id from tag where name = "tagname") group by file_id' > /dev/null
sqlite3 -separator / .tmsu/db  > /dev/null  1.44s user 0.15s system 99% cpu 1.591 total

(all timings are done from a warm start -- that is, running the command twice consecutively and taking the second timing. Cold starts add up to 2s to TMSU timings, up to 1s to sqlite3 timings.)

I understand why more generalized code would be much slower -- have to do grouping so that you get only files for which all the criteria are satisfied, rather than files for which one of the criteria are satisfied, but IME, the single positive tag case is common enough to deserve its own fast code path.

0ion9 avatar Feb 04 '15 08:02 0ion9

Hi and thanks for reporting this.

Yes that does seem slow. I'll take a look.

oniony avatar Feb 04 '15 10:02 oniony

time sqlite3 .tmsu/db 'select count(*) from (select 1 from file_tag where tag_id = (select id from tag where name = "tagname") group by file_id)'`

A fairer comparison is if you replace the 1 with *.

I have run some tests on a dummy database of 1,000 tags applied to 4,000 files. On a tag applied to 3,000 of the files I get TMSU taking 0.34s and Sqlite3 taking 0.04s. On a tag applied to 1,000 of the files I get TMSU at 0.11s and Sqlite3 at 0.02s.

oniony avatar Feb 06 '15 18:02 oniony

I'll have a look at improving the performance but I think this is acceptable performance for 0.5.0 release. I'll will spend some time post release trying to get the times down though.

oniony avatar Feb 06 '15 19:02 oniony

I've just revisited this. In my (limited) tests it appears TMSU 0.7.0 takes four times as long as the sqlite query. Do you still have concerns about performance?

oniony avatar Dec 17 '15 07:12 oniony

After reviewing this, obviously my original case was flawed, as the sqlite equivalent did not process implications and thus might produce only a subset of the files produced by tmsu.

My tests show 12s warm for 'tmsu files --count $TAG' on a 233k-files query, 13s warm for tmsu files $TAG and 0.4s warm for sqlite count query (modified with select * rather than select 1). A fully corrected query -- I arrived at select count(*) from (select distinct file_id from file_tag where tag_id in (select id from tag where id in ($TAGS_THAT_IMPLY_FOO, $FOO))) -- takes 0.7s warm, with a variant that outputs paths rather than a count taking 1.4s warm.

More tests are definitely required, but those figures seem to suggest that direct SQL is 9.2x faster than tmsu when listing files (F.directory || "/" || F.name), and 17.1x faster when just counting them. I doubt the validity of the figures I posted in the OP, but these new figures seem to indicate that there is a substantial slowdown. It's certainly not a major issue; it does seem dubious from an outside perspective that the extra processing TMSU does (path normalization, AFAIK) would introduce such a large overhead.

EDIT: I don't currently have any cold start timings.

0ion9 avatar Dec 18 '15 03:12 0ion9

Many thanks for that. Seems there is room for improvement. Obviously TMSU has an overhead but that does seem rather large. I'll try to take some time to look at this but the problem with this type of issue is that they're usually quite time consuming to diagnose so hopefully I'll get some time over Christmas.

oniony avatar Dec 18 '15 11:12 oniony

I'll take a look at this next I think.

oniony avatar Apr 02 '16 16:04 oniony