supysonic icon indicating copy to clipboard operation
supysonic copied to clipboard

scanner uses a lot of memory

Open anarcat opened this issue 5 years ago • 8 comments

Hi!

I've finally found the time to give supysonic a try, tired of dealing with the other Java monstrosities that populate the Subsonic ecosystem. ;) So, first: thanks for providing such an alternative, it's a breath of fresh air.

I have found that the initial scan of a large-ish music database (30 000+ songs) takes a long time (30 minutes with sqlite, 60 minutes with mysql) and a lot of memory (700MB). If I read the code correctly, entries are all recorded in memory before being flushed out to disk all at once. Is that correct?

I think it would be more efficient to write those entries directly to the database instead. It would be faster and less memory hungry.

As an aside, is there any reason why os.walk is not used to walk the file hierachy? I understand it used to be much slower than listdir in previous versions, but since we're stat'ing all files anyways, the performance impact shouldn't be that bad. And besides, newer versions (py 3.5) use os.scandir so it's actually much faster now... All I can see from the git history is 954c75bc which says something about file permissions....

If you're open to the idea, I would like to reimplement parts of the scanner so that it's more efficient and flexible. I would like, for example, to follow symlinks (#48) and ignore certain directories as well.

anarcat avatar Apr 04 '19 03:04 anarcat

Hello!

I'm surprised with your scanning times 😮. I have a library with around 29 000 tracks and it 8 minutes on my computer. But I know the scanner has this issue of opening media files twice (once for metadata and another to check if there's an embedded cover art) that most likely affects its performance. There's no explicit flush nor commit to the database. I'll have to check but I think Pony flushes automatically before an artist/album lookup. It might keep the data cached to rollback the transaction in case of an error though.

os.walk was used before but replaced by os.listdir because it was causing errors with paths that contained invalid unicode characters (see #85). IIRC two errors were raised:

  • one in os.walk code, when concatenating path elements
  • another later when inserting in the database (hence the try/except in https://github.com/spl0k/supysonic/blob/270fa9883b2f2bc98f1482a68f7d9022017af50b/supysonic/scanner.py#L61-L65) If I'm not mistaken the first one only happened with Python 2, Python 3 replacing invalid characters by surrogates which were rejected by Pony or the database (causing the second error)

Go ahead if you have your idea on how to improve the whole thing. I'm not closed to symlinks, but that could lead to potential issues if they create loops on the file system, or point to folders that are already tracked by supysonic.

spl0k avatar Apr 06 '19 12:04 spl0k

I'm surprised with your scanning times open_mouth. I have a library with around 29 000 tracks and it 8 minutes on my computer. But I know the scanner has this issue of opening media files twice (once for metadata and another to check if there's an embedded cover art) that most likely affects its performance.

Performance might be different because my music database is on spinning rust, are you using a SSD?

There's no explicit flush nor commit to the database. I'll have to check but I think Pony flushes automatically before an artist/album lookup. It might keep the data cached to rollback the transaction in case of an error though.

Yeah, I think it might be worth flushing the good stuff once in a while. As things stand now, it looks like all the entries are committed to the database at once, at the very end, and it takes about a minute to do so (gross estimate). I don't have the feeling it can recover from an interruption either, which is a bummer when you're running a process that takes half an hour... ;)

os.walk was used before but replaced by os.listdir because it was causing errors with paths that contained invalid unicode characters (see #85). IIRC two errors were raised: [concat, insert]

So I think you're correct in that concatenating entries will cause Python to crash because it tries to decode the bytes into whatever the environment says it should be. And yes, a similar will occur on insert as well.

The thing with pathnames in UNIX is they are not UTF-8. They can be encoded in any shape or form. In my filesystem, I have files encoded as ASCII, UTF-8, UTF-16 and ISO-8859-1. I know, it's bad, and I should clean that up, but it's not actually breaking any standard whatsoever. And the problem is that people are using various weird encodings for their filenames still. And because we exchange those files around, this is bound to cause problems again.

So my position on filenames is they should always be treated as binary, until they are displayed to the user, at which point an educated guess should be made to try and show something meaningful. There are various ways to do that, including the surrogates techniques, but in any case, broken display shouldn't hurt scanning.

I should also mention that I think Python 2 support is now overrated. ;) I think it's now perfectly acceptable for a "leaf" package like Supysonic to switch to Python 3 completely, even even require support for Python 3.5 for example. Py3 is well distributed and established in all major distributions and keeping Python 2 compatibility shims will just make your work harder for no good reason. ;)

I'll see if I can come up with a better scanner. And preferably one that is verifiably better - do you have any advice on how I could hook performance tests to the test suite?

Oh, and regarding symlinks I'd say: garbage in, garbage out. If people create loops in their filesystem, they get to see supysonic loop forever. Maybe they like stuff like that, I don't know why people would create such atrocities. ;) But from a security perspective, I think it's sufficient to have a flag (off by default) that allows users to say "I trust this filesystem, there are no loops" and enable symlink following. It would certainly be sufficient to my use case anyways. :)

Thanks so much for your feedback, it's very valuable!

anarcat avatar Apr 06 '19 13:04 anarcat

Oh, and about encoding, there are things like chardet that can be used to make an educated guess about character encoding when utf-8 fails as well.. It's slow, but the idea is it's a fallback when all else fails.

anarcat avatar Apr 06 '19 13:04 anarcat

are you using a SSD?

Yes.

So my position on filenames is they should always be treated as binary, until they are displayed to the user

So you're suggesting we store paths as binary? There are times where filenames are indeed presented to the user:

  • if a media file is missing metadata, the filename is used in place of a track title
  • the Subsonic API allows browsing using the folder hierarchy. I think this a bit stupid nowadays and considered replacing this with the "browse via tags" version, but I guess some users might still find some use to this strategy.

I should also mention that I think Python 2 support is now overrated. ;) I think it's now perfectly acceptable for a "leaf" package like Supysonic to switch to Python 3 completely

I'm patiently waiting for Python 2's EOL to remove all that compatibility layer. Doing it earlier crossed my mind but it seems some people still use Py2.7 for Supysonic (eg. #137).

do you have any advice on how I could hook performance tests to the test suite?

err sorry no.

But from a security perspective, I think it's sufficient to have a flag (off by default) that allows users to say "I trust this filesystem, there are no loops" and enable symlink following.

That's so simple I didn't even think about it :D

spl0k avatar Apr 06 '19 19:04 spl0k

So my position on filenames is they should always be treated as binary, until they are displayed to the user

So you're suggesting we store paths as binary?

Yes, actually. Or at least something (e.g. ISO-8859-1) that is 8bit doesn't need to be encoded/decoded.

There are times where filenames are indeed presented to the user:

  • if a media file is missing metadata, the filename is used in place of a track title
  • the Subsonic API allows browsing using the folder hierarchy. I think this a bit stupid nowadays and considered replacing this with the "browse via tags" version, but I guess some users might still find some use to this strategy.

Yeah that's fine. You can do the decode stuff then, but it's important to keep a handle on the binary in the back.

I'm patiently waiting for Python 2's EOL to remove all that compatibility layer. Doing it earlier crossed my mind but it seems some people still use Py2.7 for Supysonic (eg. #137).

I honestly have very little patience for this nowadays. :) For libs I understand, because sometimes there's this one library that hasn't been ported so the entire project still needs to work with python 2. But Supysonic is not a library, really... It's a end-user program, so I don't get why someone would run it as Python 2.

(Well, that's actually a lie: I had to run it as Python 2 on Debian stretch + Apache + WSGI because the WSGI bindings are py2 only. But then I could have set things up with fcgi as well and it should just have worked...)

But from a security perspective, I think it's sufficient to have a flag (off by default) that allows users to say "I trust this filesystem, there are no loops" and enable symlink following.

That's so simple I didn't even think about it :D

Hehe.. Glad to be of service. ;)

anarcat avatar Apr 06 '19 22:04 anarcat

because the WSGI bindings are py2 only

Are you sure?

spl0k avatar Apr 07 '19 12:04 spl0k

gah, didn't find that one when i looked i guess. weird. it was pretty late. :p

anarcat avatar Apr 07 '19 12:04 anarcat

Oh btw, the full scan no longer use a single database session (as Pony calls it) for the whole process. Please tell me if that helps with the original issue.

spl0k avatar Jun 19 '19 13:06 spl0k