Make CherryMusic ignore hidden files

Open tilboerner opened this issue 9 years ago • 6 comments

Proposal spawned by #518 (in which CherryMusic scans a .git directory):

Hidden files in the basedir should be completely ignored by CherryMusic. CherryMusic's API should behave as if they didn't exist.

For our purposes, a file is "hidden" if its name starts with a dot (.). Not worth it to accommodate Windows here.

We should make sure that hidden files are:

  • [ ] not scanned,
  • [ ] not in file database,
  • [ ] not in browse results,
  • [ ] not in search results,
  • [ ] not served.

Some of these cases are already handled this way, but CM should be consistent here.

@devsnd, @6arms1leg: Interested to hear your comments. Any reasons why we shouldn't do this? Do we need a whitelist? If so, can we expect that list to remain small and manageable?

tilboerner avatar Mar 05 '15 11:03 tilboerner

I see no reason for a whitelist. Just skip/hide hidden files everywhere. And I don't think we need to remove them retroactively: this would happen automatically when rescanning the files anyway, if we make sure the filedb does not index them.

Reply to this email directly or view it on GitHub: https://github.com/devsnd/cherrymusic/issues/520

devsnd avatar Mar 06 '15 13:03 devsnd

No objections here, either. :+1: Also, I don't see the need for a whitelist.

6arms1leg avatar Mar 07 '15 13:03 6arms1leg

I looked at some example data. Filtering dot-anything is a bad idea. In fact, I think we need a smarter filter, or a blacklist, or we don't do this at all.

Will post some examples when back @ keyboard.

tilboerner avatar Mar 07 '15 13:03 tilboerner

Alright, the vast majority of "true" hidden names are short and contain only alphabetical characters after the initial ., with the exception of maybe ONE more of these: ._-. We'll never have absolute certainty, but given the following examples I came across, we should be fine using

[.][a-zA-Z]+

as a filter.

Interestingly, all files I found starting with a . were good to be ignored; only directories were problematic. I wouldn't consider that a rule, though.


Here are some example "dotfile" (and directory) names. Some of the name-y bits are altered to protect the privacy of my data source. :smile_cat: Non-alphabetic characters are the same as in the actual name.

Example directory things that SHOULD NOT be filtered out:

  • ... Damning Stinkwell Up Integration The C***
  • ...To Be Fitted (1971)
  • .Decompulse_Glycolipids_Dressionist_Stylize_Biopoly
  • .Gormants.Actuation.1994
  • .And.Now.The.Introids.Empty.1991
  • .O.Prebinding

Example file things that are pure metadata and SHOULD be filtered out:

  • ._01 Her Swatchtowering Are Deportorial The Stockman.mp3
  • ._07 Tautly The Ammonial.mp3
  • ._2-08 Strumpet You Running Bagpipes.mp3
  • .08 - Synkaryocytic Equability.mp3.tenebriating7c (caught by isplayable)

Here are some clear names we want to filter:

  • Directories
    • .AppleDouble
    • .FBCLockFolder
    • .mediaartlocal
    • .git
  • Files (caught by isplayable):
    • .DS_Store
    • .Parent
    • .date
    • .message
    • .mp3genre
    • .ioFTPD
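
A quick way to sanity-check the [.][a-zA-Z]+ filter against the names above. This is only a sketch: it assumes the pattern has to match the whole name (hence fullmatch), and is_hidden is a made-up helper name, not anything in CherryMusic.

```python
import re

# Assumption: the pattern must cover the entire name, not just a prefix.
# With a prefix match, ".Decompulse_..." would be caught via ".Decompulse".
HIDDEN = re.compile(r"[.][a-zA-Z]+")


def is_hidden(name):
    """Hypothetical helper: True if the whole name looks like a 'true' hidden name."""
    return HIDDEN.fullmatch(name) is not None


# Names from the "SHOULD NOT be filtered out" examples.
keep = [
    "...To Be Fitted (1971)",
    ".Decompulse_Glycolipids_Dressionist_Stylize_Biopoly",
    ".Gormants.Actuation.1994",
    ".O.Prebinding",
]

# Directory names from the "clear names we want to filter" examples.
drop = [".AppleDouble", ".FBCLockFolder", ".mediaartlocal", ".git"]

print([is_hidden(n) for n in keep])
print([is_hidden(n) for n in drop])
```

Note that .DS_Store and .mp3genre contain an underscore or a digit, so this exact pattern does not match them; in the lists above they are among the files caught by isplayable.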

tilboerner avatar Mar 07 '15 23:03 tilboerner

Alright, I'd propose to use a regex blacklist. We can compile the regexes on server startup so there shouldn't be any noticeable performance difference. This list of filters might do the trick:

[
    '\.AppleDouble',
    '\.FBCLockFolder',
    '\.mediaartlocal',
    '\.git',
    '\.DS_Store',
    '\.Parent',
    '\.date',
    '\.message',
    '\.mp3genre',
    '\.ioFTPD',
    '\._.*',
]

But of course there are more to come. Regex might already be overkill but I'd rather be safe than sorry.
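
Roughly what compiling the list once and checking names against it could look like. A sketch only: BLACKLIST, COMPILED and is_blacklisted are hypothetical names, the patterns are written as raw strings, and matching against the whole name is an assumption rather than something decided in this thread.

```python
import re

# Patterns copied from the list above, written as raw strings so the
# backslashes reach the regex engine unchanged.
BLACKLIST = [
    r"\.AppleDouble",
    r"\.FBCLockFolder",
    r"\.mediaartlocal",
    r"\.git",
    r"\.DS_Store",
    r"\.Parent",
    r"\.date",
    r"\.message",
    r"\.mp3genre",
    r"\.ioFTPD",
    r"\._.*",
]

# Compile once, e.g. at server startup, and reuse for every lookup.
COMPILED = [re.compile(pattern) for pattern in BLACKLIST]


def is_blacklisted(name):
    """Hypothetical helper: True if any pattern matches the entire name."""
    return any(rx.fullmatch(name) for rx in COMPILED)


print(is_blacklisted(".git"))
print(is_blacklisted("._07 Tautly The Ammonial.mp3"))
print(is_blacklisted(".O.Prebinding"))
```

With whole-name matching, a name like .github would not be caught by \.git; whether that is the desired behaviour is an open choice.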

devsnd avatar Mar 14 '15 13:03 devsnd

Yeah, let's treat the strings as regexes from the start.

Would it be a bad idea to concat them all into a single expression like ^(expr1)|(expr2)|...$? For example, if the list grows a bit longer?
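
One thing to watch when concatenating: alternation binds more loosely than the anchors, so in ^(expr1)|(expr2)$ the ^ applies only to the first alternative and the $ only to the last. A sketch of a grouped variant, using a subset of the list for brevity; combined and is_blacklisted are hypothetical names, and using fullmatch instead of explicit anchors is an assumption.

```python
import re

# Subset of the proposed blacklist, for brevity.
patterns = [r"\.git", r"\.DS_Store", r"\._.*"]

# The anchored form as written in the question: ^ only guards the first
# alternative and $ only the last one.
naive = re.compile(r"^(\.git)|(\.DS_Store)$")
print(bool(naive.search(".gitignore")))   # first alternative matches the prefix
print(bool(naive.search("x.DS_Store")))   # second alternative matches the suffix

# Wrap each pattern in a non-capturing group, join with "|", and require
# the whole name to match instead of relying on ^ and $.
combined = re.compile("|".join("(?:%s)" % p for p in patterns))


def is_blacklisted(name):
    """Hypothetical helper: True if the combined pattern matches the entire name."""
    return combined.fullmatch(name) is not None


print(is_blacklisted(".gitignore"))
print(is_blacklisted("._01 Her Swatchtowering Are Deportorial The Stockman.mp3"))
```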

I'm wondering about this little fella: \._.*. What's he up to? He's a leading space away from the .Decompulse_Glycolipids_Dressionist pattern, which we want to allow, so I don't trust him very much. But I'm willing to try coexisting with him.

By the way, have you heard there's a new album by the_underscores, that well-known band of indie python coders? It's titled .___ (pronounced "attribute whitespace"). ;) But yes, I agree, they are quite silly and can't expect to be indexed by anyone.

tilboerner avatar Mar 14 '15 15:03 tilboerner