beets Some duplicate_action settings are not thread safe

When importing a ~1000 track collection in to beets I experienced the following error. My guess is this is due to using duplicate_action: merge whilst having threaded: yes as when I set threaded: no, the problem went away (albeit it took much longer to import).

For example, consider a compilation that is imported from the following directory structure:

/root/
  - foo/foo.mp3
  - bar/bar.mp3
  - zar/zar.mps

And then during the import, beets attempts to merge it with multiple threads in to

/root/
  - compilation/bar.mp3
  - compilation/foo.mp3
  - compilation/zar.mps

Then my guess is that perhaps during the merge, threads can conflict on the ID of a track.

If this is too complicated to fix then I think the documentation should atleast be updated stating certain actions are not thread safe: https://beets.readthedocs.io/en/stable/reference/config.html#duplicate-action

Problem

The content of stderr is 'Traceback (most recent call last):
  File "/lsiopy/lib/python3.11/site-packages/beets/util/functemplate.py", line 574, in substitute
    res = self.compiled(values, functions)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/functemplate.py", line 598, in wrapper_func
    args[VARIABLE_PREFIX + varname] = values[varname]
                                      ~~~~~~^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 428, in __getitem__
    value = self._get(key)
            ^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 415, in _get
    if self.for_path and key in self.album_keys:
                                ^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/__init__.py", line 1081, in wrapper
    value = func(self)
            ^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 396, in album_keys
    if self.album:
       ^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 409, in album
    return self.item._cached_album
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 581, in _cached_album
    self.__album.load()
  File "/lsiopy/lib/python3.11/site-packages/beets/dbcore/db.py", line 562, in load
    assert stored_obj is not None, f"object {self.id} not in DB"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: object 8 not in DB
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "/lsiopy/bin/beet", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/ui/__init__.py", line 1285, in main
    _raw_main(args)
  File "/lsiopy/lib/python3.11/site-packages/beets/ui/__init__.py", line 1272, in _raw_main
    subcommand.func(lib, suboptions, subargs)
  File "/lsiopy/lib/python3.11/site-packages/beets/ui/commands.py", line 973, in import_func
    import_files(lib, paths, query)
  File "/lsiopy/lib/python3.11/site-packages/beets/ui/commands.py", line 943, in import_files
    session.run()
  File "/lsiopy/lib/python3.11/site-packages/beets/importer.py", line 340, in run
    pl.run_parallel(QUEUE_SIZE)
  File "/lsiopy/lib/python3.11/site-packages/beets/util/pipeline.py", line 446, in run_parallel
    raise exc_info[1].with_traceback(exc_info[2])
  File "/lsiopy/lib/python3.11/site-packages/beets/util/pipeline.py", line 358, in run
    self.coro.send(msg)
  File "/lsiopy/lib/python3.11/site-packages/beets/util/pipeline.py", line 170, in coro
    task = func(*(args + (task,)))
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/importer.py", line 1566, in manipulate_files
    task.manipulate_files(
  File "/lsiopy/lib/python3.11/site-packages/beets/importer.py", line 757, in manipulate_files
    item.move(operation)
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 915, in move
    dest = self.destination(basedir=basedir)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 979, in destination
    subpath = self.evaluate_template(subpath_tmpl, True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/dbcore/db.py", line 625, in evaluate_template
    return template.substitute(self.formatted(for_path=for_path),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/functemplate.py", line 576, in substitute
    res = self.interpret(values, functions)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/functemplate.py", line 568, in interpret
    return self.expr.evaluate(Environment(values, functions))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/functemplate.py", line 256, in evaluate
    out.append(part.evaluate(env))
               ^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/functemplate.py", line 163, in evaluate
    if self.ident in env.values:
       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen _collections_abc>", line 780, in __contains__
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 428, in __getitem__
    value = self._get(key)
            ^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 415, in _get
    if self.for_path and key in self.album_keys:
                                ^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/util/__init__.py", line 1081, in wrapper
    value = func(self)
            ^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 396, in album_keys
    if self.album:
       ^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 409, in album
    return self.item._cached_album
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/lsiopy/lib/python3.11/site-packages/beets/library.py", line 581, in _cached_album
    self.__album.load()
  File "/lsiopy/lib/python3.11/site-packages/beets/dbcore/db.py", line 562, in load
    assert stored_obj is not None, f"object {self.id} not in DB"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: object 8 not in DB

My configuration (output of beet config) is:

plugins: web
directory: /music
library: /config/musiclibrary.blb
art_filename: albumart
threaded: yes
original_date: no
per_disc_numbering: no
 
convert:
    auto: no
    ffmpeg: /usr/bin/ffmpeg
    opts: -ab 320k -ac 2 -ar 48000
    max_bitrate: 320
    threads: 1
   
# don't mind if tracks are missing from an album
match:
  distance_weights:
    missing_tracks: 0
 
paths:
    default: $albumartist/$album%aunique{}/$track - $title
    singleton: Non-Album/$artist - $title
    comp: Compilations/$album%aunique{}/$track - $title
    albumtype_soundtrack: Soundtracks/$album/$track $title
        
import:
    write: yes
    # we copy so pymix can remove the import src dir when the
    # process has completed successfully.
    copy: yes
    move: no
    resume: ask
    incremental: no
    incremental_skip_later: no
    # if duplicates found then merge them in. This can happen
    # if tracks for a compilation are split over multiple
    # directories. Then each time beets tries to import one of
    # the directories it will think that it is importing a
    # duplicate compilation album. This is acceptable since the
    # tracks should be distinct so we just ask beets to merge.
    duplicate_action: merge
    quiet_fallback: asis
    timid: no
    log: /config/beet.log
 
      #lastgenre:
      #    auto: yes
      #    source: album
 
embedart:
    auto: yes
 
fetchart:
    auto: yes
   
replaygain:
    auto: no
 
scrub:
    auto: yes
 
replace:
    '^\.': _
    '[\x00-\x1f]': _
    '[<>:"\?\*\|]': _
    '[\xE8-\xEB]': e
    '[\xEC-\xEF]': i
    '[\xE2-\xE6]': a
    '[\xF2-\xF6]': o
    '[\xF8]': o
    '\.$': _
    '\s+$': ''
 
web:
    host: 0.0.0.0
    port: 8337

Thanks

Luke Purnell

Oct 10 '23 11:10 laker-93

Thanks for the report! This looks bad indeed. It looks tricky to debug, but we should try nonetheless.

The problem here actually seems to come from a performance optimization. To avoid needing to constantly re-fetch the album information associated with every item, we have a cache for the album structure in memory: https://github.com/beetbox/beets/blob/e10b955a931e4c205b0cadf0860797c0aeee736c/beets/library.py#L607-L620

What has happened here is that an item's album has been deleted (presumably as part of the duplicate merge), so the later attempt to refresh the album's metadata on demand fails.

A possible solution here would just be to catch this kind of exception and, when the "refresh" fails, to look it up again from the database. This solution could be problematic if album IDs are reused.

Here is an idea, for anyone who is interested: I would love to know how much performance this is actually saving us. Whenever we do item._cached_album, one of two things can happen: https://github.com/beetbox/beets/blob/e10b955a931e4c205b0cadf0860797c0aeee736c/beets/library.py#L616-L619

That is, we either fetch the entire album "from scratch" (self._db.get_album(self)) or we refresh the existing album (self.__album.load()). The code clearly intends for the latter to be faster than the former. But is it really? I don't quite see why reloading all the contents would be that much faster than reconstructing the Python object from those contents. Maybe someone would be interested in measuring the real-world performance impact of disabling this optimization (i.e., always using the first route)?

Another possible fix, if we want to keep this optimization, would just be to check whether self.__album.id == self.album_id. If not, then fall back to the slow path. (This seems like an easy change.)

Oct 13 '23 19:10 sampsyo

Thanks for the information here. I re ran the import with the library of music that initially caused this problem on a more powerful machine and have not hit the issue again. So it's possible this issue only presents itself on a resource constrained environment.

The music collection consisted of around 700 tracks. I ran the import with and without threading enabled and with and without the suggested change (only importing the album from scratch with self._db.get_album(self)). The results were as follows (total time give in bold with units minutes:seconds.milliseconds):

(With threading from scratch) beet import /Users/lukepurnell/rb_export/lajp/subbox_export 26.09s user 10.06s system 4% cpu 12:24.96 total

(With threading no changes) beet import /Users/lukepurnell/rb_export/lajp/subbox_export 21.98s user 9.30s system 4% cpu 12:27.97 total

(Without threading from scratch) beet import /Users/lukepurnell/rb_export/lajp/subbox_export 26.87s user 10.61s system 5% cpu 12:26.61 total

(Without threading no changes) beet import /Users/lukepurnell/rb_export/lajp/subbox_export 28.89s user 10.53s system 5% cpu 12:28.16 total

As you can see, in both threading and no threading cases, fetching the album from scratch regardless is marginally quicker. Also there is only a marginal gain from running with threading.

In terms of progressing this issue. Perhaps we need to do more testing to ensure that fetching from scratch every time is not going to cause a degradation in some other use case. Perhaps you can think of some tests to run to confirm this?

Also, have you considered multi-processing instead of multi-threading for performance? Due to the GIL, this may result in more meaningful performance gains.

Dec 28 '23 17:12 laker-93

beets beets copied to clipboard

Some duplicate_action settings are not thread safe

Problem

beets
beets copied to clipboard