beets icon indicating copy to clipboard operation
beets copied to clipboard

Allow to configure which fields are used to find duplicates

Open jcassette opened this issue 3 years ago • 14 comments

Description

This adds an option duplicate_keys which allows to configure the behavior of the importer when it searches for duplicates. For example, one could add the format field so that the importer will always keep albums that have different formats.

Can you review this please? I still have to implement it for singletons. Also what do you think about allowing to use a Python function in duplicate_action for full customization?

To Do

  • [x] Documentation
  • [x] Changelog
  • [x] Tests

jcassette avatar Dec 18 '21 21:12 jcassette

Hi @sampsyo and thanks for the review I have updated this with your comments

jcassette avatar Jan 02 '22 16:01 jcassette

It seems I can't trigger the CI checks somehow...

jcassette avatar Jan 02 '22 16:01 jcassette

Awesome! I believe the problem is that the CI doesn't trigger when the branch has conflicts. I've resolved the conflict (in the changelog file) and the tests are running now.

sampsyo avatar Jan 03 '22 17:01 sampsyo

This pull request would enable my use case where I want to import several releases of the same artist which are untitled. Other fields like the catalog number or year would identify the releases however.

@jcassette After reading your pull request I tried to re-implement a very basic working example for the fields (artist, album, year).

I found out that I can import the same album again and again without getting the duplicate warning if one of the fields is None, like the year for example.

allesmi avatar Jan 19 '22 07:01 allesmi

Hi @allesmi and thanks for trying this Can you provide a test case which reproduces this issue? I tried to set "flex" fields to None in test_keep_when_extra_key_is_different and the album was not imported.

jcassette avatar Jan 22 '22 14:01 jcassette

Sorry about the noise - I got a bit confused with git I have fixed the tests and implemented the feature for singletons @sampsyo can you review this please ? I think this PR is complete now

jcassette avatar Jan 22 '22 21:01 jcassette

Hi @allesmi and thanks for trying this Can you provide a test case which reproduces this issue? I tried to set "flex" fields to None in test_keep_when_extra_key_is_different and the album was not imported.

Unfortunately I do not have time to show this as code. Let me instead describe what I have manually tried with your most recent commit: I have imported the album Ketsa - May Starlight Find You (musicbrainz release, download from Free Music Archive). Note that musicbrainz does not have a year for this release.

My config file consists of these options for duplicate_keys and is other than that empty:

import:
  duplicate_keys:
    album: albumartist album year

The first import into an empty database completes successfully. Importing the album again does not warn me about the duplicate, does not ask me what to do with old and new files and instead imports as if it is not a duplicate. After several imports my library ends up like this:

$ tree ~/Music/Ketsa/ -d
/home/me/Music/Ketsa/
├── May Starlight Find You
├── May Starlight Find You [2]
└── May Starlight Find You [3]

It seems as if the year field is an issue. If I remove it from duplicate_keys and import the album again I get asked what I want to do with three old albums and one new album.

allesmi avatar Jan 23 '22 09:01 allesmi

One last thing I forgot to mention: it would be great to avoid the duplication in the model classes. At least some of duplicates and construct_match_queries seem like they could be moved to dbcore to be available on all Model subclasses.

sampsyo avatar Jan 29 '22 23:01 sampsyo

Thanks for the review @sampsyo ! I have made the changes you requested.

I'm sorry for being slow, but I'm still having some trouble understanding why tmp_album and tmp_item are necessary. You mentioned above (and in the comments) that this is to make things work with flexible attributes. But I don't quite see why this is necessary for flexible attributes. Couldn't something like duplicates and construct_match_queries work using only information from the Album and Item classes rather from temporary instances? (Again, sorry for missing the insight here!)

No problem, I probably didn't explain well. If you checkout the first commit and run the test case of the second commit with you will see an error related to the flexible attribute. Also chosen_info only returns values contained in the tags of the audio files, but for inline fields I think I understood that the values are computed in Album and Item classes, that's why I needed to create the temporary instances. I haven't dug into it so much so I hope this helps, or let me know if you want to discuss this further.

jcassette avatar Jan 30 '22 16:01 jcassette

Computed (inline) fields are an interesting point! I guess creating a temporary object may be the easiest/lowest-effort way to obtain those, if people want to use them as deduplication keys.

I suppose there's an underlying confusion I have about flexible attributes, though: since we're dealing with newly-imported albums/items, how could they have flexible attributes on them in the first place? I thought we would only be able to distinguish newly-imported albums based on the data either from disk or from the metadata source… maybe one of the metadata sources uses a flexible attribute? But if so, I thought that would be in the chosen_info dict already? I guess I'm missing something about the underlying problem here…

sampsyo avatar Jan 31 '22 13:01 sampsyo

Sorry about the confusion, I think I had been using the term "flexible attributes" to actually mean "inline fieds"... I may have confused because they seem to work similarly for database queries.

since we're dealing with newly-imported albums/items, how could they have flexible attributes on them in the first place?

If you mean user-defined fields, I was not trying to make this to work, only inline fields. But maybe one album/item could already have user-defined fields if it is being reimported? In this case, with the current code these attributes are not retrieved, so I guess they should not be used in duplicate_keys.

I thought we would only be able to distinguish newly-imported albums based on the data either from disk or from the metadata source…

Yes, that's what is implemented. The data of chosen_info is provided by autotag.current_metadata (from disk) or self.match.info (from metadata source), and inline fields that can be derived from that data are provided by the Album/Item classes.

jcassette avatar Jan 31 '22 19:01 jcassette

Got it; thanks for clarifying! Computed fields (e.g., from the inline plugin) seem like an interesting use case for this. I'm convinced, I think, that creating temporary objects is the easiest route to doing this.

I'll look in more detail at the low-level utilities here shortly…

sampsyo avatar Feb 01 '22 15:02 sampsyo

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 12 '22 09:06 stale[bot]

It's still very much on my to-do list to rearrange some of the low-level query manipulation stuff here… 😕

sampsyo avatar Jun 12 '22 15:06 sampsyo

Thanks for your (extreme) patience while I finally got around to giving this PR the attention it deserved. It was 99% there, but I wanted to simplify the low-level query utilities that found their way into the library and dbcore modules. I moved us toward some even simpler, broadly applicable utilities for constructing queries, which I can imagine using many other places throughout the codebase. I think this should also somewhat simplify the logic in the importer module itself.

Please let me know if you have any thoughts! Meanwhile, I'll merge this so it can go out in the upcoming release.

sampsyo avatar Aug 21 '22 17:08 sampsyo