Add tests suite
I thought it would be good to have this draft PR open so that we can discuss bugs that the tests reveal as I find them.
First one is a bug I've been trying to nail dain for a while: @Neurrone what is the reason that the path of the items were used to sort instead of the name? There's a comment that specifies that it needs to be that way but not why. I've made a bunch of test cases to check sorting under different naming schemes and the paths mess up, while the names do not (so far).
If you remember the use cases that necessitate using the path, could you please tell them so I can add them to the tests? It's possible that a more complex sorting algorithm will be required but it needs tests before I can start working on a test.
Also, question: this plugin seems to have a different mechanism of matching tracks than the normal beets algorithm, like the methods for importing them are different. Is this intentional? It's very likely that I'm not understanding the whole process but there are issues that this impacts.
Consider the ordering issue; if beets can normally order the tracks according to the returned album when searching (either on MB or Audible), then that might be the solution to the ordering issue for the tracks, but that doesn't happen, and instead there is the natsort that fails in a lot of cases.
Apologies for the delayed response.
I am trying to remember the reason for this.
I believe it was because I found it unreliable to rely on the track titles and it was usually far more accurate to rely on the file names to natsort them to get the correct order of the tracks. Otherwise, I would have stuck to using the track titles, per default Beets behaviour.
If the ordering of the files on disks that Beets detects doesn't match the data source from Audible, then Beets would try to assign the wrong files to the wrong chapter titles from Audible.
So I did some more testing and reimporting a bunch of books into my library to find those edge cases. Natural sorting works well for a lot of cases but not when bytes are used and not when there are different naming schemes (e.g. Chapter 1-10 and then a chapter named Prologue; the prologue will be placed at the end because P comes after C).
I'm not sure if there's any good way to sort this with a one-size fits all approach given the sheer breadth of possible track namings. I would like to try playing around with the track matching and distance algorithms, with your permission, to try and get that working correctly instead of this system with natsorting. There are a couple reasons for this:
- I have had issues where the track sorting still fails and it renames chapters inappropriately, so this will have to be addressed anyway at some point
- The Audible response is the master record of the chapters, and known to be correct. If we can match to that then we should; if we can't then natural sorting or a more complex algorithm can be a fallback.
- This is somewhat at odds with the rest of beets and their methodology so intrinsically this method will not benefit from any further improvements that beets makes to their core functions.
It's up to you if you'd like to take this approach but right now, the tests I've got don't seem to have an immediate answer to get them working. They fail with the old method and they fail with a natural sort on the titles too.
I'd be happy to use a better solution.
I used the Natsort method on paths because it seemed to be the easiest way for it to work for most cases. Part of the trade-off was accepting it would never work all the time.
Ignore all the commits here, I need to upload them to github to test them on my system, I'm working out the kinks with a matching method, with some promising results! See how it goes.
It is finally complete! I have created a series of algorithms that allow for most books to be matched and which also allows for extension in the future. It is completely configurable and comes with tests that cover the entire chapter matching process along with a couple of other tests that cover the things that are likely to fail.
What do you think @Neurrone?
Thanks for working on this. Given this is a large change, I'll probably only be able to take a look at this in a week or two.
Sure no rush.
I've finished looking at the implementation but haven't gone through the tests in detail yet.
Really appreciate the concept of a list of strategies that you introduced, since it does simplify the overall logic and allows customization.
My main concern is the complexity of some of the algorithms relative to how the matching was done previously. The previous implementation used the file input if the number of tracks didn't match the album info from Audible. I suspect that trying to perform matches in these cases (e.g, chapter_levenshtein) is hard to debug and may not work properly.
What do you think of restoring that logic and having a simpler default list? So something like:
- source_numbering
- Instead of having a special case for matching on a single file, replace it with a strategy that replaces the info from Audible with the input list if the number of chapters don't match
- natural_sort
Complexity is required for complex cases. The simple fact is that the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.
The previous implementation used the file input if the number of tracks didn't match the album info from Audible.
To address your point, using the file input doesn't always work. It doesn't work when you don't have source numbering, which is a lot of audiobooks in my experience. Even if the number of tracks do match, that's not actually a guarantee that the actual tracks match. Consider the difference between a book with 20 chapters on audible, and a the actual audiobook which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the audible method is objectively wrong and won't match the actual book.
The goal of the many different measures is that each person can change the ordering based on what audiobooks they're importing at the time. If you're importing a book that you know Levenshtein will work well at, you can make it the primary method that the plugin uses.
Instead of having a special case for matching on a single file, replace it with a strategy that replaces the info from Audible with the input list if the number of chapters don't match
This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored (they're considered in the natural sort) but you don't know which matches to what track from Audible. You don't know if any of them match at all. What the program does is calculate all of the distances from every track in the files to every track from Audible. Then, it matches the closest match with the file and overwrites the title. This isn't actually guaranteed to be a close match. If there's a single letter in common, then the closest match will be that, but it's not guaranteed to be accurate or even relevant.
I suspect that trying to perform matches in these cases (e.g, chapter_levenshtein) is hard to debug and may not work properly.
Your concern over the Levenshtein is a little strange to me. This is the method that was used before. The Levenshtein function is what was used before when you called the track distance function. I just made it explicit so that we could improve on it. It's the last in the list because it's the least effective and most prone to errors. That is the old method.
the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.
I understand, I'm just trying to get a feel for what examples in the wild would work better with these algorithms. I'll admit the inputs that I have are generally quite simple - either they match on a chapter to chapter basis, or they don't.
Even if the number of tracks do match, that's not actually a guarantee that the actual tracks match. Consider the difference between a book with 20 chapters on audible, and a the actual audiobook which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the audible method is objectively wrong and won't match the actual book.
This should already be handled currently. If the number of tracks matches, the previous method wouldn't override the match from Audible and it falls back to the default matching, so you'd see that the matches fail if the lengths are off for example.
This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored
What I think is missing here is the old behaviour where if the folder has 2 files while Audible returns 20 chapters, I want to have the ability to completely ignore the contents from Audible, since I know its not split by chapter. It is meant to prevent the poor results from falling back to the default Levenshtein function.
Your concern over the Levenshtein is a little strange to me. This is the method that was used before. The Levenshtein function is what was used before when you called the track distance function.
I didn't realize that, I'll take a look at the Beets source to understand this better.
The tests that I've included are real world examples. I got a bunch of different audiobooks (hundreds) and started importing them. When I ran into issues, I added a test for the condition and worked to make sure that they passed correctly. That's why the tests try the correct ordering, then reversing, then randomisation, to make sure that they come up with the correct answer every time.
This should already be handled currently. If the number of tracks matches, the previous method wouldn't override the match from Audible and it falls back to the default matching, so you'd see that the matches fail if the lengths are off for example.
I don't think it did this? The old method was to match if the number of tracks were the same, regardless of length. That wasn't part of the calculation.
What I think is missing here is the old behaviour where if the folder has 2 files while Audible returns 20 chapters, I want to have the ability to completely ignore the contents from Audible, since I know its not split by chapter. It is meant to prevent the poor results from falling back to the default Levenshtein function.
You can! This would be the natural sort, source numbering, or chapter numbering algorithm. The order in which the algorithms are listed in the configuration is the order in which they're tried. If you have two files that are numbered Part 1 Part 2 then the natsort function will work, as will the 'starting number'. Actually, my code mostly takes only the metadata for the album as a whole i.e. the data taken from Audible is rarely, if ever, the track data, because that's really hard to match. Instead it's all the album level data, like author, ID, narrator, etc.
The levenshtein algorithm is only called, with the default configuration I've provided if the following conditions are met:
- All of the chapter titles are very different (have a Levenshtein difference more than 4)
- There is no contiguous source matching
- There are no common affixes to the titles that, when removed, result in a number e.g. 'Chapter 1', 'Chapter 2', etc
The Levenshtein option, which is basically the track distance, is the last option that is called because it is the most prone to errors. But again, that is the method that the beets library itself uses when you call the track_distance function.
The reason I made it explicit is because, when using the Levenshtein, there are certain replacements that lead to greater errors, particularly numbers. If the Levenshtein function is replacing a 2 with a 4 for example, that's almost certainly a bigger error than a space to a dash or whatever. My idea was to create a custom Levenshtein function that penalises number replacements more, which would make it more accurate imo, but I haven't actually done that yet because it's already a method of last resort.
I do apologise, I'm quite busy in my semester and it's hard finding the spare time to come back to this. I'm trying to knock some things off.
With regards to the source_weight option, which you mentioned earlier, do you know what it's for? Because looking through the code, I get the impression that there's no reason to have this exposed to the user at all, based on how I think you intend for this plugin to be used.
By that, I mean that the option is meant to rank different sources compared to Musicbrainz. Having it be zero means that it is compared equally to Musicbrainz. But the thing is, there will NEVER be an audiobook match from Musicbrainz; for audiobooks, it should always be zero.
If this plugin is used for a separate library (meaning that a beets instance with this plugin will never be used for music), then there's never a reason for this to be anything other than zero. In that case, I think it should be removed from the configuration and we should permanently set it to zero in the code.
Do you have a reason for it to be exposed to the user through the configuration? And do you expect for this to be used in the same library as music?
Np.
Agree that source weight shouldn't be exposed, since the goal is to ignore Musicbrainz-
Great, I removed that from the example config in the README. Do you have any other concerns about the new code?