Audiobooks.bundle

How to parse copyright year

Open rabelux opened this issue 4 years ago • 6 comments

I'm getting an error when parsing the copyright line of this book: ©Knaus Verlag (P)2002 Mango Studios Köln

The error says AttributeError: 'NoneType' object has no attribute 'group' in line 674 executing helper.date = re.match(".?(\d{4}).*", cstring).group(1)
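
The failure can be reproduced in isolation. This is only a minimal sketch, assuming the pattern is applied to the copyright line quoted above (the original code first cuts the string at (P), which leaves no digits at all, so the outcome is the same). The variable names are illustrative and not taken from the plugin.

```python
import re

# Copyright line from the report (\u00a9 is the copyright sign, \u00f6 is o-umlaut).
cstring = u"\u00a9Knaus Verlag (P)2002 Mango Studios K\u00f6ln"

# Pattern from the failing line: at most one character, then four digits.
pattern = r".?(\d{4}).*"

# re.match only matches at the start of the string, so no match is found here.
match = re.match(pattern, cstring)
print(match)  # None

# Calling .group(1) on None raises the reported AttributeError.
try:
    match.group(1)
except AttributeError as exc:
    print(exc)
```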

I had a look at the code and wanted to write a fix but don't understand the cases you're trying to catch. Maybe we could collect different examples and expected output?

As far as I understand, you're stripping the string down to the part before (P) and extracting the date from that part only. What speaks against matching the first 4-digit part in the whole copyright?

rabelux avatar Sep 19 '21 13:09 rabelux

This code came from unending's fork: https://github.com/Unending/Audiobooks.bundle/commit/85694cbb981193f615f2e173d6043e4b2448c8f3

I didn't personally test it. I can try to help a bit later. The regex is saying something along the lines of "match 4 digits in a row from the given string". regex101 is a great tool to learn more about regexes. Since copyrights only contain years, all it needs to match is those 4 digits.
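
To make concrete what the pattern accepts: with re.match, `.?(\d{4}).*` can only succeed when the four digits start at the first or second character, whereas re.search with `\d{4}` finds them anywhere in the string. A small sketch; the first sample string is made up for illustration, the second is the one from the report.

```python
import re

samples = [
    u"\u00a91952 Example Author",  # hypothetical: year directly after the copyright sign
    u"\u00a9Knaus Verlag (P)2002 Mango Studios K\u00f6ln",  # from the report
]

results = []
for s in samples:
    anchored = re.match(r".?(\d{4}).*", s)  # has to match from the start of s
    anywhere = re.search(r"\d{4}", s)       # may match at any position in s
    results.append((
        anchored.group(1) if anchored else None,
        anywhere.group() if anywhere else None,
    ))

print(results)  # [('1952', '1952'), (None, '2002')]
```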

djdembeck avatar Sep 20 '21 16:09 djdembeck

My code starting in line 658 currently looks like this:

        if cstring:
            if "Public Domain" in cstring:
                helper.date = re.match(".*\(P\)(\d{4})", cstring).group(1)
            else:
                if cstring.startswith(u'\xA9'):
                    cstring = cstring[1:]
                helper.date = re.search(r'\d{4}', cstring).group()
                #if "(P)" in cstring:
                #    cstring = re.match("(.*)\(P\).*", cstring).group(1)
                #if ";" in cstring:
                #    helper.date = str(
                #        min(
                #            [int(i) for i in cstring.split() if i.isdigit()]
                #        )
                #    )
                #else:
                #    helper.date = re.match(".?(\d{4}).*", cstring).group(1)

It matches the first 4 digits it finds after the (c). I see what Unending did there: he tried to prioritize one year over the other, whereas I don't see any reason to do that at this point. I think (P) stands for the sound recording copyright and should be treated as equivalent to (c).
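
The search-based approach can also be sketched as a standalone function that returns None instead of raising when no year is present. The name `first_year` is hypothetical and not part of the plugin. Stripping the leading copyright sign is not needed here, because re.search scans the whole string anyway.

```python
import re

def first_year(cstring):
    """Return the first run of four digits in cstring, or None if there is none."""
    if not cstring:
        return None
    found = re.search(r"\d{4}", cstring)
    # Guard against "no match" so that .group() is never called on None.
    return found.group() if found else None

print(first_year(u"\u00a9Knaus Verlag (P)2002 Mango Studios K\u00f6ln"))  # 2002
print(first_year(u"Public Domain"))  # None
```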

I'm just guessing here so everybody is invited to enlighten me.

rabelux avatar Sep 20 '21 16:09 rabelux

Audible isn't very consistent, but the most common usage I've noticed is that (C) is the original copyright year of the work and (P) is the copyright year of the specific publication. See here for reference: https://www.audible.com/pd/East-of-Eden-Audiobook/B00546SXO0

I personally prioritize the (C) year, as I think that sorting by year or filtering by decade works better when the original copyright year is used, but the (P) year is also important and should be equivalent to the release date. Both dates need to be used, but I don't know of any player that takes advantage of them. Ideally the ID3 tags should be:

ORIGYEAR = (C) year
YEAR = (P) year
RELEASETIME = (P) date
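
A rough sketch of how the two years could be pulled apart and mapped to the tag names mentioned above. The function name `id3_years` and the first sample string are hypothetical. RELEASETIME is left out because a full date cannot be derived from the copyright line alone.

```python
import re

def id3_years(cstring):
    """Map the (C) year to ORIGYEAR and the (P) year to YEAR; missing years are None."""
    # Accept either the copyright sign (\u00a9) or a literal "(C)" before the year.
    c = re.search(r"(?:\u00a9|\(C\))\s*(\d{4})", cstring, re.IGNORECASE)
    p = re.search(r"\(P\)\s*(\d{4})", cstring, re.IGNORECASE)
    return {
        "ORIGYEAR": c.group(1) if c else None,
        "YEAR": p.group(1) if p else None,
    }

# Hypothetical copyright line with both markers.
print(id3_years(u"\u00a91952 Example Author (P)2011 Example Audio"))
# Line from the report: no year follows the copyright sign, so ORIGYEAR is None.
print(id3_years(u"\u00a9Knaus Verlag (P)2002 Mango Studios K\u00f6ln"))
```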

seanap avatar Sep 21 '21 17:09 seanap

Regarding the example you posted: Would you prefer to have the year set to 1952, or 1980?

As we only have one year to set in Plex, I'd suggest simplifying the copyright parsing and doing it in the following order: take the first year that can be found, unless there is a ; in the string, in which case take the first year after the ;.
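
The proposed order could look roughly like this. It is a sketch only: `copyright_year` is a hypothetical name, "after the ;" is taken to mean after the first semicolon, and falling back to the first year in the whole string when nothing follows the semicolon is an added assumption.

```python
import re

def copyright_year(cstring):
    """First 4-digit year, or the first year after the first ';' if one is present."""
    if ";" in cstring:
        # Look only at the text that follows the first semicolon.
        tail = cstring.split(";", 1)[1]
        found = re.search(r"\d{4}", tail)
        if found:
            return found.group()
    # No semicolon, or no year after it (assumed fallback): take the first year overall.
    found = re.search(r"\d{4}", cstring)
    return found.group() if found else None

# Hypothetical line with two copyright holders separated by ';'.
print(copyright_year(u"\u00a91952 First Holder; 1980 Second Holder (P)2011 Example Audio"))  # 1980
# Line from the report.
print(copyright_year(u"\u00a9Knaus Verlag (P)2002 Mango Studios K\u00f6ln"))  # 2002
```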

rabelux avatar Sep 23 '21 14:09 rabelux

For (C) it should be the original year, so 1952. The actual Plex tag is "Release Date", so I think (P) 2011 should be the year/date actually imported into Plex.

seanap avatar Sep 23 '21 15:09 seanap

The part of the code I'm talking about is only called if the preferences are set to "use copyright year instead of date published". So in that case it should be correct to use the first year found - unless the setting has to be renamed or changed to a dropdown list.

rabelux avatar Sep 23 '21 15:09 rabelux