beets
Formatting function to ASCIIfy punctuation only
I've been using beets for a couple of years now and I love it. There's a minor annoyance for me that I've noticed since the beginning and have more or less ignored, but I thought I'd finally ask if there's anything I can do about it. Apologies if I've missed an existing solution in the config guide or setup.
Problem
When beets imported my library it Unicode-ified a lot of previously plaintext ASCII tags & filenames. For example, "El-P - I'll Sleep When You're Dead" becomes "El‐P - I’ll Sleep When You’re Dead" (in both files & tags.)
These look almost the same, but the punctuation is Unicode-ified:
- The dash which was U+002D (ASCII "hyphen-minus") is now U+2010 ("hyphen")
- The apostrophe U+0027 is now a right single quotation mark U+2019.
This isn't beets' doing: if I download the JSON results for the MusicBrainz link, the same Unicode characters are used there.
Similar things apply to other punctuation marks; this is just a good example as it has two of them. :)
The annoyance is:
- Not all players can render UTF-8 tags properly (Kodi on Android seems to struggle, seems related to #1893)
- Some (most?) players will not return results containing the different glyph in the tag if you type a simple punctuation character in the search field, i.e. typing a hyphen-minus on the keyboard will not match a hyphen. (I use quodlibet and it treats these as different.)
- I use Linux and the command line renders the UTF-8 characters fine, but I have the same "gotcha" when I go to type the glyphs.
- MusicBrainz doesn't seem to be entirely consistent in how it applies these. For example, I have some tags "El-P" and some tags "El‐P" (one with the ASCII hyphen-minus, one with the Unicode hyphen).
I know that I can fix this for files by enabling "asciify", and it looks like this was dealt with for the Lyrics plugin in #270. However as well as Latin-character albums I also have a bunch with names in non-Latin script, so I actually want Unicode for things which I can't effectively represent in ASCII.
I guess my dream feature would be a "sanitise punctuation" option where these almost-the-same-as-an-ASCII-character punctuation glyphs get swapped for their ASCII versions in both tags and filenames, but anything else gets left as UTF-8.
I understand that this is a lot more to do with the design of Unicode than the design of beets (and that some people actually care about the distinction between hyphen-minus and hyphen, I just don't care in this case!)
I'd be happy to look into writing a patch for a feature like the above, if that's potentially acceptable. The approach discussed in #270 for lyrics (ie find-replace) seems applicable.
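To make the "sanitise punctuation" idea concrete, here is a minimal sketch of the find-replace approach using a fixed translation table. The function name `sanitise_punctuation` and the choice of characters in the table are assumptions for illustration; nothing here is part of beets.

```python
# Hypothetical table: Unicode punctuation lookalikes mapped to ASCII.
# Which characters belong here is an assumption, not something beets defines.
PUNCT_TO_ASCII = {
    "\u2010": "-",    # hyphen
    "\u2011": "-",    # non-breaking hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}

_TABLE = str.maketrans(PUNCT_TO_ASCII)


def sanitise_punctuation(text):
    """Replace only the listed punctuation; every other character is kept."""
    return text.translate(_TABLE)


# Prints: El-P - I'll Sleep When You're Dead
print(sanitise_punctuation("El\u2010P - I\u2019ll Sleep When You\u2019re Dead"))
```

Because the table lists only punctuation, names in non-Latin script pass through unchanged, which is the difference from a blanket "asciify".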
Setup
- OS: Linux
- Python version: 3.6.1
- beets version: 1.4.3
- Turning off plugins made problem go away (yes/no):
Hi! Thanks for the discussion—this is a fairly frequent question, but it's not usually as clearly elaborated as it is here.
It sounds like there are two separate issues:
- Just ASCIIfying a pre-defined set of punctuation, like “ to ". You might imagine defining a cousin to %asciify{} called %asciify_punct{} or something.
- Applying these changes to tags, not just files. This is more or less the domain of the longstanding request in #488 for a way to apply our powerful templating system to actually modify metadata, including doing that automatically on import.
Does that sound like an accurate synopsis?
As a stopgap, you may be interested in the "replace" section of config.yaml. It works solely on paths and not tags. The backslash may not be needed; I adapted this from my own config, which uses many odd escape characters.
replace:
    '[\‐]': -
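For readers unfamiliar with how such rules behave, here is a rough sketch of regex-based replacement applied to a single path component. `RULES` and `apply_rules` are hypothetical names, and this is not the code beets itself runs; it only illustrates the pattern-to-replacement idea behind the "replace" section.

```python
import re

# Hypothetical rules in the spirit of the "replace" section: each regular
# expression is paired with the text that should replace its matches.
RULES = [
    (re.compile("[\u2010-\u2015]"), "-"),  # Unicode hyphens and dashes
    (re.compile("[\u2018\u2019]"), "'"),   # curly single quotes
    (re.compile("[\u201c\u201d]"), '"'),   # curly double quotes
]


def apply_rules(component, rules=RULES):
    """Apply every rule, in order, to one path component."""
    for pattern, replacement in rules:
        component = pattern.sub(replacement, component)
    return component


# Prints: El-P
print(apply_rules("El\u2010P"))
```

On the backslash question: in Python's `re` syntax a backslash in front of a character that is not an ASCII letter or digit simply matches that character, so the escaped and unescaped forms of the hyphen pattern should behave the same.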
Sampsyo's summary is great. The first point looks like the way to go, especially with asciify_punct. I'm not a beets contributor / maintainer, so my opinion isn't as important as the people who dig into the code and make it work. Then in the long term, #488 would also be awesome, but if it were easy it probably would be done already.
Hi @sampsyo & @RollingStar ,
Thanks for the great synopsis @sampsyo and the suggestion @RollingStar .
I think the synopsis is accurate, inasmuch as those two changes would solve this for me perfectly. I hadn't seen #488, thanks for the heads-up.
Cool. I'm marking this as a feature request for the first part: a version of "asciify" that only affects punctuation.
Any news for this? I'd like to see it affecting tags as well, as Last.FM seems to not auto-correct U+2019 to U+0027 and vice-versa.
Apologies for bumping this issue, but it would really be great to have this working as the previous comment suggests.
Thanks for the great tool!
@imiric I have the same desire, and have a hacky fix that works for my purposes. I have a local version of the beets repo that I have patched with these changes:
--- a/beets/autotag/__init__.py
+++ b/beets/autotag/__init__.py
@@ -26,6 +26,9 @@ from .hooks import AlbumInfo, TrackInfo, AlbumMatch, TrackMatch # noqa
 from .match import tag_item, tag_album, Proposal # noqa
 from .match import Recommendation # noqa
+from unidecode import unidecode
+
 # Global logger.
 log = logging.getLogger('beets')
@@ -35,10 +38,12 @@ log = logging.getLogger('beets')
 def apply_item_metadata(item, track_info):
     """Set an item's metadata from its matched TrackInfo object.
     """
-    item.artist = track_info.artist
+    item.artist = unidecode(track_info.artist)
     item.artist_sort = track_info.artist_sort
     item.artist_credit = track_info.artist_credit
-    item.title = track_info.title
+    item.title = unidecode(track_info.title)
     item.mb_trackid = track_info.track_id
     if track_info.artist_id:
         item.mb_artistid = track_info.artist_id
@@ -62,14 +67,16 @@ def apply_metadata(album_info, mapping):
     """Set the items' metadata to match an AlbumInfo object using a
     mapping from Items to TrackInfo objects.
     """
     for item, track_info in mapping.items():
         # Album, artist, track count.
         if track_info.artist:
-            item.artist = track_info.artist
+            item.artist = unidecode(track_info.artist)
         else:
-            item.artist = album_info.artist
-        item.albumartist = album_info.artist
-        item.album = album_info.album
+            item.artist = unidecode(album_info.artist)
+        item.albumartist = unidecode(album_info.artist)
+        item.album = unidecode(album_info.album)
         # Artist sort and credit names.
         item.artist_sort = track_info.artist_sort or album_info.artist_sort
@@ -102,7 +109,7 @@ def apply_metadata(album_info, mapping):
                     item[suffix] = value
         # Title.
-        item.title = track_info.title
+        item.title = unidecode(track_info.title)
This ensures things like dashes, quotes, etc. are simplified to ASCII.
The post above was a great starting point for me. My copy is calling a little utility function to only decode the punctuation:
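from unidecode import unidecode

def pundecode(text):
    """Transliterate only the non-alphabetic characters of text to ASCII."""
    result = u""
    for character in text:
        if character.isalpha():
            # Letters (including non-Latin script) are kept untouched.
            result += character
        else:
            # Punctuation, digits, spaces and marks go through unidecode.
            result += unidecode(character)
    return result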
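One caveat with testing `isalpha()`: it is false not only for punctuation but also for combining marks, so scripts such as Devanagari (or decomposed accents) would have some of their characters sent through `unidecode`. A possible refinement, sketched here under assumptions, is to check the Unicode general category instead. `punctuation_only` and `demo_transliterate` are hypothetical names, and the stand-in transliterator exists only so the example runs without third-party packages.

```python
import unicodedata


def punctuation_only(text, transliterate):
    """Pass only non-ASCII punctuation characters to `transliterate`."""
    return "".join(
        transliterate(ch)
        if ord(ch) > 127 and unicodedata.category(ch).startswith("P")
        else ch
        for ch in text
    )


def demo_transliterate(ch):
    """Stand-in for a real transliterator; unknown characters become '?'."""
    return {"\u2010": "-", "\u2019": "'"}.get(ch, "?")


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # contains vowel signs and a virama
print(punctuation_only(hindi, demo_transliterate) == hindi)
print(punctuation_only("El\u2010P - I\u2019ll", demo_transliterate))
```

With the real library installed, `punctuation_only(text, unidecode)` would be the equivalent call. Note that this check skips symbol categories (for example spacing accents), so whether "P" alone is the right cut-off is a judgment call.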
I have made a plugin to perform regex replacements on any fields you specify during import: beets-importreplace. The config in the README replaces apostrophes, quotes, hyphens and dashes with their ASCII counterparts. Try it out, let me know if you encounter any issues! :)
Hi, new user here. I found this bug after inquiring on IRC about this same issue. They suggested a couple settings and so I added this to my configuration:
asciify_paths: yes
import:
    languages: jp uk de pt jp fr it en
Unfortunately that seems to do nothing at all. I tried other combinations of languages but the result won't change. Ideally I'd like beets to behave like Picard, which I think is very similar to what the OP described. While that gets implemented, is there anything else I can try in my configuration file?