radiobrowser-api
Synonym et al. Tags
Radio-Browser's DB is full of near-duplicate tags. Most tagging systems have this problem, e.g. a big part of what Stack Overflow's community does is clean up their tags, and Wikipedia has janitors who write disambiguation pages.
Radio-Browser's only current strategy seems to be suggesting tags that are superstrings of what you are typing, but this doesn't handle:
- typos
- synonyms
- translations
Here are some obvious issues in the current dataset:
- "dj sets", "dj mixes", "dj"
- "tech house", "tech-house", "#tech house"
- "midtempo", "mid-tempo"
- "electronic" and "electronica" and "electro"
- "weather" vs "local weather"
- "radio communitaire" (french) vs "community radio" (english)
- "radio universitaria" (spanish) vs "university radio" (english) vs "universitaire" (french)
- "oldskool" vs "oldschool"
- "podcast" vs "podcasts"
- "medieval" (english) vs "mittelalter" (german)
"Electronic" is a particularly broad tag; it could cover everything from rock to heavy metal to rave, but I think mostly people use it to cover rave music. "Electronica" is confused. I've seen "Electro" used for dance music but also for rock and fusion genres (e.g. https://en.wikipedia.org/wiki/Justice_(band)).
Combination tags are an issue too: should we prefer "edm podcast" or "edm" + "podcast"?
Here are some strategies to reduce this; let's collect others and implement some of them:
- On input the GUI should:
  - use a spell-checker to find near-spellings, not just sub-strings.
  - sort tags by popularity and censor the weakest tags; if those tags are entered anyway, issue a warning before accepting them.
  - normalize the use of hashtags and dashes (i.e. enforce no leading #s, force all dashes and underscores to spaces); see the sketch after this list.
- Setting up "implication" tags: "hardstyle" should imply "electronic". Or maybe only suggest it?
- Setting up "synonym" tags, with a canonical tag that everything in the set gets rewritten to
- Setting up localisation tags, with the tag displayed in your locale's language (if it exists) (deepl.com / google translate's / https://www.apertium.org's APIs could help here?) but internally stored as an abstract tag ID?
- Disambiguation tags, for when there are genuinely two separate meanings for the same term
- Scan regularly for tags with few entries and put them up for merging/synonyming/translation; the current list of tags has many with only 1 entry, or tags that are translations, or tags that are mis- or alternate spellings, all of which could be dealt with efficiently like this.
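As a rough sketch of what the normalization part could look like (the synonym table here is invented for illustration, and the real rules would need discussion):

```python
# Sketch only: canonicalize a tag before storing it.
# The SYNONYMS table is a made-up example; a real one would be curated.
SYNONYMS = {
    "oldskool": "oldschool",
    "podcasts": "podcast",
    "dj mixes": "dj sets",
}

def normalize_tag(raw):
    tag = raw.strip().lower()
    tag = tag.lstrip("#")                           # no leading hashtags
    tag = tag.replace("-", " ").replace("_", " ")   # dashes/underscores -> spaces
    tag = " ".join(tag.split())                     # collapse whitespace
    return SYNONYMS.get(tag, tag)                   # rewrite to the canonical form

assert normalize_tag("#Tech House") == "tech house"
assert normalize_tag("mid-tempo") == "mid tempo"
assert normalize_tag("Podcasts") == "podcast"
```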
Another big problem is that a lot of people have tagged a lot of stations with
- their city name e.g. http://www.radio-browser.info/gui/#!/bytag/bogota or http://www.radio-browser.info/gui/#!/bytag/lake%20havasu%20city or http://www.radio-browser.info/gui/#!/bytag/beijing or http://www.radio-browser.info/gui/#!/bytag/brisbane or http://www.radio-browser.info/gui/#!/bytag/martha's%20vineyard or http://www.radio-browser.info/gui/#!/bytag/ganonoque
- some people ignored(?) the Country field entirely and just tagged the country, e.g. http://www.radio-browser.info/gui/#!/bytag/iran
- the name of the broadcaster e.g. http://www.radio-browser.info/gui/#!/bytag/bbc, http://www.radio-browser.info/gui/#!/bytag/cbc, http://www.radio-browser.info/gui/#!/bytag/american%20forces%20network, http://www.radio-browser.info/gui/#!/bytag/europa%20fm
- the language e.g. http://www.radio-browser.info/gui/#!/bytag/english%20language, http://www.radio-browser.info/gui/#!/bytag/greek%20programming
The city tags should be migrated to a new optional "City" field in the database, since there is obviously demand for them; the language and country tags should be verified against the proper fields and then deleted; and the broadcaster names should be moved to prefixes on the channel names (the BBC and American Forces Network stations already have this), if they aren't already.
And can this tag please get deleted and banned? http://www.radio-browser.info/gui/#!/bytag/mp3
To keep these messes from happening again, maybe Tags should be explicitly renamed "Genre", or a hardcoded list of city names, languages, and locations (things that should go in a different field) should be used to filter any input to the tags.
Thank you for documenting this! The only issue is that since anybody can create entries through the API, not just through the frontend, the normalization would have to be done on the backend.
We'd need an algorithm for parsing the tags and generalizing to common words in the database, removing unnecessary ones, etc.
Perhaps tags are too generic. Perhaps people need more of a guideline for normalization. Like you suggested, instead of just "tags", we have "city", or perhaps "genre", "station callsign", etc.
Right now tags are a free-for-all, so it's hard to stay normalized. If we knew what to expect from the input, we could normalize it better.
"Language" is another one. You've got "English, en, englich, eng, etc."
We'd need to comb through the current database and correct all the inconsistencies.
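Something like this is the kind of mapping I mean for the Language field (the alias table is just an example, not a real proposal):

```python
# Example only: collapse free-form language strings to one canonical spelling.
LANGUAGE_ALIASES = {
    "en": "english",
    "eng": "english",
    "englich": "english",        # misspelling seen in the data
    "english language": "english",
}

def canonical_language(raw):
    key = raw.strip().lower()
    # unknown values pass through unchanged so they can be reviewed by hand
    return LANGUAGE_ALIASES.get(key, key)
```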
My attitude towards data messiness is to manage it instead of rejecting it outright, for example by modifying the UI to discourage divergence: https://github.com/segler-alex/RadioDroid/issues/461. But some amount of messiness is necessary to allow for growth! If no new, distinct tag could ever be added, there would be no way for genres to grow, be invented, or fuse over time.
We could probably dump the current db (or maybe @segler-alex would be willing to give out a mysqldump?) and write experimental data cleaners that find outlier tags, and propose spelling corrections and other merges.
I'm sure there is academic research about this we could dig up for tips if we get stuck.
@kousu you can download the database at any time using the link on the repo's readme
http://www.radio-browser.info/backups/latest.sql.gz
The database entry for each tag records the number of stations using that tag, so it would be easy to find the outliers and normalize them. Possibly send an email to someone if a tag is created that isn't already in the database's list of tags.
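For instance, once the dump is imported into SQLite, something along these lines could list the rarely-used tags for review (a sketch; it assumes the TagCache(TagName, StationCount) table that shows up later in this thread):

```python
# Sketch: list tags used by very few stations, assuming the dump has been
# imported into an SQLite file with a TagCache(TagName, StationCount) table.
import sqlite3

conn = sqlite3.connect("latest.sqlite")   # path is an assumption
for name, count in conn.execute(
    "SELECT TagName, StationCount FROM TagCache "
    "WHERE StationCount <= 2 ORDER BY StationCount, TagName"
):
    print(f"{count}\t{name}")
conn.close()
```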
Oh nice, thanks for the pointer @kyjus25!
I took some time to investigate the dump and wrote this script. For a first cleanup pass, it filters for locations misrecorded as tags. It uses the UN's place name database (thanks UN!!):
#!/bin/bash
# analyse.sh
# -----------
# depends:
# - iconv
# - sqlite3
# - mysql2sqlite (from https://github.com/dumblob/mysql2sqlite) -- clone it into a subfolder of this folder
# - gzip
# - curl
# - plus the standard unix tools
# TODO: rewrite as a makefile so that it doesn't redo work?
# set strict mode
set -e -o pipefail -u
#set -x # DEBUG
APP_DIR=$(cd "$(dirname "$0")"; pwd);
export PATH=$PATH:"$APP_DIR"/mysql2sqlite
# 1) get the source db
curl -OL -sS https://www.radio-browser.info/backups/latest.sql.gz
# 2) import it
gunzip -c latest.sql.gz | mysql2sqlite - | sqlite3 latest.sqlite
# DEBUG: show the list of tags to demonstrate that the import worked
#sqlite3 latest.sqlite 'select TagName from TagCache'
# 3) get a list of every town/city/province/country in the world from the UN's LOCODE database: http://www.unece.org/cefact/locode/welcome.html (thanks UN!)
curl -OL -sS http://www.unece.org/fileadmin/DAM/cefact/locode/loc182csv.zip
unzip -u loc182csv.zip
# 4) import it
sqlite3 latest.sqlite '
-- schema manually ported from
-- http://www.unece.org/fileadmin/DAM/cefact/locode/unlocode_manual.pdf
-- but cross-referenced by looking at examples
-- eg ,"AD","ESC","Escaldes-Engordany","Escaldes-Engordany",,"--3-----","RL","0307",,"4231N 00133E",
DROP TABLE IF EXISTS locations;
CREATE TABLE locations (
-- "if the entry has been modified in any way or it has been marked for deletion"
change varchar(1),
country char(2),
place char(3),
name varchar(100),
name_ascii varchar(100),
subdivision varchar(100),
-- this is actually a tag list but the UN didn't encode it in a SQL-friendly way
function varchar(10),
status char(2),
-- this should be a datetime! does sqlite have datetimes?
date varchar(32),
IATA char(10),
latlon varchar(30),
PRIMARY KEY (country, place)
);
'
# BUGS:
# - some rows, mostly the top-level countries, have an extra unnecessary comma and sqlite complains; can we.... reformat the data? set some pragma to avoid that?
# - some rows fail uniqueness; I think this is because the DB contains occasional name translations when a place has two or more well-used names (see 3.3.6 in http://www.unece.org/fileadmin/DAM/cefact/locode/unlocode_manual.pdf)
ls *CodeListPart*.csv | # the UN database comes chunked across separate .csv files
while read LOCODE; do
#ln -sf "$LOCODE" import.csv
# I don't know how to safely quote filenames with spaces in sqlite
# so instead use shell to alias it temporarily as "import.csv"
# ...also the file is in iso8859-1 (i.e. 8-bit pseudo-ascii)
# but sqlite defaults to utf-8, which is better.
# so use iconv to translate.
#
# sed adds a bunch of ","s then cut ensures we only have exactly 11 columns, to match the 11 columns in the schema
# (some rows, seemingly all the top-level countries, e.g. ","CA",,".CANADA",,,,,,,,", have one comma too many; others, like ","US","CF3","Clifton, Mesa","Clifton, Mesa","CO","-23-----","RL","0607"", have too few)
cat "$LOCODE" | iconv -f iso8859-1 -t utf8 | sed 's/$/,,,,,,,,,/' | cut -f 1-11 -d ',' > import.csv
sqlite3 -csv latest.sqlite '.import import.csv locations'
done
# count how many locations there were versus were actually successfully imported
C_CSV=$(cat *CodeListPart*.csv | wc -l)
C_SQL=$(sqlite3 latest.sqlite 'select count(*) from locations')
echo "Loaded $C_SQL locations out of $C_CSV"
# find tags that are place names
sqlite3 -csv latest.sqlite '
select
StationCount as count,
TagName as tag
from
TagCache
join
locations
on
TagName = name_ascii
collate nocase
-- this group by is because there are multiple places, mostly towns (e.g. "York") with the same name around the world
-- and the join multiplies the single tag by the number of towns with that name
-- so the group by undoes that
group by tag
order by count desc;
' | tee place_tags.csv
It found about 2000 tags that are suspect, which is fully 30% of the entire tag set. 20% of the entire set are single-occurrence location tags, and those I think could be dropped or merged to a City tag summarily without looking back. The rest needs some hand-filtering, though, since apparently there are cities named "rock", "eclectic", "funk", "black", "opera" and "christmas".
The UN database includes labels to distinguish provinces, states, cities, villages, and countries; I didn't make use of them here, but they could be used to sort out what should go into the potential City field.
StationCount,TagName
1041,rock
117,eclectic
111,funk
78,waynesboro
75,toronto
69,beijing
51,chicago
42,"los angeles"
39,enka
36,milano
33,montreal
33,roma
30,black
29,americana
29,jilin
27,portland
27,rai
26,boston
26,charlotte
25,berlin
25,denver
25,vancouver
24,minneapolis
24,prague
24,winnipeg
23,london
23,nashville
22,springfield
21,opera
21,pacifica
20,"san francisco"
20,soca
19,ottawa
19,phoenix
18,columbus
18,edmonton
18,orlando
18,rosario
17,christmas
17,huntsville
...
1,"west chester"
1,"west hartford"
1,"west haven"
1,"west long branch"
1,"west middlesex"
1,"west midlands"
1,westerly
1,westerville
1,westlock
1,westminster
1,weston
1,westport
1,wetaskiwin
1,"wheat ridge"
1,whistler
1,"white plains"
1,"white river"
1,whitesburg
1,whiteville
1,wickenburg
1,wiesbaden
1,"williams lake"
1,willits
1,"willow springs"
1,"windsor locks"
1,winnebago
1,winona
1,"winter harbor"
1,"winter park"
1,winters
1,winton
1,wolfforth
1,woodburn
1,worthing
1,wynne
1,wyoming
1,yakima
1,yarmouth
1,yogyakarta
1,yorkshire
1,ypsilanti
1,yukon
1,zaanstad
1,zanesville
1,zionsville
Full output here.
@kousu THIS is why I love open source things. Very well done! Thanks for putting in the time to do that!
To jump-start synonym sets, work from WordNet. That won't help with translations, but we can start with English and work outwards.
To find misspellings, use Levenshtein distance. Python has something close hidden in its stdlib (difflib), including the better-normalized ratio() difference.
(but I kind of like using this AWK version of Levenshtein for the challenge of it, and to avoid pulling in Python dependencies: https://rosettacode.org/wiki/Levenshtein_distance#AWK)
There's also fuzzywuzzy, which adds a few other strategies for handling multi-word tokens, though I doubt that will be much of an issue here.
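A stdlib-only sketch of flagging near-spellings (difflib's ratio() is Ratcliff/Obershelp rather than true Levenshtein, but it's close enough to surface candidates; the threshold is a guess):

```python
# Sketch: flag pairs of tags that look like alternate spellings of each other.
from difflib import SequenceMatcher
from itertools import combinations

tags = ["oldschool", "oldskool", "midtempo", "mid tempo", "podcast", "podcasts", "jazz"]

for a, b in combinations(tags, 2):
    ratio = SequenceMatcher(None, a, b).ratio()   # 1.0 means identical
    if ratio >= 0.8:                              # threshold would need tuning on real data
        print(f"possible duplicates: {a!r} ~ {b!r} ({ratio:.2f})")
```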
Perhaps it is enough to convert the findings from your methods above into a normalization based on the main headings of this list of music styles:
https://en.wikipedia.org/wiki/List_of_music_styles
- African
- Arabic music
- Asian
- East Asian
- South and southeast Asian
- Avant-garde
- Blues
- Caribbean and Caribbean-influenced
- Comedy
- Country
- Easy listening
- Electronic music
- Folk
- Hip hop
- Jazz
- Latin
- Pop
- R&B and soul
- Rock
- Classical music
- Other
We lock down future tags to be one of these, and anything that doesn't fall under genre would go in a separate field entirely.
Or, even better, we allow sub-genres to be tagged, but the tags work hierarchically. For instance, if I have two stations, one tagged "Hard Rock" and the other tagged "Indie Rock", and I search for the parent tag "Rock", I'll receive both stations. This way, if I want "Country", for instance, the set of relevant stations I receive will be much wider.
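A tiny sketch of that parent/child lookup (the hierarchy here is invented just to show the idea):

```python
# Sketch: expand a searched tag into itself plus all of its descendants.
# The CHILDREN map is invented for illustration; a real one would be curated.
CHILDREN = {
    "rock": ["hard rock", "indie rock"],
    "electronic": ["tech house", "hardstyle"],
}

def expand(tag):
    tags = {tag}
    for child in CHILDREN.get(tag, []):
        tags |= expand(child)
    return tags

# Searching "rock" would then also match stations tagged "hard rock" or "indie rock":
print(expand("rock"))   # {'rock', 'hard rock', 'indie rock'} (in some order)
```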
Personally I like free-form tagging for the expressivity. It ends up in a mess, but in the human world there are too many things that don't fit neatly into categories, and that's even more true in the creative world. Plus, lots of stations are not exclusively music, or exclusively one kind of music.
I favour using statistics to tame the mess. You could find clusters of related tags and use them as a kind of soft hierarchy: "Indie Rock" implies "Rock", so searching for Indie Rock brings that up first, then Rock, then other kinds of rock and pop in a long list, much the way Google shows you the most related documents first and you almost never look past them, but you could if you wanted to. This could probably all be implemented in the API server, so long as the clients just ask the API server for results and display them in order.
I was interested in this a few years ago but never got much further than experimenting with lift, which is meant to measure how likely it is that tag X implies tag Y. I've been told by some people that lift is kind of passé, especially because it's weirdly asymmetric, but there are other metrics to explore, like https://en.wikipedia.org/wiki/Jaccard_index or just the straight-up correlation (which, if you treat a tag as a binary random variable, comes out to a formula related to lift).
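For concreteness, both lift and the Jaccard index fall out of simple co-occurrence counts; a sketch with made-up numbers:

```python
# Sketch: lift and Jaccard index between two tags from co-occurrence counts.
# n_x, n_y = stations carrying tag X / tag Y, n_xy = stations carrying both,
# total = all stations. The numbers below are made up.
def lift(n_x, n_y, n_xy, total):
    # P(X and Y) / (P(X) * P(Y)); values well above 1 suggest a real association
    return (n_xy / total) / ((n_x / total) * (n_y / total))

def jaccard(n_x, n_y, n_xy):
    return n_xy / (n_x + n_y - n_xy)

print(lift(40, 500, 35, 10000))   # ~17.5: "indie rock" stations are usually "rock" too
print(jaccard(40, 500, 35))       # ~0.07: small, because "rock" is much broader
```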
For me, all of this would be a longer-term subproject, but a cool one to work on!
I just don't want to /force/ a hierarchy on anyone. I want the relationships between tags to emerge fluidly from the data people enter. I might guide people with it: a soft restriction instead of a hard one. If we did install a hard hierarchy I would want it to be community-editable so that it stays soft via the community, but then you need moderators and it becomes a whole complicated thing.
ID3v1 tried to lock down tags to a fixed list of genres, and it just ended up with 99% of the neat new music I found online getting tagged "Blues" (genre 0), because the 80 ID3v1 genres didn't come anywhere close to covering it, so people just didn't bother.
We should definitely hard-ban place names and file formats from tags though.
I just remembered that Icecast inserts "icy-name", "icy-genre" and "icy-url" (= homepage) into the headers it returns. The streamchecker could record these.
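As a quick illustration of what the streamchecker could record (a sketch using the requests library; exact header names vary between servers):

```python
# Sketch: peek at the icy-* response headers of a stream without downloading audio.
import requests

def icy_metadata(stream_url):
    r = requests.get(stream_url, stream=True, timeout=10)  # stream=True: headers only
    try:
        return {k: v for k, v in r.headers.items() if k.lower().startswith("icy-")}
    finally:
        r.close()

# e.g. icy_metadata("http://example.com/stream") might return something like
# {'icy-name': '...', 'icy-genre': '...', 'icy-url': '...'}
```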