Wikidata/Wikipedia cleaning tasks
I have collected some Wikidata/Wikipedia cleaning tasks.
Before importing new data, some cleaning is needed.
Wikidata is changing and evolving every day, so these tasks should probably run regularly.
Some tasks are very easy, some are very hard.
Wikidata

- [ ] Cleaning wof -> wd ("Wikimedia disambiguation pages") ( ~2635 wof wikidata ids; see the SPARQL sketch after this list )
  - [ ] Find the wof -> wikidata records ( easy task )
  - [ ] Check the probability of the "bad translations"
  - [ ] Clean `wd:id`, `wk:page`, `name:*`
- [ ] Cleaning wof -> strange wikidata records ( instance of "film", "human", "fictional character", ... )
  - [ ] Find
    - probably ~2000 wof records
    - complex query; needs some extra manual checking
  - [ ] Check the probability of the "bad translations", bad names
    - example: https://spelunker.whosonfirst.org/id/102552665/
  - [ ] Clean
- [ ] Find & Update : Wikidata redirected values ( probably none exist yet, but ... )
- [ ] Find & Check : instance of "Wikimedia duplicated page" https://www.wikidata.org/wiki/Q17362920
  - valid, but not a perfect version
- [ ] Find & Analyze : current status of the cebwiki/svwiki-import related wikidata ids
  - see: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_proposed_course_of_action_for_dealing_with_cebwiki/svwiki_geographic_duplicates
  - replace if a better wikidata item exists now
  - ongoing work :(
- [ ] Find & Analyze & Fix : Wikidata duplicates ( #829 )
  - this should be the last task, because a lot of duplicates will be removed by the "Strange wikidata records" task
- [ ] Check & Analyze : wikidata without GPS coordinates ( locality, localadmin, ... )
- [ ] Check & Analyze : wof-wikidata with extreme distance ( locality, localadmin, ... ) > 300 km, > 1000 km
- [ ] Check & Analyze : wof-wikidata with different country codes
- [ ] Check & Analyze : past wikidata updates & surviving "bad translations"
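For the disambiguation "Find" step, a minimal sketch using the public Wikidata SPARQL endpoint ( the Q-ids in the VALUES list are only illustrative examples from the samples below; a real run would batch the ids from the concordances ):

```bash
# check which of a batch of wd:id values are instances (P31) of
# "Wikimedia disambiguation page" (Q4167410); the ids below are examples
curl -sG 'https://query.wikidata.org/sparql' \
  --data-urlencode 'format=json' \
  --data-urlencode 'query=
    SELECT ?item WHERE {
      VALUES ?item { wd:Q1002273 wd:Q1021967 wd:Q1024707 }
      ?item wdt:P31 wd:Q4167410 .
    }'
```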
Wikipedia ( "wk:page" )
- [ ] Find & Fix : bad Wikipedia links ( different from the related Wikidata item ); example: https://spelunker.whosonfirst.org/id/101765489/ ( `"wk:page": "Chiang Rai International Airport"` is incorrect ). Or remove all `wk:page` values, because they have minimal business value and can be recreated via the Wikidata API ( see the sketch below ).
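A minimal sketch of recreating a `wk:page` value from a `wd:id` via the Wikidata API's wbgetentities action ( Q16768887 is just an example id; requires `jq` ):

```bash
# fetch the English Wikipedia sitelink for one wikidata id
curl -s 'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16768887&props=sitelinks&sitefilter=enwiki&format=json' \
  | jq -r '.entities[].sitelinks.enwiki.title // empty'
```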
Comments:
`"bad translations"` == (#976), like 'Door County' vs. "Door" (that is, the entry point to a building)
sample: ("Wikimedia disambiguation")
+-----------+-----------+---------------------+--------------------+------------+
| id | wd_id | wof_name | wd_label | metatable |
+-----------+-----------+---------------------+--------------------+------------+
| 102547857 | Q16768887 | Chiang Rai Airport | Chiang Rai Airport | wof_campus |
| 102552251 | Q4848857 | Bajawa Airport | Bajawa Airport | wof_campus |
| 102557029 | Q7232429 | Portsmouth Airport | Portsmouth Airport | wof_campus |
| 102063611 | Q959414 | Hof | Hof | wof_county |
| 102063625 | Q959414 | Hof | Hof | wof_county |
| 102063739 | Q405583 | Olpe | Olpe | wof_county |
| 102063893 | Q409412 | Borken | Borken | wof_county |
| 102063979 | Q422291 | Verden | Verden | wof_county |
+-----------+-----------+---------------------+--------------------+------------+
sample ( strange wikidata ):

```
+-----------+-----------+----------------+--------------------------+------------+
| id        | wd_id     | wof_name       | wd_label                 | metatable  |
+-----------+-----------+----------------+--------------------------+------------+
| 102552665 | Q4859047  | Borovo Airport | Jat Airways destinations | wof_campus | list
| 102048447 | Q8040109  | Wyndham        | Wyndham Emery            | wof_county | Welsh rugby league player
| 102048731 | Q20712693 | Salisbury      | Salisbury F.C.           | wof_county | football club
| 102048981 | Q7017413  | Newcastle      | Newcastle F.C.           | wof_county |
| 102048985 | Q3500796  | Beverley       | Beverley's               | wof_county | music ?
| 102049095 | Q17023107 | Port Stephens  | Port Stephens Examiner   | wof_county | newspaper
| 102049185 | Q548928   | Charles Sturt  | Charles Sturt            | wof_county | Australian explorer
| 102049195 | Q6911295  | Moreland       | Moreland F.C.            | wof_county |
| 102049387 | Q1130849  | Liverpool      | Liverpool F.C.           | wof_county |
| 102049585 | Q1317902  | Wellington     | Wellington's Victory     | wof_county | symphony
+-----------+-----------+----------------+--------------------------+------------+
```
Hey @ImreSamu - we've done additional work with Wikidata since this issue was filed and I'm planning additional name translation work through https://github.com/whosonfirst-data/whosonfirst-data/issues/1821.
These are all valid issues, though I'm most interested in some of the "Cleaning wof" tasks above, like:
- Check the probability of the "bad translations"
- Cleaning wof -> strange wikidata records ( instance of "film", "human", "fictional characters", ... )
- "bad" translations
Do you happen to have a list of records that you'd recommend we take a look at? Or any tools you've used to find the ~2000 records mentioned above?
@stepps00 : I have created a fresh new list, not perfect, but imho useful: https://gist.github.com/ImreSamu/bba79ab8093af8b4f893b9142f64fe9a

- now 10733 unique wikidata ids ( `wd_id` ) are flagged, and probably ~80% are incorrect matches ...
- I have some JSON decoding problems with the WOF data, so I can't import to Postgres yet :disappointed: .. the new repo is a big change ...
- the current wikidata (`wof`) list is based on this simple code; it includes all values ( + deprecated ! ):

```bash
# collect every "wd:id" value from the admin repos, then de-duplicate
find /wof/whosonfirst-data/whosonfirst-data-admin-* -name '*.geojson' -exec cat {} + | grep "wd:id" | cut -d'"' -f4 > /wof/whosonfirst-data/wd.txt
sort -u -o /wof/whosonfirst-data/wd.txt /wof/whosonfirst-data/wd.txt
```
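The `grep | cut` approach assumes the value is always the fourth quote-delimited field; a slightly more robust sketch with `jq`, assuming the files are regular GeoJSON with a top-level `properties` object:

```bash
# read properties."wd:id" directly instead of relying on field positions
find /wof/whosonfirst-data/whosonfirst-data-admin-* -name '*.geojson' \
  -exec jq -r '.properties["wd:id"] // empty' {} + \
  | sort -u > /wof/whosonfirst-data/wd.txt
```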
- the important tags in the `a_wof_type` ( array ):

```
 'wikimedia'       -- need check .. not all bad: ~95% bad matches -- BE careful!
                   --   Q14204246 "Wikimedia project page"
                   --   Q17442446 "Wikimedia internal item"
                   --   Q13406463 "Wikimedia list article"
,'blacklist'       -- ~ experimental; ~80% bad matches; hand-made list
,'disambiguation'  -- all bad: 100% bad matches ( Q4167410 "Wikimedia disambiguation page" )
,'business'        -- maybe ~80% bad matches, but some are correct ( Q4830453 "business" )
,'demolished'      -- check; a lot of bad matches; https://www.wikidata.org/wiki/Property:P576
,'duplicated'      -- check; a lot of bad matches; P31 (instance of) "duplicated" ( Q17362920 "Wikimedia duplicated page" )
,'hasP279'         -- check; "subclass of"; https://www.wikidata.org/wiki/Property:P279
,'fictional'       -- maybe ~99% bad matches, except "Null Island"
,'redirected'      -- todo: wikidata redirects; needs improvements; not perfect yet
```
> Or any tools you've used to find the ~2000 records mentioned above?
I hope I can push the updated code to the github repo in the next few weeks:

- it is in Golang, processing the wikidata JSON dumps ( they are big, ~80 GB )
- so the code is not ideal for a python fetch ...
- on the other hand, the metadata is probably useful for your code, and you can reuse it ( todo .. )

My biggest problem: I can't create an SQLite distribution .. I have strange bugs .. it should work?
Don't forget, this is not included in my list:

- some code is needed to check the distance ( wikidata <-> wof ) and flag the record if it is extremely large ( see the sketch below )
- or check the wikidata country ( https://www.wikidata.org/wiki/Property:P17 ) and flag the record if it differs from the repo's country code
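A minimal sketch of the distance flagging, assuming a hypothetical TSV of `id  wof_lat  wof_lon  wd_lat  wd_lon` has already been joined together:

```bash
# haversine distance in awk; print records more than 300 km apart
awk -F'\t' 'function rad(d) { return d * 3.14159265358979 / 180 }
{
  dlat = rad($4 - $2); dlon = rad($5 - $3)
  a = sin(dlat/2)^2 + cos(rad($2)) * cos(rad($4)) * sin(dlon/2)^2
  km = 6371 * 2 * atan2(sqrt(a), sqrt(1 - a))
  if (km > 300) print $1, km
}' wof_wd_coords.tsv
```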
EDIT:

- 'wikimedia' is not all bad matches .. be careful

Probably the easiest task is cleaning the `disambiguation` values:
```
$ cat wof_wikidata_need_check_2020Apr12.txt | grep disambiguation | wc -l
2268
$ cat wof_wikidata_need_check_2020Apr12.txt | grep disambiguation | head
| Q1002273 | {wof,wikimedia,disambiguation,blacklist} | Lougheed |
| Q1021967 | {wof,wikimedia,disambiguation,blacklist} | Bălţata |
| Q1024707 | {wof,wikimedia,disambiguation,blacklist} | Cabana |
| Q1027491 | {wof,wikimedia,disambiguation,blacklist} | Embarcadero |
| Q10321658 | {wof,wikimedia,disambiguation,blacklist} | Fathabad |
| Q1038063 | {wof,P6766,wikimedia,disambiguation,blacklist} | Lanesville |
| Q1038169 | {wof,P6766,wikimedia,disambiguation,blacklist} | World's End |
| Q1038444 | {wof,wikimedia,disambiguation,blacklist} | Jalali |
| Q1038503 | {wof,wikimedia,disambiguation,blacklist} | Lawrence Park |
| Q1038807 | {wof,wikimedia,disambiguation,blacklist} | Labo |
...
$
```
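A possible next step, as a sketch: turn the flagged rows into the list of affected WOF files ( assumes the pipe-delimited layout shown above and the local checkout paths from earlier ):

```bash
# pull the flagged Q-ids out of the gist, quoted as they appear in GeoJSON,
# then list the admin-repo files that mention them
grep disambiguation wof_wikidata_need_check_2020Apr12.txt \
  | awk -F'|' '{ gsub(/ /, "", $2); print "\"" $2 "\"" }' > flagged_qids.txt
grep -rlF -f flagged_qids.txt --include='*.geojson' /wof/whosonfirst-data/
```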
and the `fictional` values ( except the Null Island ! ):
```
$ cat wof_wikidata_need_check_2020Apr12.txt | grep fictional
| Q11689382 | {fictional,wof,P6766} | Walford |
| Q16896007 | {fictional,wof,P6766,hasP625,blacklist}| Null Island | it is OK !
| Q1941127 | {fictional,wof,blacklist} | Suvarnabhumi |
| Q218562 | {fictional,wof} | Rivendell |
| Q2479024 | {fictional,wof,P6766} | Sunnydale |
| Q2829261 | {fictional,wof,blacklist} | Al-Qadim |
| Q3804953 | {fictional,wof} | Ivy Town |
| Q3820697 | {fictional,wof,blacklist} | La Canela |
| Q464520 | {fictional,wof} | Shangri-La |
| Q4875964 | {fictional,wof} | Beacon Hills |
| Q566 | {fictional,wof,hasP279} | purgatory |
| Q932923 | {fictional,wof} | Old Forest |
```
Some important metadata ( the `wikidata_*.csv` files are "fake" CSV, with `#` comment lines ):

- `disambiguation` : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_disambiguation.csv
- `fictional` : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_fictional.csv
- `duplicated` : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_duplicated.csv
- `wikimedia` : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_wikimedia.csv
- `blacklist` : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_blacklist.csv
- ... other: `./code/wikidata_*.csv`
The wikidata hierarchy is always changing; the lists are generated with the wdtaxonomy tool:

- https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/wdtaxonomy/gen_wikidata.sh
Example: the `campus` list, generated from Q62447 + Q194188:

```bash
# campus
wdtclr campus
wdtadd campus Q62447 aerodrome
wdtadd campus Q194188 spaceport
```
and the result is this big list:
- https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_campus.csv
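Presumably `wdtclr`/`wdtadd` are helpers in gen_wikidata.sh that wrap wdtaxonomy calls; a rough sketch of the underlying idea ( the exact flags may differ between wdtaxonomy versions ):

```bash
# fetch the subclass taxonomies of "aerodrome" (Q62447) and
# "spaceport" (Q194188) as CSV, one file per seed item
wdtaxonomy --format csv Q62447 > campus_aerodrome.csv
wdtaxonomy --format csv Q194188 > campus_spaceport.csv
```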
JSON dump parsing:

The wikidata JSON dump is parsed and loaded to Postgres with ./code/wdpp.go ( but this is more complex; not documented; work in progress .. )
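For reference, the dump is a single JSON array with one entity per line, so it can be streamed instead of being loaded whole; a minimal sketch of working with it directly:

```bash
# stream the compressed ~80 GB dump and count entities that reference
# "Wikimedia disambiguation page" (Q4167410) anywhere on the line --
# a rough upper bound, not an exact P31 check
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz \
  | gzip -dc | grep -c '"numeric-id":4167410'
```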
Thank you @ImreSamu! I'll dig into these files and notes once I start merging wikidata and name work.
I have just finished removing backlinks to wof that had been placed on non-geographical wikidata items. I have generated a TSV file listing the wof items that point to these wikidata items: https://gist.github.com/bamyers99/467e31c9791701e5c21a729d3dfcf1a0
Thanks @bamyers99 - your gist is very helpful. For these ~1800 WOF records, it looks like the solution is to:

- Remove the existing `wd:id` concordance
- Review `name:` properties (some names were pulled from Wikidata)
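A minimal sketch of the first step with `jq` ( the record path is hypothetical; assumes the concordance lives both as a top-level `wd:id` property and under `wof:concordances` ):

```bash
# drop the wikidata concordance from one WOF record, in place
f=data/101/765/489/101765489.geojson
jq 'del(.properties["wd:id"], .properties["wof:concordances"]["wd:id"])' "$f" \
  > "$f.tmp" && mv "$f.tmp" "$f"
```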