
Wikidata/Wikipedia cleaning tasks

ImreSamu opened this issue 5 years ago • 7 comments

I have collected some Wikidata/Wikipedia cleaning tasks. The data needs some cleaning before new imports.
Wikidata is changing and evolving every day, so these tasks should probably run regularly. Some tasks are very easy, some are very hard.

Wikidata

  • [ ] Cleaning wof -> wd ("Wikimedia disambiguation pages") (~2635 wof-wikidata ids)

    • [ ] Find the wof->wikidata records (easy task)
    • [ ] Check the probability of "bad translations"
    • [ ] Clean ('wd:id', 'wk:page', 'name*')
  • [ ] Cleaning wof -> strange wikidata records (instance of "film", "human", "fictional characters", ...)

    • [ ] Find
      • probably ~2000 wof records;
      • complex query; needs some extra manual checking
    • [ ] Check the probability of "bad translations" and bad names
      • example: https://spelunker.whosonfirst.org/id/102552665/
    • [ ] Clean
  • [ ] Find & Update: Wikidata redirected values (probably none exist yet, but ...)

  • [ ] Find & Check: instance of "Wikimedia duplicated page" https://www.wikidata.org/wiki/Q17362920

    • valid, but not a perfect version
  • [ ] Find & Analyze: current status of the cebwiki/svwiki-import-related wikidata ids

    • see: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_proposed_course_of_action_for_dealing_with_cebwiki/svwiki_geographic_duplicates
    • replace if a better wikidata item exists now
    • ongoing work :(
  • [ ] Find & Analyze & Fix: Wikidata duplicates (#829)

    • this should be the last task, because a lot of duplicates will be removed by the "strange wikidata records" task
  • [ ] Check & Analyze: wikidata items without GPS coordinates (locality, localadmin, ...)

  • [ ] Check & Analyze: wof-wikidata pairs with extreme distance (locality, localadmin, ...) > 300 km, > 1000 km; see the sketch after this list

  • [ ] Check & Analyze: wof-wikidata pairs with different country codes

  • [ ] Check & Analyze: past wikidata updates & "bad translations" that survived them
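
A minimal sketch of the distance check, assuming both the WOF centroid and the Wikidata P625 coordinate are already in hand as (lat, lon) pairs; haversine_km and flag_distance are illustrative names (not existing WOF tooling), and the thresholds are the 300 km / 1000 km from the list above:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_distance(wof, wd, warn_km=300, bad_km=1000):
    """Flag a wof/wikidata coordinate pair using the thresholds above."""
    d = haversine_km(wof[0], wof[1], wd[0], wd[1])
    if d > bad_km:
        return ("bad", d)
    if d > warn_km:
        return ("suspicious", d)
    return ("ok", d)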

wikipedia ("wk:page")

  • [ ] Find & Fix: bad wikipedia links (different from the wikidata-related ones). Example: https://spelunker.whosonfirst.org/id/101765489/ ('"wk:page": "Chiang Rai International Airport"' is incorrect). Or remove all "wk:page" values, because they carry minimal business value and can be recreated via the Wikidata API (see the sketch below).
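
If "wk:page" is dropped, it can indeed be recreated later from the item's sitelinks. A minimal sketch against the public Wikidata API (wk_page_from_wikidata is an illustrative name):

import requests

def wk_page_from_wikidata(qid, lang="en"):
    """Fetch the Wikipedia page title (sitelink) for a Wikidata item."""
    r = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": qid,
            "props": "sitelinks",
            "sitefilter": f"{lang}wiki",
            "format": "json",
        },
        timeout=30,
    )
    r.raise_for_status()
    # redirected ids come back keyed by the target QID, so take the single entity
    entity = next(iter(r.json()["entities"].values()))
    sitelink = entity.get("sitelinks", {}).get(f"{lang}wiki")
    return sitelink["title"] if sitelink else None

# wk_page_from_wikidata("Q42") -> "Douglas Adams"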

Comments:

'"bad translations"' == (#976) like a 'Door County' vs. "Door" (that is, the entry point to a buildling)

sample: ("Wikimedia disambiguation")

+-----------+-----------+---------------------+--------------------+------------+
|    id     |   wd_id   |           wof_name  |           wd_label | metatable  |
+-----------+-----------+---------------------+--------------------+------------+
| 102547857 | Q16768887 | Chiang Rai Airport  | Chiang Rai Airport | wof_campus |
| 102552251 | Q4848857  | Bajawa Airport      | Bajawa Airport     | wof_campus |
| 102557029 | Q7232429  | Portsmouth Airport  | Portsmouth Airport | wof_campus |
| 102063611 | Q959414   | Hof                 | Hof                | wof_county |
| 102063625 | Q959414   | Hof                 | Hof                | wof_county |
| 102063739 | Q405583   | Olpe                | Olpe               | wof_county |
| 102063893 | Q409412   | Borken              | Borken             | wof_county |
| 102063979 | Q422291   | Verden              | Verden             | wof_county |
+-----------+-----------+---------------------+--------------------+------------+

sample ( strange wikidata )

+-----------+-----------+----------------+--------------------------+------------+---------------------------+
|    id     |   wd_id   |    wof_name    |         wd_label         | metatable  |           note            |
+-----------+-----------+----------------+--------------------------+------------+---------------------------+
| 102552665 | Q4859047  | Borovo Airport | Jat Airways destinations | wof_campus | list                      |
| 102048447 | Q8040109  | Wyndham        | Wyndham Emery            | wof_county | Welsh rugby league player |
| 102048731 | Q20712693 | Salisbury      | Salisbury F.C.           | wof_county | football club             |
| 102048981 | Q7017413  | Newcastle      | Newcastle F.C.           | wof_county |                           |
| 102048985 | Q3500796  | Beverley       | Beverley's               | wof_county | music ?                   |
| 102049095 | Q17023107 | Port Stephens  | Port Stephens Examiner   | wof_county | newspaper                 |
| 102049185 | Q548928   | Charles Sturt  | Charles Sturt            | wof_county | Australian explorer       |
| 102049195 | Q6911295  | Moreland       | Moreland F.C.            | wof_county |                           |
| 102049387 | Q1130849  | Liverpool      | Liverpool F.C.           | wof_county |                           |
| 102049585 | Q1317902  | Wellington     | Wellington's Victory     | wof_county | symphony                  |
+-----------+-----------+----------------+--------------------------+------------+---------------------------+

ImreSamu avatar Sep 22 '18 22:09 ImreSamu

Hey @ImreSamu - we've done additional work with Wikidata since this issue was filed and I'm planning additional name translation work through https://github.com/whosonfirst-data/whosonfirst-data/issues/1821.

These are all valid issues, though I'm most interested in some of the "Cleaning wof" tasks above, like:

  • Check the probability of "bad translations"
  • Cleaning wof -> strange wikidata records (instance of "film", "human", "fictional characters")
  • "bad" translations

Do you happen to have a list of records that you'd recommend we take a look at? Or any tools you've used to find the ~2000 records mentioned above?

stepps00 avatar Apr 09 '20 21:04 stepps00

@stepps00 : I have created a fresh new list - not perfect, but imho useful: https://gist.github.com/ImreSamu/bba79ab8093af8b4f893b9142f64fe9a

  • now 10733 unique wikidata ids ( wd_id ) are flagged; probably ~80% are incorrect matches ...
  • I have some JSON decoding problems with the WOF data .. so I can't import it to Postgres yet :disappointed: .. the new repo is a big change ...
  • the current wikidata ('wof') list is based on this simple code; it includes all values ( + deprecated ones! )
 find /wof/whosonfirst-data/whosonfirst-data-admin-* -name '*.geojson' -exec cat {} + | grep '"wd:id"' | cut -d'"' -f4 > /wof/whosonfirst-data/wd.txt
 sort -u -o /wof/whosonfirst-data/wd.txt /wof/whosonfirst-data/wd.txt
  • the important tags in the a_wof_type array (a sketch for checking the disambiguation bucket follows below):
        'wikimedia'       -- needs checking .. not all bad: ~95% bad matches. Be careful!
                          --   Q14204246 "Wikimedia project page"
                          --   Q17442446 "Wikimedia internal item"
                          --   Q13406463 "Wikimedia list article"
        ,'blacklist'      -- ~ experimental; 80% bad matches; hand-made list
        ,'disambiguation' -- all bad: 100% bad matches ( Q4167410 "Wikimedia disambiguation page" )
        ,'business'       -- maybe ~80% bad matches, but some are correct ( Q4830453 "business" )
        ,'demolished'     -- check, lots of bad matches; https://www.wikidata.org/wiki/Property:P576
        ,'duplicated'     -- check, lots of bad matches; P31 (instance of) "duplicated" ( Q17362920 "Wikimedia duplicated page" )
        ,'hasP279'        -- check; "subclass of"; https://www.wikidata.org/wiki/Property:P279
        ,'fictional'      -- maybe ~99% bad matches, except "Null Island"
        ,'redirected'     -- todo: wikidata redirects .. needs improvements; not perfect yet ..
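
A minimal sketch of the 'disambiguation' check against the Wikidata Query Service (the helper name is illustrative, and WDQS limits query size, so pass the ids in batches of a few hundred):

import requests

WDQS = "https://query.wikidata.org/sparql"

def disambiguation_subset(qids):
    """Return the qids whose P31 (instance of) is Q4167410,
    i.e. "Wikimedia disambiguation page" - the 100% bad-match bucket."""
    values = " ".join(f"wd:{q}" for q in qids)
    query = f"SELECT ?item WHERE {{ VALUES ?item {{ {values} }} ?item wdt:P31 wd:Q4167410 . }}"
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "wof-wikidata-cleanup-sketch/0.1"},
                     timeout=60)
    r.raise_for_status()
    return {b["item"]["value"].rsplit("/", 1)[-1]
            for b in r.json()["results"]["bindings"]}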

Or any tools you've used to find the ~2000 records mentioned above?

I hope I can push the updated code to the github repo in the next few weeks.

  • it is in Golang - processing the wikidata json dumps ( which are big, ~80 GB ) ..
    • so the code is not a perfect fit for a python fetch ...
  • on the other hand: the metadata is probably useful for your code .. and you can reuse it .. ( todo .. )

my biggest problem - I can't create an SQLite distribution .. I'm hitting strange bugs .. it should work?

don't forget - these are not included in my list:

  • we need some code checking the distance ( wikidata <----> wof ) .. and if it is extremely large ---> flag it ( see the distance sketch above )
    • or check the wikidata country ( https://www.wikidata.org/wiki/Property:P17 ) and if it differs from the repo's country code --> flag it; a sketch follows below
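
A minimal sketch of that country check (the helper name is illustrative; P17 = country, P297 = ISO 3166-1 alpha-2 code):

import requests

def wikidata_country_code(qid):
    """ISO 3166-1 alpha-2 code (P297) of the item's country (P17), or None."""
    query = f"SELECT ?code WHERE {{ wd:{qid} wdt:P17 ?c . ?c wdt:P297 ?code . }}"
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"},
                     headers={"User-Agent": "wof-wikidata-cleanup-sketch/0.1"},
                     timeout=60)
    r.raise_for_status()
    rows = r.json()["results"]["bindings"]
    return rows[0]["code"]["value"] if rows else None

# flag when it disagrees with the repo, e.g.:
#   code = wikidata_country_code(wd_id)
#   if code and code != wof_country_code: flag(wof_id)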

EDIT:

  • 'wikimedia' is not all bad matches .. be careful

ImreSamu avatar Apr 12 '20 17:04 ImreSamu

probably the easiest task is cleaning the disambiguation values ..

$ cat wof_wikidata_need_check_2020Apr12.txt | grep disambiguation | wc -l
2268
$ cat wof_wikidata_need_check_2020Apr12.txt | grep disambiguation | head
| Q1002273  | {wof,wikimedia,disambiguation,blacklist}       | Lougheed      |
| Q1021967  | {wof,wikimedia,disambiguation,blacklist}       | Bălţata       |
| Q1024707  | {wof,wikimedia,disambiguation,blacklist}       | Cabana        |
| Q1027491  | {wof,wikimedia,disambiguation,blacklist}       | Embarcadero   |
| Q10321658 | {wof,wikimedia,disambiguation,blacklist}       | Fathabad      |
| Q1038063  | {wof,P6766,wikimedia,disambiguation,blacklist} | Lanesville    |
| Q1038169  | {wof,P6766,wikimedia,disambiguation,blacklist} | World's End   |
| Q1038444  | {wof,wikimedia,disambiguation,blacklist}       | Jalali        |
| Q1038503  | {wof,wikimedia,disambiguation,blacklist}       | Lawrence Park |
| Q1038807  | {wof,wikimedia,disambiguation,blacklist}       | Labo          |
...
$

and fictional ( except the Null Island! )

$ cat wof_wikidata_need_check_2020Apr12.txt  | grep fictional
| Q11689382 | {fictional,wof,P6766}                  | Walford      |
| Q16896007 | {fictional,wof,P6766,hasP625,blacklist}| Null Island  | it is OK !
| Q1941127  | {fictional,wof,blacklist}              | Suvarnabhumi |
| Q218562   | {fictional,wof}                        | Rivendell    |
| Q2479024  | {fictional,wof,P6766}                  | Sunnydale    |
| Q2829261  | {fictional,wof,blacklist}              | Al-Qadim     |
| Q3804953  | {fictional,wof}                        | Ivy Town     |
| Q3820697  | {fictional,wof,blacklist}              | La Canela    |
| Q464520   | {fictional,wof}                        | Shangri-La   |
| Q4875964  | {fictional,wof}                        | Beacon Hills |
| Q566      | {fictional,wof,hasP279}                | purgatory    |
| Q932923   | {fictional,wof}                        | Old Forest   |

ImreSamu avatar Apr 12 '20 17:04 ImreSamu

some important metadata ( the wikidata_*.csv files are fake csv, with # comment lines; see the loading sketch after this list ):

  • disambiguation : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_disambiguation.csv
  • fictional : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_fictional.csv
  • duplicated : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_duplicated.csv
  • wikimedia : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_wikimedia.csv
  • blacklist : https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_blacklist.csv
  • ... other: ./code/wikidata_*.csv
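
Despite the '#' comment lines, these files load directly with a comment-aware reader, e.g. pandas (a sketch; the column names depend on each file's header):

import pandas as pd

# '#' lines are comments, hence "fake csv" - pandas can skip them
disambiguation = pd.read_csv("code/wikidata_disambiguation.csv", comment="#")
print(len(disambiguation), "rows")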

the wikidata hierarchy is always changing .. the lists are generated with the wdtaxonomy tool ..

  • https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/wdtaxonomy/gen_wikidata.sh

example: campus

the campus list is generated from Q62447 + Q194188 ( a SPARQL equivalent is sketched below )

# campus
wdtclr campus
wdtadd campus Q62447  aerodrome
wdtadd campus Q194188 spaceport

and the result is this big list:

  • https://github.com/whosonfirst/concordances-whosonfirst-wikidata/blob/master/code/wikidata_campus.csv
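
Conceptually, wdtaxonomy walks the P279 (subclass of) tree, so the campus list can be approximated with a single WDQS query (a sketch of the idea, not the project's actual generator):

import requests

# all transitive subclasses of aerodrome (Q62447) and spaceport (Q194188)
query = """
SELECT DISTINCT ?item WHERE {
  VALUES ?root { wd:Q62447 wd:Q194188 }
  ?item wdt:P279* ?root .
}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "wof-wikidata-cleanup-sketch/0.1"},
                 timeout=120)
r.raise_for_status()
print(len(r.json()["results"]["bindings"]), "campus-like classes")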

json dump parsing:

the wikidata json dump is parsed and loaded into Postgres with ./code/wdpp.go
( but this is more complex .. ; not documented; work in progress .. )

ImreSamu avatar Apr 12 '20 18:04 ImreSamu

Thank you @ImreSamu! I'll dig into these files and notes once I start merging wikidata and name work.

stepps00 avatar Apr 13 '20 23:04 stepps00

I have just finished removing backlinks to wof that had been placed on non-geographical wikidata items. I have generated a tsv file that lists the wof items that are pointing to these wikidata items: https://gist.github.com/bamyers99/467e31c9791701e5c21a729d3dfcf1a0

bamyers99 avatar May 14 '21 01:05 bamyers99

Thanks @bamyers99 - your gist is very helpful. For these ~1800 WOF records, it looks like the solution is to:

  • Remove the existing wd:id concordance (see the sketch below)
  • Review name: properties (some names were pulled from Wikidata)
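
A minimal sketch of the first step, assuming the usual WOF layout where the concordance lives under properties["wof:concordances"] (and sometimes as a flat "wd:id" property); in practice the repo's own export tooling should rewrite the files so the formatting stays canonical:

import json

def drop_wd_concordance(path):
    """Remove the wd:id concordance from one WOF GeoJSON file; True if changed."""
    with open(path, encoding="utf-8") as f:
        feature = json.load(f)
    props = feature.get("properties", {})
    changed = props.pop("wd:id", None) is not None
    conc = props.get("wof:concordances", {})
    changed = (conc.pop("wd:id", None) is not None) or changed
    if changed:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(feature, f, ensure_ascii=False, indent=2)
            f.write("\n")
    return changed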

stepps00 avatar May 17 '21 18:05 stepps00