harvesttemplates [wish] Special case for archived URLs

Open johnsamuelwrites opened this issue 7 years ago • 3 comments

HarvestTemplates doesn't distinguish between ordinary URLs and Internet Archive URLs during import. As a result, when importing 'website' values from Wikipedia infoboxes into Wikidata property P856 (official website), archived URLs get stored as the official website. Wikidata has a dedicated property for archived URLs: archive URL (P1065). There are a couple of ways this could be resolved:

Give an option to ignore URLs beginning with https://web.archive.org... (i.e., to avoid importing in such cases)
In the case of archive URLs, parse the URL to extract the original URL from the complete archive URL (i.e., the string starting at the second "http" is the original URL)
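
To make the second option concrete, here is a minimal sketch in Python of how an archived URL could be split into its original URL and the archive URL. It is not part of HarvestTemplates: the function name `split_archive_url` and the regular expression are hypothetical, and the pattern assumes the usual Wayback Machine layout of `https://web.archive.org/web/<timestamp>/<original URL>`.

```python
import re
from typing import Optional, Tuple

# Assumed Wayback Machine layout (not taken from HarvestTemplates):
#   https://web.archive.org/web/<timestamp>[modifier]/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d+[a-z_]*/(?P<original>.+)$",
    re.IGNORECASE,
)


def split_archive_url(url: str) -> Tuple[str, Optional[str]]:
    """Return (original_url, archive_url) for a harvested template value.

    archive_url is None when the value is not a Wayback Machine URL.
    """
    url = url.strip()
    match = WAYBACK_RE.match(url)
    if match is None:
        # Ordinary URL: nothing to unwrap.
        return url, None

    original = match.group("original")
    # Some archived links omit the scheme of the original URL.
    if not re.match(r"^https?://", original, re.IGNORECASE):
        original = "http://" + original
    return original, url


if __name__ == "__main__":
    archived = "https://web.archive.org/web/20170907190900/http://example.org/"
    print(split_archive_url(archived))
    print(split_archive_url("http://example.org/"))
```

With a split like this, a tool could either skip the value entirely when the second element is not `None` (the first option), or import the first element as the official website and keep the second for the archive URL property.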

What do you think?

Sep 07 '17 19:09 johnsamuelwrites

The first option can be handled on Wikidata by creating a constraint for the property. For the second option, the "remove prefix" tool could help.

Sep 08 '17 14:09 matejsuchanek

Thanks @matejsuchanek. Can you share a link to the "remove prefix" tool?

Sep 09 '17 10:09 johnsamuelwrites

https://tools.wmflabs.org/pltools/harvesttemplates/?siteid=en&project=wikipedia&namespace=0&property=856&parameter=web&removeprefix=https://web.archive.org/&offset=0&limit=10000&set=0

Sep 09 '17 10:09 matejsuchanek
