harvesttemplates
[wish] Special case for archived URLs
HarvestTemplate doesn't distinguish between regular URLs and (Internet Archive) archive URLs during import. So when 'website' information is imported from Wikipedia infoboxes into the Wikidata property P856 (official website), archived URLs end up as the official website. Wikidata has a dedicated property for archived URLs: archive URL (P1065). There are a few ways this could be resolved:
- Give an option to ignore URLs beginning with https://web.archive.org... (i.e., to avoid importing in such cases)
- In case of archive URLs, parse the URL to extract the original URL from the complete archive URL (i.e., the string starting at the second "http" is the original URL)
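To make the second option concrete, here is a rough sketch in Python of how the original URL could be pulled out of an archive URL. It is not taken from the harvesttemplates code; the function name `split_archive_url` is hypothetical, and the pattern assumes the usual Wayback Machine layout `https://web.archive.org/web/<timestamp>/<original URL>`.

```python
import re
from typing import Optional, Tuple

# Assumed layout of a Wayback Machine URL (not verified against the tool):
#   https://web.archive.org/web/<timestamp>/<original URL>
_WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/[^/]+/(?P<original>https?://.+)$",
    re.IGNORECASE,
)


def split_archive_url(url: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (official_url, archive_url).

    For a plain URL the archive part is None. For a Wayback Machine URL
    the original URL (starting at the second "http") is returned as the
    official URL and the full input as the archive URL.
    """
    url = url.strip()
    match = _WAYBACK.match(url)
    if match is None:
        return url, None
    return match.group("original"), url
```

With something like this, the first value could go to P856 and the second, when present, to the archive URL property, instead of the archive URL being imported as the official website.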
What do you think?
The first option can be handled on the Wikidata side, by creating a constraint for the property. For the second option, the "remove prefix" tool could help.
Thanks @matejsuchanek. Can you share the link to the "remove prefix" tool?
https://tools.wmflabs.org/pltools/harvesttemplates/?siteid=en&project=wikipedia&namespace=0&property=856&parameter=web&removeprefix=https://web.archive.org/&offset=0&limit=10000&set=0