scholia
Generate Quickstatements for number of pages entry based on pages in curation aspects
What kind of panel would you like to add to which Scholia aspect? Missing number of pages
What kind of information should the panel provide, and which of the visualization options (e.g. table, bubble chart, map) should it use? quickstatement generation based on pages specifications
SELECT DISTINCT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104" AS ?qs) }
UNION
{
?paper wdt:P50 wd:Q20984746 ;
wdt:P304 ?pages .
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages)) AS ?qs)
}
}
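For readers who want to check the arithmetic outside the query service, here is a minimal Python sketch of the same logic. It assumes the input is a list of (entity URI, pages string) pairs; the function names are made up for illustration and are not part of Scholia or QuickStatements. It also shows why the query uses SUBSTR(STR(?paper), 32): the entity URI prefix is 31 characters long, so the QID starts at position 32.

```python
import re

# Prefix of every Wikidata entity URI; it is 31 characters long, which is
# why the SPARQL query takes SUBSTR(STR(?paper), 32) to get the bare QID.
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

# Same pattern as the FILTER REGEX in the query (anchors implied by fullmatch).
PAGE_RANGE = re.compile(r"([1-9][0-9]*)-([1-9][0-9]*)")


def page_count(pages):
    """Return end - start + 1 for a plain 'start-end' range, else None."""
    match = PAGE_RANGE.fullmatch(pages)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    count = end - start + 1
    return count if count > 0 else None


def quickstatements_rows(papers):
    """Build CSV rows from (entity_uri, pages) pairs (hypothetical helper)."""
    rows = ["qid,P1104"]  # header row, as in the first branch of the UNION
    for uri, pages in papers:
        count = page_count(pages)
        if count is not None:
            rows.append(f"{uri[len(ENTITY_PREFIX):]},{count}")
    return rows
```

Values that do not match the plain range pattern are simply skipped, which mirrors what the FILTER does in the query.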
Good idea, and useful for most of the curation pages. I made some minor changes and am currently running a batch based on the following version of the query:
SELECT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P304 ?pages . bd:serviceParam bd:sample.limit 10000 }
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
Looks like there's a dedicated bot for this task: https://www.wikidata.org/wiki/User:PagesBot . It has only done a few test edits, and it uses a different regex.
I left a note for the bot operator.
There are some cases where the pages are indicated in the format "1147-52", i.e. an end page with fewer digits. These cases can be caught by adding an additional STRLEN check to the query:
SELECT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P304 ?pages . bd:serviceParam bd:sample.limit 10000 }
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
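The query above only excludes the abbreviated form. If one wanted to use those values instead, the end page could be expanded by borrowing the leading digits of the start page. The following Python sketch illustrates that idea; it is an assumption about how such values should be read, not something the queries in this thread do, and the function name is hypothetical.

```python
import re

# Start page without leading zeros; end page may be abbreviated ("1147-52").
PAGE_RANGE = re.compile(r"([1-9][0-9]*)-([0-9]+)")


def expand_end_page(pages):
    """Return (start, end) as integers, expanding an abbreviated end page.

    Returns None when the value does not look like a range or when the
    expanded end page would still lie before the start page.
    """
    match = PAGE_RANGE.fullmatch(pages)
    if match is None:
        return None
    start, end = match.group(1), match.group(2)
    if len(end) < len(start):
        # Borrow the missing leading digits from the start page.
        end = start[: len(start) - len(end)] + end
    start_number, end_number = int(start), int(end)
    if end_number < start_number:
        # Ambiguous, e.g. "1199-02" could mean 1202; leave it for manual review.
        return None
    return start_number, end_number
```

With this reading, "1147-52" becomes pages 1147 to 1152, i.e. six pages. Note that whenever the end page has fewer digits than the start page and neither has leading zeros, the end page is also numerically smaller, so the unexpanded difference is never positive.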
Conversely, a variant of the query could be used to find cases where the end page has fewer digits (currently no hits):
SELECT
# ?qs
?paper ?pages ?start_page ?end_page ?number_of_pages ?number_current
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P1104 ?number_current .bd:serviceParam bd:sample.limit 100000 }
?paper wdt:P304 ?pages .
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) < STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
# BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
Sometimes, there are several P1104 statements on an item, so we should add a DISTINCT to the SELECT clause:
SELECT DISTINCT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P304 ?pages . bd:serviceParam bd:sample.limit 10000 }
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
Sample query for an author curation page:
PREFIX target: <http://www.wikidata.org/entity/Q84097812>
SELECT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
?paper wdt:P50 target: .
?paper wdt:P304 ?pages .
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
Corresponding batch: https://quickstatements.toolforge.org/#/batch/104378 .
Here is a variant for topic curation:
PREFIX target: <http://www.wikidata.org/entity/Q41112>
SELECT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P921 target: . bd:serviceParam bd:sample.limit 5000 }
?paper wdt:P304 ?pages .
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
I introduced the bd:sample line because some topics have tens of thousands of papers, and I chose the 5k limit to stay comfortable when working with manual QuickStatements batches (sample batch). Since some corporate authors can reach similar levels (sample batch for the US CDC), this may be worth considering for author curation too.
Variant for use curation: https://w.wiki/5vHn
PREFIX target: <http://www.wikidata.org/entity/Q1659584>
SELECT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P4510 target: . bd:serviceParam bd:sample.limit 5000 }
?paper wdt:P304 ?pages .
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
In all of the above, the SELECT clause for qs should use DISTINCT.
Here is a variant for a venue:
PREFIX target: <http://www.wikidata.org/entity/Q2093109>
SELECT
DISTINCT
?qs
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
{ BIND("qid,P1104,S887" AS ?qs) }
UNION
{
SERVICE bd:sample { ?paper wdt:P1433 target: . bd:serviceParam bd:sample.limit 5000 }
?paper wdt:P304 ?pages .
MINUS { ?paper wdt:P1104 [] }
FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
FILTER (?number_of_pages > 0)
BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
}
}
There has been an earlier ticket on this:
- #917
I think that this recent increase is almost entirely due to this thread. I found it most efficient to work by venue, since that makes it easy to avoid overlap. I would still prefer to have a dedicated bot, but no response so far from the PagesBot operator.
> Looks like there's a dedicated bot for this task: https://www.wikidata.org/wiki/User:PagesBot . It has only done a few test edits, and it uses a different regex.
Good job finding it! I've just uploaded the code I used while testing: https://github.com/guyfawcus/PagesBot
> I would still prefer to have a dedicated bot, but no response so far from the PagesBot operator.
@Daniel-Mietchen: Sorry for the delay! The day job has swallowed up all of my free time recently :slightly_frowning_face:.
I'm more than happy to transfer the account if that would suit. I'm still pretty busy at the moment and not sure I can commit to replying quickly enough to get the bot approved unfortunately.
Failing that, I've got a Pi waiting for a job just like this and my schedule is clearing up in about a month, so if you don't mind doing some of the legwork I could get it running as soon as it's good to go :+1:.
There may be other characters used for the dash: https://www.wikidata.org/wiki/Q56909883
So, so many! The regex I used (^(\d*)\s*(-|‐|‑|‒|–|—|―|−)\s*(\d*)$) tested for the following:
- U+002D - Hyphen-minus
- U+2010 ‐ Hyphen
- U+2011 ‑ Non-breaking hyphen
- U+2012 ‒ Figure dash
- U+2013 – En dash (this is the one in your example, @fnielsen)
- U+2014 — Em dash
- U+2015 ― Horizontal bar
- U+2212 − Minus sign
The Wiki page for the dash character lists even more but I think the ones above are the most likely to appear in the wild.
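As a rough illustration, the eight characters listed above can be collected into a single character class. This Python sketch is not the PagesBot code; the names are made up, and unlike the regex quoted above it requires at least one digit on each side of the dash.

```python
import re

# The eight dash-like characters from the list above, written as escapes
# so that visually similar characters stay distinguishable in source code.
DASHES = "\u002D\u2010\u2011\u2012\u2013\u2014\u2015\u2212"

# Digits, optional whitespace, one dash-like character, optional whitespace, digits.
PAGE_RANGE = re.compile(rf"([0-9]+)\s*[{re.escape(DASHES)}]\s*([0-9]+)")


def split_pages(pages):
    """Return (start, end) as integers, or None if the value is not a range."""
    match = PAGE_RANGE.fullmatch(pages.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Using a character class rather than an alternation keeps the pattern short, and further dash characters can be added to DASHES if they turn up in the data.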
@guyfawcus Thanks for chiming in here. I had a look at your code and think it can handle the typical cases in which the page numbers are expressed, but it misses multiple classes of edge cases.
Here is a query that gives examples for non-digits in page numbers:
While most of these essentially add a prefix of non-digit characters to the digits we could use for computing the number of pages, there are some cases not fitting that pattern, especially Roman numerals like VI-VII or explanatory strings like "A688-9; author reply A89-91".
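The prefix pattern could be handled along the following lines. This is only a sketch under the assumption that a letter prefix on the start page (and optionally the same one on the end page) can be ignored for counting; the function names are hypothetical. Everything else, including Roman numerals and explanatory strings, is returned separately so it can be checked by hand.

```python
import re

# Optional letter prefix, digits, hyphen, optional letter prefix, digits.
PREFIXED_RANGE = re.compile(r"([A-Za-z]*)([0-9]+)-([A-Za-z]*)([0-9]+)")


def prefixed_page_count(pages):
    """Count pages in ranges such as 'e123-e130' or 'S12-20', else None."""
    match = PREFIXED_RANGE.fullmatch(pages)
    if match is None:
        return None
    start_prefix, start, end_prefix, end = match.groups()
    # Accept a missing end prefix, but not a different one.
    if end_prefix and end_prefix != start_prefix:
        return None
    count = int(end) - int(start) + 1
    return count if count > 0 else None


def triage(values):
    """Split values into {pages: count} and a list needing manual review."""
    counted, review = {}, []
    for pages in values:
        count = prefixed_page_count(pages)
        if count is None:
            review.append(pages)
        else:
            counted[pages] = count
    return counted, review
```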
Good point, most of those are easy pickings with a small change.
https://regexr.com/72pr9:
I'm a little hesitant to put too much work into the others though, to be honest (at least initially), especially if they don't match the page(s) format constraint (as is the case with Q24816581 and Q71868115). It just seems a little tricky and error-prone. My thinking with things like this is that they should be logged so that you can go over them later by hand, or with a specific regex if there end up being enough of them.
Roman numerals shouldn't be too tricky though, I've just added an issue to the PagesBot project!
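A minimal sketch of how Roman numeral ranges such as VI-VII might be counted is shown below. It is not taken from PagesBot, the function names are hypothetical, and it assumes a plain hyphen-minus as the separator. It also does not validate the numerals strictly, so a non-standard form like IIII is accepted as 4.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral to an integer; raises KeyError on other characters."""
    values = [ROMAN_VALUES[char] for char in numeral.upper()]
    total = 0
    for index, value in enumerate(values):
        # A smaller value in front of a larger one is subtracted (IV = 4).
        if index + 1 < len(values) and value < values[index + 1]:
            total -= value
        else:
            total += value
    return total


def roman_page_count(pages):
    """Return the number of pages in a range like 'VI-VII', else None."""
    start, separator, end = pages.partition("-")
    if not separator or not start or not end:
        return None
    try:
        count = roman_to_int(end) - roman_to_int(start) + 1
    except KeyError:
        return None  # not a pure Roman numeral range
    return count if count > 0 else None
```

Values for which this returns None can be passed on to the other parsers or logged for manual review.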
Current statistics:
- 38,683,591 scholarly articles: https://scholia.toolforge.org/statistics
- 1,472,969 P1104 statements: https://w.wiki/5$P6
- 1,278,126 derived P1104 statements: https://w.wiki/5$P2
@fnielsen: Sorry for somewhat co-opting this issue! Thought these were useful numbers though.
DISTINCT was missing from the original query. Redid it.
Finally submitted a bot request for PagesBot! Apologies for the insane delay :sweat_smile: