Generate Quickstatements for number of pages entry based on pages in curation aspects

Open fnielsen opened this issue 2 years ago • 21 comments

What kind of panel would you like to add to which Scholia aspect?
Missing number of pages

What kind of information should the panel provide, and which of the visualization options (e.g. table, bubble chart, map) should it use?
QuickStatements generation based on the page specifications

SELECT DISTINCT
  ?qs 
  # ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104" AS ?qs) }
  UNION
  {
    ?paper wdt:P50 wd:Q20984746 ;
         wdt:P304 ?pages .
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages)) AS ?qs)
  }
}
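
For readers who want to check the arithmetic outside the query service, here is a minimal Python sketch of the same steps: match the range pattern, take end minus start plus one, and strip the entity URI prefix (the 31 characters that `SUBSTR(STR(?paper), 32)` skips). The helper names `page_count` and `quickstatements_row` are hypothetical and not part of Scholia.

```python
import re

# Same pattern as the FILTER REGEX in the query: two positive integers
# without leading zeros, separated by a plain hyphen-minus.
PAGE_RANGE = re.compile(r"^[1-9][0-9]*-[1-9][0-9]*$")

# Entity URI prefix; SUBSTR(STR(?paper), 32) in the query skips these
# 31 characters (SPARQL string positions start at 1).
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def page_count(pages):
    """Return the number of pages for a 'start-end' string, or None."""
    if not PAGE_RANGE.match(pages):
        return None
    start, end = (int(part) for part in pages.split("-"))
    count = end - start + 1
    return count if count > 0 else None


def quickstatements_row(paper_uri, pages):
    """Return a 'qid,count' CSV row, or None if no count can be derived."""
    count = page_count(pages)
    if count is None or not paper_uri.startswith(ENTITY_PREFIX):
        return None
    qid = paper_uri[len(ENTITY_PREFIX):]
    return f"{qid},{count}"
```

A range such as "1147-1152" yields 6 pages here, while anything that does not match the pattern is skipped rather than guessed at.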

fnielsen avatar Oct 31 '22 15:10 fnielsen

Good idea, and useful for most of the curation pages. I made some minor changes and am currently running a batch based on the following version of the query:

SELECT 
?qs 
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P304 ?pages . bd:serviceParam bd:sample.limit 10000 }
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}
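
The change in this version is the extra S887 column, which attaches a reference to every generated statement. As a rough illustration of the resulting CSV, the sketch below builds the same three-column batch text in Python. The function name `build_batch` and the input mapping are assumptions made for this example; only the header and the Q110768064 value come from the query above.

```python
import csv
import io

HEADER = ["qid", "P1104", "S887"]
REFERENCE_ITEM = "Q110768064"  # value used for the S887 column in the query above


def build_batch(counts):
    """Build QuickStatements CSV text from a {qid: number_of_pages} mapping."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for qid, number_of_pages in counts.items():
        writer.writerow([qid, number_of_pages, REFERENCE_ITEM])
    return buffer.getvalue()
```

Calling `build_batch({"Q1": 6})` with a placeholder QID gives a header line followed by one row ending in the reference item.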

Daniel-Mietchen avatar Nov 03 '22 21:11 Daniel-Mietchen

Looks like there's a dedicated bot for this task: https://www.wikidata.org/wiki/User:PagesBot . It has only done a few test edits, and it uses a different regex.

Daniel-Mietchen avatar Nov 03 '22 21:11 Daniel-Mietchen

I left a note for the bot operator.

Daniel-Mietchen avatar Nov 03 '22 21:11 Daniel-Mietchen

There are some cases where the pages are indicated in the format "1147-52", i.e. an end page with fewer digits. These cases can be caught by adding an additional STRLEN check to the query:

SELECT 
?qs 
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P304 ?pages . bd:serviceParam bd:sample.limit 10000 }
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}

Conversely, a variant of the query could be used to find cases where the end page has fewer digits (currently no hits):


SELECT 
# ?qs 
?paper ?pages ?start_page ?end_page ?number_of_pages ?number_current
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P1104 ?number_current . bd:serviceParam bd:sample.limit 100000 }
    ?paper wdt:P304 ?pages . 
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) < STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
#     BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}
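
Note that with plain integer subtraction an abbreviated range such as "1147-52" gives 52 - 1147 + 1 = -1094, so the `?number_of_pages > 0` filter already drops it; as far as I can tell, this would also keep the variant query above from returning hits. A possible way to recover a count from such ranges is to borrow the missing leading digits from the start page. The sketch below illustrates that idea under the assumption that the shorter end page is an abbreviation; `expand_abbreviated_range` is a hypothetical helper, not something used in the batches.

```python
import re

RANGE = re.compile(r"([0-9]+)-([0-9]+)")


def expand_abbreviated_range(pages):
    """Expand 'start-end' where end may omit leading digits, e.g. '1147-52'.

    Returns (start, end) as integers, or None if the string cannot be
    read as a non-decreasing page range.
    """
    match = RANGE.fullmatch(pages)
    if match is None:
        return None
    start_text, end_text = match.groups()
    missing = len(start_text) - len(end_text)
    if missing > 0:
        # Borrow the missing leading digits from the start page.
        end_text = start_text[:missing] + end_text
    start, end = int(start_text), int(end_text)
    # Ambiguous cases such as "199-02" still come out decreasing; skip them.
    return (start, end) if end >= start else None
```

For "1147-52" this returns the pair 1147 and 1152, i.e. 6 pages; values it cannot interpret are returned as `None` so they can be reviewed by hand.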

Daniel-Mietchen avatar Nov 04 '22 14:11 Daniel-Mietchen

Sometimes, there are several P1104 statements on an item, so we should add a DISTINCT to the SELECT clause:

SELECT DISTINCT
?qs 
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P304 ?pages . bd:serviceParam bd:sample.limit 10000 }
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}

Daniel-Mietchen avatar Nov 06 '22 12:11 Daniel-Mietchen

Sample query for an author curation page:

PREFIX target: <http://www.wikidata.org/entity/Q84097812> 

SELECT 
?qs 
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    ?paper wdt:P50 target: . 
    ?paper wdt:P304 ?pages .     
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}


Corresponding batch: https://quickstatements.toolforge.org/#/batch/104378 .

Daniel-Mietchen avatar Nov 06 '22 23:11 Daniel-Mietchen

Here is a variant for topic curation:


PREFIX target: <http://www.wikidata.org/entity/Q41112> 

SELECT 
?qs 
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P921 target: . bd:serviceParam bd:sample.limit 5000 } 
    ?paper wdt:P304 ?pages .     
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}

I introduced the bd:sample line because some topics have tens of thousands of papers, and I chose the 5k limit to keep manual QuickStatements batches at a comfortable size (sample batch). Since some corporate authors can reach similar levels (sample batch for the US CDC), this may be worth considering for author curation too.
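
An alternative to sampling inside the query, if the full result set can be retrieved at all, would be to split the rows into batches of at most 5,000 afterwards. The sketch below shows that splitting step only; `split_into_batches` is a hypothetical helper and not what was used for the batches mentioned here.

```python
def split_into_batches(rows, header="qid,P1104,S887", batch_size=5000):
    """Yield CSV texts of at most batch_size data rows, each with the header."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for offset in range(0, len(rows), batch_size):
        chunk = rows[offset:offset + batch_size]
        # Repeat the header so that every batch can be submitted on its own.
        yield "\n".join([header, *chunk]) + "\n"
```

Unlike `bd:sample`, this covers every candidate exactly once, at the cost of needing the unsampled query to finish within the time limit.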

Daniel-Mietchen avatar Nov 07 '22 00:11 Daniel-Mietchen

Variant for use curation: https://w.wiki/5vHn

PREFIX target: <http://www.wikidata.org/entity/Q1659584> 

SELECT 
?qs 
# ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P4510 target: . bd:serviceParam bd:sample.limit 5000 } 
    ?paper wdt:P304 ?pages .     
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}

Daniel-Mietchen avatar Nov 07 '22 00:11 Daniel-Mietchen

In all of the above, the SELECT clause for qs should use DISTINCT.

Daniel-Mietchen avatar Nov 10 '22 11:11 Daniel-Mietchen

Here is a variant for a venue:

PREFIX target: <http://www.wikidata.org/entity/Q2093109>

SELECT 
  DISTINCT
    ?qs 
#     ?paper ?pages ?start_page ?end_page ?number_of_pages
WHERE {
  { BIND("qid,P1104,S887" AS ?qs) }
  UNION
  {
    SERVICE bd:sample { ?paper wdt:P1433 target: . bd:serviceParam bd:sample.limit 5000 } 
    ?paper wdt:P304 ?pages .     
    MINUS { ?paper wdt:P1104 [] }
    FILTER REGEX(?pages, "^[1-9][0-9]*-[1-9][0-9]*$")
    BIND(xsd:integer(STRBEFORE(?pages, "-")) AS ?start_page)
    BIND(xsd:integer(STRAFTER(?pages, "-")) AS ?end_page)
    FILTER (STRLEN(STR(?end_page)) >= STRLEN(STR(?start_page)))
    BIND(?end_page - ?start_page + 1 AS ?number_of_pages)
    FILTER (?number_of_pages > 0)
    BIND(CONCAT(SUBSTR(STR(?paper), 32), ",", STR(?number_of_pages), ",Q110768064") AS ?qs)
  }
}

Daniel-Mietchen avatar Nov 10 '22 11:11 Daniel-Mietchen

There has been an earlier ticket on this:

  • #917

Daniel-Mietchen avatar Nov 10 '22 22:11 Daniel-Mietchen

Usage history of P1104:

[Screenshot: Wikidata Query Service, 2022-11-13]

I think that this recent increase is almost entirely due to this thread. I found it most efficient to work by venue, since that makes it easy to avoid overlap. I would still prefer to have a dedicated bot, but no response so far from the PagesBot operator.

Daniel-Mietchen avatar Nov 13 '22 13:11 Daniel-Mietchen

> Looks like there's a dedicated bot for this task: https://www.wikidata.org/wiki/User:PagesBot . It has only done a few test edits, and it uses a different regex.

Good job finding it! I've just uploaded the code I used while testing: https://github.com/guyfawcus/PagesBot

guyfawcus avatar Nov 17 '22 11:11 guyfawcus

> I would still prefer to have a dedicated bot, but no response so far from the PagesBot operator.

@Daniel-Mietchen: Sorry for the delay! The day job has swallowed up all of my free time recently :slightly_frowning_face:.

I'm more than happy to transfer the account if that would suit. I'm still pretty busy at the moment and not sure I can commit to replying quickly enough to get the bot approved unfortunately.

Failing that, I've got a Pi waiting for a job just like this and my schedule is clearing up in about a month, so if you don't mind doing some of the legwork I could get it running as soon as it's good to go :+1:.

guyfawcus avatar Nov 17 '22 11:11 guyfawcus

There may be other characters used for the dash: https://www.wikidata.org/wiki/Q56909883

fnielsen avatar Nov 18 '22 10:11 fnielsen

So, so many! The regex I used (^(\d*)\s*(-|‐|‑|‒|–|—|―|−)\s*(\d*)$) tested for the following:

The Wiki page for the dash character lists even more but I think the ones above are the most likely to appear in the wild.
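
To make the dash handling concrete, here is a small Python sketch along the lines of the regex quoted above. The code points are my reading of the eight alternatives in that regex and should be treated as an assumption, as should the helper name `split_pages`; this is not the PagesBot code.

```python
import re

# Assumed code points for the alternatives in the regex quoted above:
# hyphen-minus, hyphen, non-breaking hyphen, figure dash, en dash,
# em dash, horizontal bar, minus sign.
DASHES = "\u002d\u2010\u2011\u2012\u2013\u2014\u2015\u2212"

# One or more digits on each side (the quoted regex uses \d*, which also
# accepts empty sides), with optional whitespace around a single dash.
PAGE_RANGE = re.compile(rf"^([0-9]+)\s*[{re.escape(DASHES)}]\s*([0-9]+)$")


def split_pages(pages):
    """Return (start, end) for a dash-separated range, or None."""
    match = PAGE_RANGE.match(pages.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Writing the characters as escapes keeps visually similar dashes distinguishable in the source.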

guyfawcus avatar Nov 18 '22 19:11 guyfawcus

@guyfawcus Thanks for chiming in here. I had a look at your code and think it can handle the typical cases in which the page numbers are expressed, but it misses multiple classes of edge cases.

Here is a query that gives examples of non-digits in page numbers: [image]

While most of these essentially add a prefix of non-digit characters to the digits we could use for computing the number of pages, there are some cases not fitting that pattern, especially Roman numerals like VI-VII or explanatory strings like "A688-9; author reply A89-91".
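
As a sketch of how the prefix pattern could be handled, the Python snippet below strips a letter prefix before computing the count. The function name `prefixed_page_count` and the sample value "S12-S18" are made up for illustration; the rule that a differing end prefix is rejected is a design assumption, not something settled in this thread.

```python
import re

# Optional ASCII letter prefix before each number, e.g. "S12-S18".
PREFIXED = re.compile(r"([A-Za-z]*)([0-9]+)-([A-Za-z]*)([0-9]+)")


def prefixed_page_count(pages):
    """Return the page count for ranges like 'S12-S18', or None."""
    match = PREFIXED.fullmatch(pages)
    if match is None:
        return None
    start_prefix, start, end_prefix, end = match.groups()
    # Accept "S12-S18" and "S12-18", but not mixed prefixes like "A12-B18".
    if end_prefix and end_prefix != start_prefix:
        return None
    count = int(end) - int(start) + 1
    return count if count > 0 else None
```

Roman numerals and explanatory strings such as "A688-9; author reply A89-91" do not match and are returned as `None`.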

Daniel-Mietchen avatar Nov 20 '22 04:11 Daniel-Mietchen

Good point, most of those are easy pickings with a small change.

https://regexr.com/72pr9 [image: pages-regex]

I'm a little hesitant to put too much work into the others though, to be honest (at least initially), especially if they don't match the page(s) format constraint (as is the case with Q24816581 and Q71868115). It just seems a little tricky and error-prone. My thinking with things like this is that they should be logged so that you can go over them later by hand, or with a specific regex if there end up being enough of them.
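
The "log it for later" approach could look roughly like the following Python sketch, which separates values that parse cleanly from those set aside for manual review. The function name `triage` and the placeholder QIDs in the usage note are hypothetical.

```python
import re

SIMPLE_RANGE = re.compile(r"([0-9]+)-([0-9]+)")


def triage(page_values):
    """Split {qid: pages} into derived counts and values left for review."""
    counts, for_review = {}, {}
    for qid, pages in page_values.items():
        match = SIMPLE_RANGE.fullmatch(pages)
        count = int(match.group(2)) - int(match.group(1)) + 1 if match else 0
        if count > 0:
            counts[qid] = count
        else:
            # Keep the raw string so it can be inspected by hand later.
            for_review[qid] = pages
    return counts, for_review
```

For example, `triage({"Q1": "10-12", "Q2": "A688-9; author reply A89-91"})` puts Q1 in the first mapping with a count of 3 and leaves Q2 in the second.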

Roman numerals shouldn't be too tricky though, I've just added an issue to the PagesBot project!
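
For illustration only, a Roman numeral range such as VI-VII could be converted along these lines in Python. This is a sketch under my own assumptions (plain hyphen, no validation of malformed numerals), with hypothetical helper names, and not the approach tracked in the PagesBot issue.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'VI' to an integer, or return None.

    Malformed numerals like 'IIII' are not rejected by this sketch.
    """
    numeral = numeral.upper()
    if not numeral or any(ch not in ROMAN_VALUES for ch in numeral):
        return None
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one (IV, IX).
        if nxt != " " and value < ROMAN_VALUES[nxt]:
            total -= value
        else:
            total += value
    return total


def roman_page_count(pages):
    """Return the page count for a range like 'VI-VII', or None."""
    start_text, sep, end_text = pages.partition("-")
    if not sep:
        return None
    start, end = roman_to_int(start_text), roman_to_int(end_text)
    if start is None or end is None or end < start:
        return None
    return end - start + 1
```

With this, "VI-VII" counts as 2 pages, and Arabic ranges fall through as `None` so they can be handled by the regular path.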

guyfawcus avatar Nov 20 '22 11:11 guyfawcus

Current statistics:

  • 38,683,591 scholarly articles: https://scholia.toolforge.org/statistics
  • 1,472,969 P1104 statements: https://w.wiki/5$P6
  • 1,278,126 derived P1104 statements: https://w.wiki/5$P2
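
To put these figures in proportion, the short Python calculation below derives two ratios from them. It assumes that the derived statements are a subset of all P1104 statements, and it treats the first ratio as a rough indicator only, since P1104 is also used on items other than scholarly articles.

```python
scholarly_articles = 38_683_591
p1104_statements = 1_472_969
derived_p1104_statements = 1_278_126

# Rough coverage: P1104 statements per 100 scholarly articles.
coverage = 100 * p1104_statements / scholarly_articles

# Share of P1104 statements that were derived from page ranges.
derived_share = 100 * derived_p1104_statements / p1104_statements

print(f"P1104 statements per 100 scholarly articles: {coverage:.1f}")
print(f"Derived share of P1104 statements: {derived_share:.1f}%")
```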

@fnielsen: Sorry for somewhat co-opting this issue! Thought these were useful numbers though.

guyfawcus avatar Nov 22 '22 12:11 guyfawcus

DISTINCT was missing from the original query. I have redone it.

fnielsen avatar Mar 29 '23 13:03 fnielsen

Finally submitted a bot request for PagesBot! Apologies for the insane delay :sweat_smile:

guyfawcus avatar Mar 11 '24 20:03 guyfawcus