Can we use "A large dataset of software mentions in the biomedical literature"
Context
https://arxiv.org/pdf/2209.00693.pdf
Question
Can we use "A large dataset of software mentions in the biomedical literature"
Yes. It's CC0 and supposed to be released next week.
It's available: https://github.com/chanzuckerberg/software-mentions/issues/2#event-7416509765 :)
To be clear the idea would be to:
- Filter the dataset to those papers which are in Wikidata
- Find Wikidata IDs for the software which is mentioned
- Add P4510 (describes a project that uses) on the papers to the QID of the software
At each step, batching will have to be performed so as not to overwhelm WDQS. Each of these steps sounds doable to me :)
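The batching against WDQS could be sketched roughly as below. The function names, the batch size of 100, and the plain VALUES/P356 lookup are illustrative assumptions, not the actual script:

```python
# A rough sketch of the batching step: resolve paper DOIs to QIDs on WDQS
# in fixed-size chunks so no single query gets too large. Function names,
# the batch size of 100, and the VALUES/P356 query shape are assumptions.

def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def values_query(dois):
    """Build one SPARQL query resolving a batch of DOIs via DOI (P356)."""
    values = " ".join(f'"{doi}"' for doi in dois)
    return (
        "SELECT ?item ?doi WHERE { "
        f"VALUES ?doi {{ {values} }} "
        "?item wdt:P356 ?doi . }"
    )

dois = [f"10.1000/example.{i}" for i in range(250)]
batches = list(chunks(dois, 100))  # three requests instead of one huge query
```

Each query in a batch then returns only the papers that exist in Wikidata, which covers the filtering in the first step as well.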
We're not adding papers or software, right? Perhaps that would be a subsequent project after this import
I think that would be a good plan. I do not think we are adding papers now.
Step 1 done https://github.com/carlinmack/qid-id/blob/main/qid-doi-pmc-pm.csv
It's a 1.4 million line CSV collating the IDs under each QID. Note this is a replica of the three identifiers we have in Wikidata for each item, rather than a collation of the identifiers from the dataset.
The code used in this process will be stored in that repository
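For the later steps, that CSV could be loaded into lookup tables along these lines. The column order (QID, DOI, PMCID, PMID) is an assumption about the file's layout:

```python
# Minimal sketch of loading qid-doi-pmc-pm.csv into lookup tables. The
# column order (QID, DOI, PMCID, PMID) is an assumption about the layout.
import csv
import io

def load_lookup(text):
    """Return DOI->QID and PMID->QID dictionaries from CSV text."""
    doi_to_qid, pmid_to_qid = {}, {}
    for qid, doi, pmcid, pmid in csv.reader(io.StringIO(text)):
        if doi:
            doi_to_qid[doi.upper()] = qid  # Wikidata stores DOIs uppercased
        if pmid:
            pmid_to_qid[pmid] = qid
    return doi_to_qid, pmid_to_qid

sample = "Q112289793,10.1234/ABC,PMC123,456\nQ64974740,,PMC789,790\n"
doi_qids, pmid_qids = load_lookup(sample)
```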
For step 2 we have 97k different pieces of software after disambiguation. Going to be interesting trying to find QIDs for them
They provide identifiers for the pieces of software they have mined. Wikidata has properties for all of these identifiers, so one intermediate step could be to go through the respective repositories (e.g. Bioconductor), check Wikidata for matching items, and add the corresponding property (e.g. Bioconductor project (P10892)). We could then use these properties to check against their data.
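That intermediate step amounts to a join on (repository, package ID) pairs; a sketch with purely illustrative data shapes, since the real inputs would come from SPARQL dumps and the CZI files:

```python
# Sketch of matching CZI software entries to Wikidata items via shared
# package-index identifiers (Bioconductor / CRAN / PyPI). The data shapes
# are illustrative; real inputs would come from SPARQL and the CZI files.

def match_by_index(wikidata_items, czi_entries):
    """Return {CZI software name: QID} where a (repo, package) ID matches."""
    index = {}
    for qid, links in wikidata_items.items():
        for repo, pkg in links:
            index[(repo, pkg)] = qid
    return {
        name: index[(repo, pkg)]
        for name, repo, pkg in czi_entries
        if (repo, pkg) in index
    }

wd = {"Q326489": [("cran", "ggplot2")], "Q113018293": [("bioconductor", "DESeq2")]}
czi = [("ggplot2", "cran", "ggplot2"), ("mystery-tool", "pypi", "mystery")]
```

Entries without a matching index ID (like the hypothetical "mystery-tool" above) simply drop out, which is also a cheap way to count the software missing from Wikidata.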
Another aspect of this is the creation of missing items, so it would be good to get an idea of
- papers in their corpus that are not in Wikidata yet
- software that is prominent in their corpus but missing on Wikidata
In both cases, it would be sufficient initially to just get the rough number, but eventually, we might want to have a more complete list, so we can think about creating some of those missing items.
A bit over half of the ca. 2k Bioconductor packages seem to be tagged as such already: https://w.wiki/5kWi .
Thanks for the pointer :) They have three indices they use: Bioconductor, CRAN and PyPI. There are 17,540 packages linked to these indices; of these, 1,481 are in Wikidata (554 PyPI, 913 Bioconductor and 14 CRAN).
I think it'll be best to restrict the project to the 1.4M paper QIDs we have and the 1.4k software QIDs we have for now
They also checked GitHub and SciCrunch - might be worth including them in the script. Yes, 1.4M paper QIDs and ~1.4k software QIDs are certainly enough to get started.
How to prioritize? We could go
- by paper
  - randomly
  - by some specific measure, e.g. number of annotations with some key properties like
    - P921 (main subject)
    - P50 (author)
    - P4510 (describes a project that uses)
    - P2860 (cites) and perhaps the inverse, i.e. classical citation counts
    - P577 (publication date)
- by software
  - randomly
  - by some specific measure, e.g. number of annotations with some key properties like
    - P275 (copyright license)
    - P348 (software version identifier)
    - P1324 (source code repository)
    - P277 (programming language)
- by author
- by topic, etc.
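For the software branch, one such "specific measure" could be counted directly from the items' statements; an illustrative sketch (the QIDs and property sets in the sample data are made up):

```python
# One possible prioritisation: rank software items by how many of the key
# properties listed above they already carry. QIDs and property sets in
# the sample data are illustrative.

KEY_PROPS = {"P275", "P348", "P1324", "P277"}

def rank_software(items):
    """items: {QID: set of property IDs present}; best-annotated first."""
    return sorted(items, key=lambda q: len(items[q] & KEY_PROPS), reverse=True)

data = {
    "Q326489": {"P275", "P348", "P1324", "P277"},  # fully annotated
    "Q1026367": {"P275", "P277"},
    "Q22442795": set(),
}
```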
I was thinking of exhaustively going through the list and generating a list of triples
paperQID,P4510,softwareQID
and then we can figure out how to import the list of triples after?
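The triple generation is essentially a join of the two lookups over the mention records, deduplicated; a sketch with illustrative identifiers:

```python
# Sketch of the triple generation: join the paper lookup and the software
# lookup over the mention records and deduplicate. All identifiers here
# are illustrative.

def make_triples(mentions, paper_qids, software_qids):
    """mentions: iterable of (paper DOI, software name) pairs."""
    triples = set()
    for doi, name in mentions:
        paper = paper_qids.get(doi)
        software = software_qids.get(name)
        if paper and software:
            triples.add((paper, "P4510", software))
    return sorted(triples)

mentions = [("10.1/A", "ggplot2"), ("10.1/A", "ggplot2"), ("10.1/B", "unknown")]
papers = {"10.1/A": "Q112289793"}
software = {"ggplot2": "Q326489"}
```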
Isn't "stated in" implied by adding it to the paper entry?
"Object named as" is smart :)
"stated in" would point to the CZI dataset. We are using this as our source, and there is the possibility that they have misinterpreted what is written in the paper, e.g. in case of homonyms.
Ah good call!
At our most conservative we have 71.1k triples :) I've only included the curated top-1k software packages in this generation. I think I want to figure out how to add P1932 (object named as) and P248 (stated in) before starting step 4: importing to Wikidata
First ten rows for interest:
Q112289793,P4510,Q326489
Q64974740,P4510,Q113334665
Q64084169,P4510,Q104854189
Q64084169,P4510,Q326489
Q112717301,P4510,Q113018293
Q112610188,P4510,Q112236343
Q112298293,P4510,Q326489
Q112298293,P4510,Q22442795
Q112298293,P4510,Q1026367
Q112289158,P4510,Q113047099
In QuickStatements V1 format, P1932 and P248 can be brought in as follows:
Q112289793|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
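A tiny formatting helper along these lines (illustrative, not the actual script) could emit such rows, quoting the mention string as P1932 and referencing the CZI dataset item via S248:

```python
# Illustrative helper for emitting QuickStatements V1 rows: the mention
# string goes in quotes as P1932 (object named as), and the CZI dataset
# item is attached as the S248 (stated in) reference.

def qs_line(paper_qid, software_qid, mention, dataset_qid="Q114078827"):
    return f'{paper_qid}|P4510|{software_qid}|P1932|"{mention}"|S248|{dataset_qid}'
```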
I've generated them in quickstatements v1 format. First ten rows:
Q112289793|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q64974740|P4510|Q113334665|P1932|"DART"|S248|Q114078827
Q64084169|P4510|Q104854189|P1932|"dplyr"|S248|Q114078827
Q64084169|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q112717301|P4510|Q113018293|P1932|"DESeq"|S248|Q114078827
Q112610188|P4510|Q112236343|P1932|"limma"|S248|Q114078827
Q112298293|P4510|Q326489|P1932|"R package ggplot2"|S248|Q114078827
Q112298293|P4510|Q22442795|P1932|"scikit-image"|S248|Q114078827
Q112298293|P4510|Q1026367|P1932|"scikit-learn"|S248|Q114078827
Q112289158|P4510|Q113047099|P1932|"MSstats"|S248|Q114078827
I also generated statements for the non-curated software, but this is a bit riskier. Last 10 rows of that file:
Q92422791|P4510|Q107382801|P1932|"Google"|S248|Q114078827
Q104485273|P4510|Q107382801|P1932|"Google"|S248|Q114078827
Q104140773|P4510|Q113334690|P1932|"edgeR"|S248|Q114078827
Q104801337|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q37628415|P4510|Q112236343|P1932|"Linear Models for Microarray Data (limma)"|S248|Q114078827
Q104614211|P4510|Q107381604|P1932|"meta"|S248|Q114078827
Q64099316|P4510|Q113018293|P1932|"R package DESeq2"|S248|Q114078827
Q26774199|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q90194454|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q58608333|P4510|Q107382801|P1932|"Google"|S248|Q114078827
Details
The first dataset I linked above contains only things classified as software.
Todo: I should exclude things classified as not_software in this broader file if we are to use it.
I had a look at the first file and am wondering whether you were checking for existing paperQID|P4510|softwareQID statements. For instance, Q28383404|P4510|Q326489|P1932|"ggplot2" is already in, just with a different reference.
I am fine having both references - just wondering whether your workflow takes existing statements into account in any way.
Nope, I haven't pulled this data yet, but I can do so! What does Wikidata do if you try to add an already existing statement?
If the reference is different, the statement will be added. If everything is the same, QuickStatements might filter it out - I am going to do a quick test.
Batch 99632 running the command
Q15397819|P4510|Q113334665|P1932|"DART"|S248|Q114078827
yielded https://www.wikidata.org/w/index.php?title=Q15397819&oldid=1739284446#P4510 :
Running batch 99633 afterwards did not result in any errors or edits:
Now, when changing the reference to
Q15397819|P4510|Q113334665|P1932|"DART"|S248|Q229883
as per batch 99635,
this results in one edit that adds an additional reference to the existing referenced statement:
Apart from thinking about cases with existing P4510 statements, we should also look into software for which Wikidata has none, e.g. as currently the case with phyloseq (Q106407822), as per https://w.wiki/5m4B :
So I am going to run a QS batch for that now:
curl https://raw.githubusercontent.com/carlinmack/qid-id/main/qsv1.csv | grep 'P4510|Q106407822|P1932' > phyloseq.qs
curl https://quickstatements.toolforge.org/api.php \
-d action=import \
-d submit=1 \
-d username=Research_Bot \
-d "batchname=Usage of phyloseq (Q106407822) according to CZI Software Mentions (Q114078827)" \
--data-raw 'token=REDACTED' \
--data-urlencode [email protected]
This batch is now running as https://quickstatements.toolforge.org/#/batch/99636 .
Above, I should have made it more explicit that batch 99633 ran the exact same command as 99632, i.e. Q15397819|P4510|Q113334665|P1932|"DART"|S248|Q114078827.
Nice! Very satisfying to see the change in https://w.wiki/5m4B
Here is the Scholia /use/ profile for phyloseq right after the batch finished: https://scholia.toolforge.org/use/Q106407822 .
Amongst other things, the approximate date when the publication data was downloaded can be inferred.
I also checked for papers that have phyloseq in their title and gave the five I found a P921 (main subject) tag for it: https://quickstatements.toolforge.org/#/batch/99647 .
On that basis, the /topic/ profile looks like this:
I set up four more batches:
Quick overview of the most common strings in the qsv1.csv: curl https://raw.githubusercontent.com/carlinmack/qid-id/main/qsv1.csv |tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -50
152696
75980 S248
75980 Q114078827
75980 P4510
75980 P1932
14833 Q113018293
12620 Q112236343
12349 Q326489
11704 ggplot2
8861 Q113334690
8763 limma
8623 DESeq2
6874 edgeR
5687 package
4918 R
4394 DESeq
3809 WGCNA
3809 Q102537983
3656 Q1026367
2967 learn
2605 Limma
2361 scikit
2047 Q106407822
1987 EdgeR
1743 Q113334509
1557 phyloseq
1343 Seurat
1343 Q85699649
1252 LIMMA
1236 affy
1021 Q104854189
1021 dplyr
998 Q114076742
980 Scikit
810 MIRA
721 Q113334751
721 GSVA
720 Q114077514
716 DEseq2
645 ggplot
637 Q114076291
637 DEGseq
507 Affy
490 Phyloseq
471 RTCA
471 Q114077291
466 sva
451 Q113334816
451 minfi
447 DEseq
I am running batch for dplyr too: https://quickstatements.toolforge.org/#/batch/99667 .
I have considered running one for Seurat, but while inspecting some random data points here, the line Q98783009|P4510|Q85699649|P1932|"Seurat"|S248|Q114078827 indicates that this needs some more thought, since the only mentions of anything Seurat* in the article are as "Seurat2" and "Seurat 2".
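One way to guard against such cases would be to require an exact match between the mention string and the canonical software name, after stripping common wrappers like "R package"; the wrapper pattern below is an assumption, and a rough sketch rather than a complete normalisation:

```python
# Sketch of a stricter filter prompted by the Seurat case: only keep a
# mention if, after stripping common wrappers like "R package", it equals
# the canonical software name exactly. "Seurat2" and "Seurat 2" would then
# no longer count as plain "Seurat". The wrapper pattern is an assumption.
import re

def mention_matches(mention, canonical_name):
    cleaned = re.sub(r"^(the\s+)?R\s+package\s+", "", mention.strip(), flags=re.I)
    return cleaned.casefold() == canonical_name.casefold()
```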