Can we use "A large dataset of software mentions in the biomedical literature"
Context
https://arxiv.org/pdf/2209.00693.pdf
Question
Can we use "A large dataset of software mentions in the biomedical literature"
Yes. It's CC0 and supposed to be released next week.
It's available: https://github.com/chanzuckerberg/software-mentions/issues/2#event-7416509765 :)
To be clear the idea would be to:
- Filter the dataset to those papers which are in Wikidata
- Find Wikidata IDs for the software which is mentioned
- Add P4510 (describes a project that uses) on the papers to the QID of the software
At each step, batching will have to be performed so as not to overwhelm WDQS. Each of these steps sounds doable to me :)
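The batching against WDQS could be sketched roughly as below. The function names, the batch size of 100, and the plain VALUES/P356 lookup are illustrative assumptions, not the actual script:

```python
# A rough sketch of the batching step: resolve paper DOIs to QIDs on WDQS
# in fixed-size chunks so no single query gets too large. Function names,
# the batch size of 100, and the VALUES/P356 query shape are assumptions.

def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def values_query(dois):
    """Build one SPARQL query resolving a batch of DOIs via DOI (P356)."""
    values = " ".join(f'"{doi}"' for doi in dois)
    return (
        "SELECT ?item ?doi WHERE { "
        f"VALUES ?doi {{ {values} }} "
        "?item wdt:P356 ?doi . }"
    )

dois = [f"10.1000/example.{i}" for i in range(250)]
batches = list(chunks(dois, 100))  # three requests instead of one huge query
```

Each query in a batch then returns only the papers that exist in Wikidata, which covers the filtering in the first step as well.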
We're not adding papers or software, right? Perhaps that would be a subsequent project after this import
I think that would be a good plan. I do not think we are adding papers now.
Step 1 done https://github.com/carlinmack/qid-id/blob/main/qid-doi-pmc-pm.csv
It's a 1.4 million line CSV collating the IDs under each QID. Note this is a replica of the three identifiers we have in Wikidata for each item, rather than a collation of the identifiers from the dataset.
The code used in this process will be stored in that repository
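For the later steps, that CSV could be loaded into lookup tables along these lines. The column order (QID, DOI, PMCID, PMID) is an assumption about the file's layout:

```python
# Minimal sketch of loading qid-doi-pmc-pm.csv into lookup tables. The
# column order (QID, DOI, PMCID, PMID) is an assumption about the layout.
import csv
import io

def load_lookup(text):
    """Return DOI->QID and PMID->QID dictionaries from CSV text."""
    doi_to_qid, pmid_to_qid = {}, {}
    for qid, doi, pmcid, pmid in csv.reader(io.StringIO(text)):
        if doi:
            doi_to_qid[doi.upper()] = qid  # Wikidata stores DOIs uppercased
        if pmid:
            pmid_to_qid[pmid] = qid
    return doi_to_qid, pmid_to_qid

sample = "Q112289793,10.1234/ABC,PMC123,456\nQ64974740,,PMC789,790\n"
doi_qids, pmid_qids = load_lookup(sample)
```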
For step 2 we have 97k different pieces of software after disambiguation. Going to be interesting trying to find QIDs for them
They provide identifiers for the pieces of software they have mined. Wikidata has properties for all of these identifiers, so one intermediate step could be to go through the respective repositories (e.g. Bioconductor), check Wikidata for matching items, and add the corresponding property (e.g. Bioconductor project (P10892)). We could then use these properties to check against their data.
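That intermediate step amounts to a join on (repository, package ID) pairs; a sketch with purely illustrative data shapes, since the real inputs would come from SPARQL dumps and the CZI files:

```python
# Sketch of matching CZI software entries to Wikidata items via shared
# package-index identifiers (Bioconductor / CRAN / PyPI). The data shapes
# are illustrative; real inputs would come from SPARQL and the CZI files.

def match_by_index(wikidata_items, czi_entries):
    """Return {CZI software name: QID} where a (repo, package) ID matches."""
    index = {}
    for qid, links in wikidata_items.items():
        for repo, pkg in links:
            index[(repo, pkg)] = qid
    return {
        name: index[(repo, pkg)]
        for name, repo, pkg in czi_entries
        if (repo, pkg) in index
    }

wd = {"Q326489": [("cran", "ggplot2")], "Q113018293": [("bioconductor", "DESeq2")]}
czi = [("ggplot2", "cran", "ggplot2"), ("mystery-tool", "pypi", "mystery")]
```

Entries without a matching index ID (like the hypothetical "mystery-tool" above) simply drop out, which is also a cheap way to count the software missing from Wikidata.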
Another aspect of this is the creation of missing items, so it would be good to get an idea of
- papers in their corpus that are not in Wikidata yet
- software that is prominent in their corpus but missing on Wikidata
In both cases, it would be sufficient initially to just get the rough number, but eventually, we might want to have a more complete list, so we can think about creating some of those missing items.
A bit over half of the ca. 2k Bioconductor packages seem to be tagged as such already: https://w.wiki/5kWi .
Thanks for the pointer :) They have three indices they use: Bioconductor, CRAN and PyPI. There are 17,540 packages linked to these indices; of these, 1,481 are in Wikidata (554 PyPI, 913 Bioconductor and 14 CRAN).
I think it'll be best to restrict the project to the 1.4M paper QIDs we have and the 1.4k software QIDs we have for now
They also checked GitHub and SciCrunch - might be worth including them in the script. Yes, 1.4M paper QIDs and ~1.4k software QIDs are certainly enough to get started.
How to prioritize? We could go
- by paper
  - randomly
  - by some specific measure, e.g. number of annotations with some key properties like
    - P921 (main subject)
    - P50 (author)
    - P4510 (describes a project that uses)
    - P2860 (cites) and perhaps the inverse, i.e. classical citation counts
    - P577 (publication date)
- by software
  - randomly
  - by some specific measure, e.g. number of annotations with some key properties like
    - P275 (copyright license)
    - P348 (software version identifier)
    - P1324 (source code repository)
    - P277 (programming language)
- by author
- by topic, etc.
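For the software branch, one such "specific measure" could be counted directly from the items' statements; an illustrative sketch (the QIDs and property sets in the sample data are made up):

```python
# One possible prioritisation: rank software items by how many of the key
# properties listed above they already carry. QIDs and property sets in
# the sample data are illustrative.

KEY_PROPS = {"P275", "P348", "P1324", "P277"}

def rank_software(items):
    """items: {QID: set of property IDs present}; best-annotated first."""
    return sorted(items, key=lambda q: len(items[q] & KEY_PROPS), reverse=True)

data = {
    "Q326489": {"P275", "P348", "P1324", "P277"},  # fully annotated
    "Q1026367": {"P275", "P277"},
    "Q22442795": set(),
}
```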
I was thinking of exhaustively going through the list and generating a list of triples
paperQID,P4510,softwareQID
and then we can figure out how to import the list of triples after?
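The triple generation is essentially a join of the two lookups over the mention records, deduplicated; a sketch with illustrative identifiers:

```python
# Sketch of the triple generation: join the paper lookup and the software
# lookup over the mention records and deduplicate. All identifiers here
# are illustrative.

def make_triples(mentions, paper_qids, software_qids):
    """mentions: iterable of (paper DOI, software name) pairs."""
    triples = set()
    for doi, name in mentions:
        paper = paper_qids.get(doi)
        software = software_qids.get(name)
        if paper and software:
            triples.add((paper, "P4510", software))
    return sorted(triples)

mentions = [("10.1/A", "ggplot2"), ("10.1/A", "ggplot2"), ("10.1/B", "unknown")]
papers = {"10.1/A": "Q112289793"}
software = {"ggplot2": "Q326489"}
```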
Isn't "stated in" implied by adding it to the paper entry?
"Object named as" is smart :)
"stated in" would point to the CZI dataset. We are using this as our source, and there is the possibility that they have misinterpreted what is written in the paper, e.g. in case of homonyms.
Ah good call!
At our most conservative we have 71.1k triples :) I've only included the curated top-1k software packages in this generation. I think I want to figure out how to add P1932 (object named as) and P248 (stated in) before starting step 4: importing to Wikidata
First ten rows for interest:
Q112289793,P4510,Q326489
Q64974740,P4510,Q113334665
Q64084169,P4510,Q104854189
Q64084169,P4510,Q326489
Q112717301,P4510,Q113018293
Q112610188,P4510,Q112236343
Q112298293,P4510,Q326489
Q112298293,P4510,Q22442795
Q112298293,P4510,Q1026367
Q112289158,P4510,Q113047099
In QuickStatements V1 format, P1932 and P248 can be brought in as follows:
Q112289793|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
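A tiny formatting helper along these lines (illustrative, not the actual script) could emit such rows, quoting the mention string as P1932 and referencing the CZI dataset item via S248:

```python
# Illustrative helper for emitting QuickStatements V1 rows: the mention
# string goes in quotes as P1932 (object named as), and the CZI dataset
# item is attached as the S248 (stated in) reference.

def qs_line(paper_qid, software_qid, mention, dataset_qid="Q114078827"):
    return f'{paper_qid}|P4510|{software_qid}|P1932|"{mention}"|S248|{dataset_qid}'
```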
I've generated them in quickstatements v1 format. First ten rows:
Q112289793|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q64974740|P4510|Q113334665|P1932|"DART"|S248|Q114078827
Q64084169|P4510|Q104854189|P1932|"dplyr"|S248|Q114078827
Q64084169|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q112717301|P4510|Q113018293|P1932|"DESeq"|S248|Q114078827
Q112610188|P4510|Q112236343|P1932|"limma"|S248|Q114078827
Q112298293|P4510|Q326489|P1932|"R package ggplot2"|S248|Q114078827
Q112298293|P4510|Q22442795|P1932|"scikit-image"|S248|Q114078827
Q112298293|P4510|Q1026367|P1932|"scikit-learn"|S248|Q114078827
Q112289158|P4510|Q113047099|P1932|"MSstats"|S248|Q114078827
I also generated statements for the non-curated software, but this is a bit riskier. Last 10 rows of that file:
Q92422791|P4510|Q107382801|P1932|"Google"|S248|Q114078827
Q104485273|P4510|Q107382801|P1932|"Google"|S248|Q114078827
Q104140773|P4510|Q113334690|P1932|"edgeR"|S248|Q114078827
Q104801337|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q37628415|P4510|Q112236343|P1932|"Linear Models for Microarray Data (limma)"|S248|Q114078827
Q104614211|P4510|Q107381604|P1932|"meta"|S248|Q114078827
Q64099316|P4510|Q113018293|P1932|"R package DESeq2"|S248|Q114078827
Q26774199|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q90194454|P4510|Q326489|P1932|"ggplot2"|S248|Q114078827
Q58608333|P4510|Q107382801|P1932|"Google"|S248|Q114078827
Details
The first dataset I linked above contains only things classified as software.
Todo: I should exclude things classified as not_software in this broader file if we are to use it.
I had a look at the first file and am wondering whether you were checking for existing paperQID|P4510|softwareQID statements. For instance, Q28383404|P4510|Q326489|P1932|"ggplot2" is already in, just with a different reference.
I am fine having both references - just wondering whether your workflow takes existing statements into account in any way.
Nope, I haven't pulled this data yet, but I can do so! What does Wikidata do if you try to add an already existing statement?
If the reference is different, the statement will be added. If everything is the same, QuickStatements might filter it out - I am going to do a quick test.
Batch 99632 running the command
Q15397819|P4510|Q113334665|P1932|"DART"|S248|Q114078827
yielded https://www.wikidata.org/w/index.php?title=Q15397819&oldid=1739284446#P4510 :
Running batch 99633 afterwards did not result in any errors or edits:
Now, when changing the reference to
Q15397819|P4510|Q113334665|P1932|"DART"|S248|Q229883
as per batch 99635,
this results in one edit that adds an additional reference to the existing referenced statement:
Apart from thinking about cases with existing P4510 statements, we should also look into software for which Wikidata has none, e.g. as currently the case with phyloseq (Q106407822), as per https://w.wiki/5m4B :
So I am going to run a QS batch for that now:
curl https://raw.githubusercontent.com/carlinmack/qid-id/main/qsv1.csv | grep 'P4510|Q106407822|P1932' > phyloseq.qs
curl https://quickstatements.toolforge.org/api.php \
-d action=import \
-d submit=1 \
-d username=Research_Bot \
-d "batchname=Usage of phyloseq (Q106407822) according to CZI Software Mentions (Q114078827)" \
--data-raw 'token=REDACTED' \
--data-urlencode [email protected]
This batch is now running as https://quickstatements.toolforge.org/#/batch/99636 .
Above, I should have made it more explicit that batch 99633 ran the exact same command as 99632, i.e. Q15397819|P4510|Q113334665|P1932|"DART"|S248|Q114078827.
Nice! Very satisfying to see the change in https://w.wiki/5m4B
Here is the Scholia /use/ profile for phyloseq right after the batch finished: https://scholia.toolforge.org/use/Q106407822 .
Amongst other things, the approximate date when the publication data was downloaded can be inferred.
I also checked for papers that have phyloseq in their title and gave the five I found a P921 (main subject) tag for it: https://quickstatements.toolforge.org/#/batch/99647 .
On that basis, the /topic/ profile looks like this:
I set up four more batches:
Quick overview of the most common strings in the qsv1.csv: curl https://raw.githubusercontent.com/carlinmack/qid-id/main/qsv1.csv |tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -50
152696
75980 S248
75980 Q114078827
75980 P4510
75980 P1932
14833 Q113018293
12620 Q112236343
12349 Q326489
11704 ggplot2
8861 Q113334690
8763 limma
8623 DESeq2
6874 edgeR
5687 package
4918 R
4394 DESeq
3809 WGCNA
3809 Q102537983
3656 Q1026367
2967 learn
2605 Limma
2361 scikit
2047 Q106407822
1987 EdgeR
1743 Q113334509
1557 phyloseq
1343 Seurat
1343 Q85699649
1252 LIMMA
1236 affy
1021 Q104854189
1021 dplyr
998 Q114076742
980 Scikit
810 MIRA
721 Q113334751
721 GSVA
720 Q114077514
716 DEseq2
645 ggplot
637 Q114076291
637 DEGseq
507 Affy
490 Phyloseq
471 RTCA
471 Q114077291
466 sva
451 Q113334816
451 minfi
447 DEseq
I am running batch for dplyr too: https://quickstatements.toolforge.org/#/batch/99667 .
I have considered running one for Seurat, but while inspecting some random data points here, the line Q98783009|P4510|Q85699649|P1932|"Seurat"|S248|Q114078827 indicates that this needs some more thought, since the only mentions of anything Seurat* in the article are as "Seurat2" and "Seurat 2".
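One way to guard against such cases would be to require an exact match between the mention string and the canonical software name, after stripping common wrappers like "R package"; the wrapper pattern below is an assumption, and a rough sketch rather than a complete normalisation:

```python
# Sketch of a stricter filter prompted by the Seurat case: only keep a
# mention if, after stripping common wrappers like "R package", it equals
# the canonical software name exactly. "Seurat2" and "Seurat 2" would then
# no longer count as plain "Seurat". The wrapper pattern is an assumption.
import re

def mention_matches(mention, canonical_name):
    cleaned = re.sub(r"^(the\s+)?R\s+package\s+", "", mention.strip(), flags=re.I)
    return cleaned.casefold() == canonical_name.casefold()
```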