openVirus

Data visualisation

Open TS404 opened this issue 4 years ago • 16 comments

I'm a big fan of a good visualisation. I'm going to start thinking about some possible visualisations using [R] and Shinyapps. Any assistance and ideas welcomed on possible individual or combined visualisations for:

  • Topics, themes, findings
  • Citations
  • Authors
  • Changes over time
  • Others?

Some existing examples:

Obviously networks and multidimensional scaling projections could be useful. Also probably circos plots between themes?

TS404 avatar Apr 07 '20 10:04 TS404

Very keen on this. If you can make this appeal to a citizen audience that would be great. Citations and Authors may be seen as niche academic subjects whereas themes in everyday discourse (respirator, social distance) are likely to engage people.

petermr avatar Apr 08 '20 17:04 petermr

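It's possible to do simple static diagrams via [R] packages like igraph (https://igraph.org/r/).

A major bonus, however, might be interactive/responsive graphics. I've tested out the networkD3 (https://christophergandrud.github.io/networkD3) and chorddiag (https://github.com/mattflor/chorddiag) packages, both of which are based on D3.js (https://d3js.org/) run via the R2D3 package (https://rstudio.github.io/r2d3/articles/gallery.html). See

These should at least be sufficient to adapt for grouping and displaying sets of articles based on coauthors, citations or topics.

Ideally, I would eventually love to use the bundle variant of a chord diagram (example1: https://observablehq.com/@d3/hierarchical-edge-bundling, example2: https://observablehq.com/@d3/hierarchical-edge-bundling/2, tutorial: https://www.youtube.com/watch?v=ROflkF1CVhI).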

Initial tests of network for some covid authors:

image

TS404 avatar Apr 14 '20 12:04 TS404

Great!

On Tue, Apr 14, 2020 at 1:30 PM Thomas Shafee wrote:

> It's possible to do simple static diagrams via [R] packages like igraph https://igraph.org/r/.
>
> A major bonus, however, might be interactive/responsive graphics. I've tested out the networkD3 https://christophergandrud.github.io/networkD3 and chorddiag https://github.com/mattflor/chorddiag packages (both of which are based on D3.js https://d3js.org/ run via the R2D3 package https://rstudio.github.io/r2d3/articles/gallery.html). See

Excellent. This could work for co-occurrences, e.g. in countries or diseases. I have just created (but not pushed) the first extraction of biorxiv700 (695 papers); due to coding bugs there are only 600.

> These should at least be sufficient to adapt for grouping and displaying sets of articles based on coauthors, citations or topics.
>
> Ideally, eventually would love to use the bundle variant of a chord diagram (example1 https://observablehq.com/@d3/hierarchical-edge-bundling or example2 https://observablehq.com/@d3/hierarchical-edge-bundling/2, tutorial https://www.youtube.com/watch?v=ROflkF1CVhI).
>
> Initial tests of network for some covid authors:

Excellent.

How do you want to receive the data?

P.

> [image: image] https://user-images.githubusercontent.com/10216013/79225072-86853c80-7e9f-11ea-91cf-1e2c68071adf.png

View it on GitHub: https://github.com/petermr/openVirus/issues/40#issuecomment-613414512

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr avatar Apr 14 '20 13:04 petermr

I've now added the visualisation code to wikiPackageTesting.R in the #Visualisations section

Data options:

  • Easiest: I should be able to read the HTML tables (e.g. full.dataTables.html). A matrix of article vs (author / topic / citing article) in any tabular format (csv, tsv, whatever) should be sufficient to import.
  • Most size-efficient: a table of edges (start, end, weight for each) and a table of nodes (name and properties for each), as in the MisLinks and MisNodes example datasets here.
  • Ideal: store all info in Wikidata, where I can then pull it via SPARQL, e.g. all publications with a main subject (P921) of COVID-19 (Q84263196), SARS-CoV-2 (Q82069695), coronavirus (Q290805), etc., along with their other topics, authors, citations, etc. For example:
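# Scholarly articles whose main subject is one of the listed items,
# with optional publication date, topics, authors and citing works.
SELECT DISTINCT ?work ?workLabel ?pdate ?topic ?topicLabel ?author1 ?author1Label ?citing_work WHERE {
  VALUES ?topics { wd:Q82069695 wd:Q84263196 wd:Q81068910 }  # seed topics
  ?work wdt:P31 wd:Q13442814;                 # instance of: scholarly article
    wdt:P921 ?topics.                         # main subject: one of the seed topics
  OPTIONAL { ?work wdt:P577 ?pdate. }         # publication date
  OPTIONAL { ?work wdt:P921 ?topic. }         # every main subject of the work
  OPTIONAL { ?work wdt:P50  ?author1. }       # author (Wikidata item)
  OPTIONAL { ?citing_work wdt:P2860 ?work. }  # works that cite this one
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?pdate ?work ?workLabel ?topic ?topicLabel ?author1 ?author1Label ?citing_work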
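A minimal Python sketch of how a query like this might be generated for an arbitrary list of topic QIDs, and turned into a request URL for the public Wikidata Query Service. The function names are hypothetical, the query is a shortened variant of the one above, and nothing is sent over the network here.

```python
import re
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_topic_query(topic_qids):
    """Build a SPARQL query for scholarly articles about any of the given QIDs."""
    for qid in topic_qids:
        # Reject anything that is not a plain QID so the query text stays well-formed.
        if not re.fullmatch(r"Q\d+", qid):
            raise ValueError(f"not a Wikidata QID: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in topic_qids)
    return (
        "SELECT DISTINCT ?work ?workLabel ?pdate ?topic ?topicLabel WHERE {\n"
        f"  VALUES ?topics {{ {values} }}\n"
        "  ?work wdt:P31 wd:Q13442814;\n"
        "        wdt:P921 ?topics.\n"
        "  OPTIONAL { ?work wdt:P577 ?pdate. }\n"
        "  OPTIONAL { ?work wdt:P921 ?topic. }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_topic_query(["Q82069695", "Q84263196"])
url = build_request_url(query)
print(query)
```

The resulting URL could be fetched with any HTTP client; larger queries may be slow, so a client-side timeout is probably worth setting.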

TS404 avatar Apr 15 '20 10:04 TS404

Thanks @TS404

> I've now added the visualisation code to wikiPackageTesting.R in the #Visualisations section

Well done. Can you give some screenshots?

> Data options:
>
> Easiest: I should be able to read the html tables (e.g. full.dataTables.html as a matrix of article vs (author / topic / citing article) in any tabular format (csv, tsv whatever) should be sufficient to import.

That should be possible. Note there are usually many entries in a facet cell. If you are just looking at bibliographic data we may manage things. There are multiple authors per article. How do we manage that?

And what is a "citing" article? We don't have, and won't have, a citation graph.

> Most size-efficient: a table of edges (start, end, weight for each) and a table of nodes (name and properties for each) per MisLinks and MisNodes here.

I don't understand where these edges come from, or what a MisLink or MisNode is.

> Ideal: Store all info in wikidata, where I can then pull via SPARQL e.g. all publications with a main subject (P921) of covid-19 (Q84263196), SARS-CoV-2 (Q82069695), Coronavirus (Q290805) etc along with their other topics, authors, citations, etc. e.g:

That would be great. Presumably it's a question of getting this accepted by Wikidata-ns, but DanielM put millions of bibliographic references into Wikidata.

(HEY! We should be adding QIDs for publications. That would be great!)

Note also that I have not got pointers back to bioRxiv working properly.

Are they putting preprints into Wikidata?
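On the question of where the edges come from: one common reading is that every pair of authors sharing a paper becomes an edge, weighted by the number of shared papers. A rough Python sketch of that idea follows. It emits a node table and an edge table whose source/target values are zero-based row indices into the node table, which is the layout of networkD3's MisNodes and MisLinks example data. The function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_network_tables(article_authors):
    """Turn {article: [authors]} into a node table and a weighted edge table."""
    pair_counts = Counter()   # (author_a, author_b) -> number of shared papers
    paper_counts = Counter()  # author -> number of papers
    for authors in article_authors.values():
        unique = sorted(set(authors))
        paper_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    names = sorted(paper_counts)
    index = {name: i for i, name in enumerate(names)}  # zero-based node ids
    nodes = [{"name": name, "size": paper_counts[name]} for name in names]
    links = [
        {"source": index[a], "target": index[b], "value": weight}
        for (a, b), weight in sorted(pair_counts.items())
    ]
    return nodes, links


articles = {
    "paper1": ["Author A", "Author B", "Author C"],
    "paper2": ["Author A", "Author B"],
    "paper3": ["Author C"],
}
nodes, links = build_network_tables(articles)
print(nodes)
print(links)
```

The same function would work for topics instead of authors, since it only needs a list of items per article.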

petermr avatar Apr 15 '20 12:04 petermr

When I worked at AZ, the New Opportunities group did a visualization where they examined the author list and then ranked authors by first-author, last-author and secondary-contributor counts in papers. It was a triangular plot, as you'd use for a three-component phase diagram. I think they call it a 'ternary plot':

image

So we could do that for a particular topic search and then identify the key opinion leaders.
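As a rough illustration of the counting step (not the code used at AZ), the sketch below tallies how often each author appears first, last, or elsewhere in an author list, then converts the tallies to fractions. It assumes author lists are ordered and already disambiguated, and it treats the sole author of a single-author paper as first author; both are assumptions, as are the function names.

```python
from collections import defaultdict


def author_role_counts(author_lists):
    """Count first / last / other author positions for every author."""
    counts = defaultdict(lambda: {"first": 0, "last": 0, "other": 0})
    for authors in author_lists:
        for position, author in enumerate(authors):
            if position == 0:
                role = "first"  # also covers single-author papers
            elif position == len(authors) - 1:
                role = "last"
            else:
                role = "other"
            counts[author][role] += 1
    return dict(counts)


def role_fractions(role_counts):
    """Convert each author's role counts to fractions that sum to 1."""
    fractions = {}
    for author, roles in role_counts.items():
        total = sum(roles.values())
        fractions[author] = {role: n / total for role, n in roles.items()}
    return fractions


papers = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author A"],
    ["Author A"],
]
counts = author_role_counts(papers)
fractions = role_fractions(counts)
print(counts["Author A"])
```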

deadlyvices avatar Apr 15 '20 13:04 deadlyvices

Note that we haven't got a simple approach to bibliography. We can do JATS from EPMC. JATS is not always much fun, as there can be author strings (i.e. all authors run together) and disambiguation problems (no ORCIDs). What is the driver for this? I suspect academics will use it, but who else?

petermr avatar Apr 15 '20 13:04 petermr

I think it's useful to know who is helping to lead an area of investigation. I've been playing around and have been able to generate the percentages of publications for each author as first author, last author and other. Spotfire doesn't do ternary plots, so I generated a parallel coordinate plot:

image
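Where a tool lacks a ternary plot, the three percentages can be projected onto ordinary x/y coordinates and drawn as a scatter plot inside a triangle. The sketch below uses the standard projection onto an equilateral triangle with unit sides; the choice of which role sits at which corner is arbitrary and the function name is hypothetical.

```python
import math


def ternary_to_xy(first, last, other):
    """Project (first, last, other) counts or percentages onto 2D coordinates.

    Corners: all-first at (0, 0), all-last at (1, 0),
    all-other at (0.5, sqrt(3) / 2).
    """
    total = first + last + other
    if total <= 0:
        raise ValueError("at least one value must be positive")
    last_frac = last / total
    other_frac = other / total
    x = last_frac + other_frac / 2
    y = other_frac * math.sqrt(3) / 2
    return x, y


# An author with equal shares of each role lands at the triangle's centre.
print(ternary_to_xy(1, 1, 1))
```

Because the inputs are normalised inside the function, raw counts and percentages give the same point.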

deadlyvices avatar Apr 15 '20 15:04 deadlyvices

So it's possible to generate the input

deadlyvices avatar Apr 15 '20 15:04 deadlyvices

Data format and storage

The facet cell listing multiple items is fine (essentially I'll aim to turn it into a nested list in [R]). Similarly, ideally there should be a column listing all the authors of a publication (disambiguating to QIDs will be the greatest challenge), but plain-text strings are fine as a backup.
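A small sketch of the nested-list idea in Python, under the assumption that a multi-item facet cell arrives as one delimited string. The semicolon delimiter, column names and sample rows are invented for illustration; the real layout of full.dataTables.html may differ.

```python
import csv
import io


def split_facet_cell(cell, delimiter=";"):
    """Split one multi-valued cell into a list of trimmed, non-empty items."""
    return [item.strip() for item in (cell or "").split(delimiter) if item.strip()]


def nest_rows(rows, multi_value_columns, delimiter=";"):
    """Return copies of the rows with the named columns turned into lists."""
    nested = []
    for row in rows:
        record = dict(row)
        for column in multi_value_columns:
            record[column] = split_facet_cell(row.get(column), delimiter)
        nested.append(record)
    return nested


# Hypothetical export: one row per article, several items per cell.
SAMPLE_CSV = """article,authors,topics
paper1,Author A; Author B,respirator; social distance
paper2,Author C,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
records = nest_rows(rows, ["authors", "topics"])
print(records)
```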

I've checked over at Wikidata's WikiProject COVID-19 and it seems there are already a few hundred preprints listed in Wikidata, so it shouldn't be too controversial to add all the covid-relevant ones (and eventually others).

Visualisations

Visualisations focusing on topics and publications have the clearest immediate public value, showing where the main research threads are heading.

Visualisations focusing on authors can demonstrate which authors are collaborative (and which are in silos) and in what roles and can help researchers to identify people to watch or contact for collaboration. I like the idea of separating first/middle/last if possible (like this query).

I've done a bit more stress-testing of the code for networks of different sizes (e.g. see Anthony Fauci's co-author network below). Next step, I'll start tweaking it to make the nodes=publications and the links=topic_similarity.

image

Anthony Fauci's co-author network. Larger circles and thicker lines indicate people he has co-authored more with. For an interactive version, see WDNetworkVis.nb.html
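The thread does not say how topic similarity would be measured. One simple candidate is the Jaccard index of two publications' topic sets (shared topics divided by all topics across both). The sketch below builds weighted publication-to-publication links that way; the measure, function names and sample data are all assumptions.

```python
from itertools import combinations


def jaccard(topics_a, topics_b):
    """Jaccard index of two topic collections (0.0 when both are empty)."""
    a, b = set(topics_a), set(topics_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def topic_similarity_links(publication_topics, threshold=0.0):
    """Return (pub_a, pub_b, similarity) for pairs scoring above the threshold."""
    links = []
    for (pub_a, topics_a), (pub_b, topics_b) in combinations(
        sorted(publication_topics.items()), 2
    ):
        score = jaccard(topics_a, topics_b)
        if score > threshold:
            links.append((pub_a, pub_b, score))
    return links


publications = {
    "paper1": {"respirator", "social distance"},
    "paper2": {"respirator"},
    "paper3": {"vaccine"},
}
similarity_links = topic_similarity_links(publications)
print(similarity_links)
```

Comparing every pair is quadratic in the number of publications, so a threshold or pre-filtering would likely be needed for large result sets.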

TS404 avatar Apr 16 '20 12:04 TS404

Ok, so I've managed to get the concomitant co-topic graph working reasonably robustly!

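In order to make it interactive, I've built a simple shiny app. It works fine locally, but the version on shinyapps.io seems to still be having problems (I've left a query on Stack Overflow: https://stackoverflow.com/questions/61301407/immediate-disconnect-from-server-in-shinyapps-local-working-no-errors-reported).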

  • Website: https://ts404.shinyapps.io/topicnetwork
  • Code: https://github.com/TS404/TopicNetwork

Once I've managed to get it properly working online, next steps for the visualisation:

  1. Take the topics graph from biorxiv700/full.dataTables.html as the input rather than only wikidata
  2. Present chord diagram as well
  3. Improve the click actions
     a. select node to list publications on that topic?
     b. select multiple nodes to subset?
     c. easy navigation to wikidata/publication/scholia
     d. loading time indicator? (larger wikidata queries can take >30)

image
Local instance of TS404/topicnetwork.

image
Same data visualised as chord diagram (not yet included in TS404/topicnetwork).
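A sketch of how the data behind a co-topic network or chord diagram might be assembled, assuming that two topics are linked whenever they are tagged on the same publication. It produces a symmetric square matrix of co-occurrence counts, which is the kind of input chord diagram tools generally take. This is not the TopicNetwork app's actual code, and the sample data is invented.

```python
from itertools import combinations


def cooccurrence_matrix(publication_topics):
    """Return (topics, matrix), where matrix[i][j] counts the publications
    tagged with both topics[i] and topics[j]."""
    topics = sorted({t for ts in publication_topics.values() for t in ts})
    index = {topic: i for i, topic in enumerate(topics)}
    matrix = [[0] * len(topics) for _ in topics]
    for ts in publication_topics.values():
        for a, b in combinations(sorted(set(ts)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1  # keep the matrix symmetric
    return topics, matrix


publications = {
    "paper1": ["respirator", "social distance", "vaccine"],
    "paper2": ["respirator", "vaccine"],
    "paper3": ["vaccine"],
}
topics, matrix = cooccurrence_matrix(publications)
print(topics)
print(matrix)
```

The diagonal is left at zero here; some tools expect it to hold each topic's own publication count instead.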

TS404 avatar Apr 19 '20 09:04 TS404

Well done! I also get "Disconnected from the server" - is that the problem? (Chrome)

On Sun, Apr 19, 2020 at 10:54 AM Thomas Shafee wrote:

> In order to make it interactive, I've built a simple shiny app. It works locally fine locally, but the version on shinyapps.io seems to still be having problems (I've left a query on stack overflow https://stackoverflow.com/questions/61301407/immediate-disconnect-from-server-in-shinyapps-local-working-no-errors-reported ).

petermr avatar Apr 19 '20 15:04 petermr

Ah... you seem to have got some suggestions on StackOverflow


petermr avatar Apr 19 '20 15:04 petermr

The CJ Yetman comment fixed it! Try https://ts404.shinyapps.io/topicnetwork now! I'll have to test why the fix works to avoid re-introducing the bug later, but v. useful for now.

TS404 avatar Apr 20 '20 00:04 TS404

Updates to https://ts404.shinyapps.io/topicnetwork now enable it to report back the list of publications that are about a set of subjects. Currently picked based on checkboxes, but eventually I'd like it to be based on clicking the nodes.
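The selection step can be sketched as a simple filter, whether the chosen topics come from checkboxes or from clicked nodes. The example below is in Python rather than the app's R, and assumes "about a set of subjects" means a publication carries every selected topic; the match_all flag shows the looser any-topic reading. Names and data are hypothetical.

```python
def publications_about(publication_topics, selected_topics, match_all=True):
    """List publications tagged with all (or any) of the selected topics."""
    selected = set(selected_topics)
    if not selected:
        return []  # nothing ticked, nothing to report
    hits = []
    for publication, topics in publication_topics.items():
        topics = set(topics)
        if match_all:
            matched = selected <= topics      # every selected topic present
        else:
            matched = bool(selected & topics)  # at least one present
        if matched:
            hits.append(publication)
    return sorted(hits)


publications = {
    "paper1": ["respirator", "vaccine"],
    "paper2": ["respirator"],
    "paper3": ["social distance", "vaccine"],
}
print(publications_about(publications, ["respirator", "vaccine"]))
```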

TS404 avatar Apr 27 '20 11:04 TS404

This is fantastic.

(Small comments: it's somewhat slow computationally, and it's not easy to read the labels. But it shows new clusters. Exciting.)


petermr avatar Apr 27 '20 11:04 petermr