openVirus
Investigate DOAJ
DOAJ is the world's largest collection of Open Access Journals. I was pointed at it, and I think it may have significant content. It includes Redalyc, and Arianna is on the board.
First try...
Running their query manually gives 74 hits for "n95"; restricting it to approved journals gives 37. Copy the URL and try it with curl:
pm286macbook:doaj pm286$ curl -o n95 https://doaj.org/search?source=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B%22must%22%3A%5B%7B%22term%22%3A%7B%22index.has_seal.exact%22%3A%22Yes%22%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22query_string%22%3A%7B%22query%22%3A%22n95%22%2C%22default_operator%22%3A%22AND%22%7D%7D%7D%7D%7D
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 178 0 178 0 0 843 0 --:--:-- --:--:-- --:--:-- 839
pm286macbook:doaj pm286$ ls
n95
pm286macbook:doaj pm286$ tree n95
n95 [error opening dir]
0 directories, 0 files
pm286macbook:doaj pm286$ more n95
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.14.0 (Ubuntu)</center>
</body>
</html>
n95 (END)
No idea what is forbidden. Maybe they only support dumps...
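For reference, the `source` parameter in the URL above is just percent-encoded JSON. A minimal Python sketch to decode it and see what the search page sent (it looks like an Elasticsearch-style query; the `has_seal` filter is presumably the "approved journals" restriction, but that is a guess). This only decodes the URL and does not explain the 403.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The URL copied from the DOAJ search page, split across lines for readability.
SEARCH_URL = (
    "https://doaj.org/search?source="
    "%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B"
    "%22must%22%3A%5B%7B%22term%22%3A%7B%22index.has_seal.exact%22%3A%22Yes%22"
    "%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22query_string%22%3A%7B%22query%22%3A"
    "%22n95%22%2C%22default_operator%22%3A%22AND%22%7D%7D%7D%7D%7D"
)


def decode_source(url: str) -> dict:
    """Return the JSON object carried in the 'source' query parameter."""
    params = parse_qs(urlsplit(url).query)  # parse_qs percent-decodes the values
    return json.loads(params["source"][0])


decoded = decode_source(SEARCH_URL)
print(json.dumps(decoded, indent=2))
```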
~They have a 3.6GB download of the article level metadata, including URLs: https://doaj.org/public-data-dump~ Sorry misread your text.
I think they're set up for you to use the API, e.g. this searches abstracts for 'n95':
curl -X GET --header "Accept: application/json" "https://doaj.org/api/v1/search/articles/bibjson.abstract%3A%22n95%22"
See https://doaj.org/api/v1/docs#!/Search/get_api_v1_search_articles_search_query
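The same request can be scripted. A rough Python sketch, assuming the endpoint shown in the curl command; the response shape used in `titles` (a `results` list whose items carry a `bibjson` object with a `title`) is an assumption to check against the API docs, and the helper names are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://doaj.org/api/v1/search/articles/"


def build_search_url(field: str, phrase: str) -> str:
    """Build the search URL for an exact phrase in one field."""
    query = f'{field}:"{phrase}"'
    # safe="" so that ':' and '"' are percent-encoded, as in the curl example
    return API_BASE + quote(query, safe="")


def fetch(url: str) -> dict:
    """Perform the GET request (needs network access; not called here)."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def titles(response: dict) -> list:
    """Pull article titles out of a response (ASSUMED response layout)."""
    return [r.get("bibjson", {}).get("title", "")
            for r in response.get("results", [])]


url = build_search_url("bibjson.abstract", "n95")
print(url)
# To run the search for real: print(titles(fetch(url)))
```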
Thanks. Yes, and they've got what looks like a full-text dump of hundreds of gigabytes. Do you know anything about it? I'll probably get the metadata. I bought a 2 TB disk before the lockdown.
On Wed, Apr 1, 2020 at 9:06 PM Andy Jackson wrote:
They have a 3.6GB download of the article level metadata, including URLs: https://doaj.org/public-data-dump
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petermr/openVirus/issues/32#issuecomment-607463466, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTCS4QBO3PDKN3METV2RTRKONEZANCNFSM4LZJOX6A .
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
The DOAJ dump is abstracts only, I think. The CORE dump is much larger and includes full text. I'm downloading it but it'll take days (it's 300GB!).
YAY! Does it require an unbroken connection?
So... DOAJ indexes the abstracts (and probably the titles). The CORE dump comes with no index at all, so we need Solr for that.
Ah looks good. Will try tomorrow.
On Wed, Apr 1, 2020 at 9:10 PM Andy Jackson wrote:
I think they're set up for you to use the API, e.g. this searches abstracts for 'n95':
curl -X GET --header "Accept: application/json" "https://doaj.org/api/v1/search/articles/bibjson.abstract%3A%22n95%22"
See https://doaj.org/api/v1/docs#!/Search/get_api_v1_search_articles_search_query
I was about to comment that this is an ideal problem for Solr, much more so than running ad-hoc searches and indexing those. I'll investigate and get onto it.
Absolutely right. We actually need both. If I've got it right, Solr will build a high-volume generic index, and then we apply specific dictionaries. Andy has, I think, run AMI over 8000 thesis abstracts and found 50-100 that contain virus terms. We could use Solr to triage down to a few hundred viral papers and then let people use AMI on those.
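To make the triage idea concrete, here is a toy sketch in plain Python. It is neither Solr nor AMI, just the "filter a big pile of abstracts down with a term list" step; the term list, the record layout (`id`, `abstract`) and the sample records are all invented for illustration.

```python
import re

# Hypothetical mini-dictionary; a real one would come from the AMI dictionaries.
VIRUS_TERMS = ["virus", "coronavirus", "n95", "influenza"]


def build_pattern(terms):
    """Compile one case-insensitive whole-word pattern for all terms."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(escaped) + r")\b", re.IGNORECASE)


def triage(records, terms):
    """Return (id, matched terms) for each record whose abstract has a hit."""
    pattern = build_pattern(terms)
    hits = []
    for rec in records:
        found = {m.lower() for m in pattern.findall(rec.get("abstract", ""))}
        if found:
            hits.append((rec["id"], sorted(found)))
    return hits


# Made-up sample records, not real data.
SAMPLE = [
    {"id": "a1", "abstract": "Filtration efficiency of N95 respirators against influenza virus."},
    {"id": "a2", "abstract": "A study of soil chemistry."},
    {"id": "a3", "abstract": "Coronavirus spike protein structure."},
]

for rec_id, terms in triage(SAMPLE, VIRUS_TERMS):
    print(rec_id, terms)
```

At real scale the first pass would be a Solr query rather than a Python loop; the survivors are what people would then run AMI over.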
Andy, will you be able to document what you have done and commit the data?
On Thu, Apr 2, 2020 at 7:48 AM Clyde Davies wrote:
I was about to comment that this is an ideal problem for Solr, much more so than running ad-hoc searches and indexing those. I'll investigate and get onto it.
I'm downloading the CORE dump right now onto my Azure VM. Will untar it and see how big it is.
Wow! Exciting. Any ETA? Are you able to get a sneak preview of the content? UTF-8? HTML? JSON? PDF? My guess is it's flat text, without style.
On Thu, Apr 2, 2020 at 9:00 AM Clyde Davies wrote:
I'm downloading the CORE dump right now onto my Azure VM. Will untar it and see how big it is.
It's all JSON, which is fine because Solr handles that without any problems. Just tried indexing it, and it appears to have failed because of field errors. I will need to define a schema before we can do that.
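For anyone following along, defining the schema could look something like the sketch below, which builds an `add-field` command for Solr's Schema API and (optionally) posts it. The field names are guesses, since the actual keys in the CORE JSON have not been listed in this thread, and the core URL is a placeholder; `text_general` and `string` are field types from Solr's default configset. Whether this addresses the actual field errors is not known.

```python
import json
from urllib.request import Request, urlopen

# HYPOTHETICAL field names; replace with the real keys found in the CORE JSON.
FIELDS = [
    ("title", "text_general"),
    ("abstract", "text_general"),
    ("fullText", "text_general"),
    ("doi", "string"),
]


def add_field_command(fields):
    """Build a Schema API 'add-field' command for (name, type) pairs."""
    return {
        "add-field": [
            {"name": name, "type": ftype, "stored": True, "indexed": True}
            for name, ftype in fields
        ]
    }


def post_schema(solr_core_url: str, command: dict) -> dict:
    """POST the command to <core>/schema (needs a running Solr; not called here)."""
    req = Request(
        solr_core_url.rstrip("/") + "/schema",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


command = add_field_command(FIELDS)
print(json.dumps(command, indent=2))
# Example call, with a placeholder core name:
# post_schema("http://localhost:8983/solr/core_dump", command)
```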
@petermr The results from the EThOS sample are at #36