Investigate DOAJ

Open petermr opened this issue 4 years ago • 13 comments

DOAJ is the world's largest collection of Open Access journals. I was pointed at it, and I think it may have significant content. It includes Redalyc - Arianna is on the board.

petermr avatar Apr 01 '20 17:04 petermr

First try... Running their search manually gives 74 hits for "n95", or 37 if restricted to approved journals. I copied the URL and tried it with curl.

pm286macbook:doaj pm286$ curl -o n95 https://doaj.org/search?source=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B%22must%22%3A%5B%7B%22term%22%3A%7B%22index.has_seal.exact%22%3A%22Yes%22%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22query_string%22%3A%7B%22query%22%3A%22n95%22%2C%22default_operator%22%3A%22AND%22%7D%7D%7D%7D%7D
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178    0   178    0     0    843      0 --:--:-- --:--:-- --:--:--   839
pm286macbook:doaj pm286$ ls
n95
pm286macbook:doaj pm286$ tree n95
n95 [error opening dir]

0 directories, 0 files
pm286macbook:doaj pm286$ more n95 
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.14.0 (Ubuntu)</center>
</body>
</html>
n95 (END)

No idea what is forbidden. Maybe they only support dumps...
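For reference, the copied search URL carries a JSON query in its `source` parameter. The sketch below only decodes that parameter so the query is readable; it does not explain the 403. The `index.has_seal.exact` filter appears to be the "approved journals" restriction, but that reading is an assumption. The function name `decode_source_query` is made up for illustration.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The URL exactly as pasted into curl above, split only for readability.
SEARCH_URL = (
    "https://doaj.org/search?source="
    "%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B"
    "%22must%22%3A%5B%7B%22term%22%3A%7B%22index.has_seal.exact%22%3A%22Yes%22%7D%7D%5D%7D%7D"
    "%2C%22query%22%3A%7B%22query_string%22%3A%7B%22query%22%3A%22n95%22"
    "%2C%22default_operator%22%3A%22AND%22%7D%7D%7D%7D%7D"
)


def decode_source_query(url):
    """Return the JSON query carried in the URL's `source` parameter."""
    # parse_qs percent-decodes the value, leaving a plain JSON string.
    params = parse_qs(urlsplit(url).query)
    return json.loads(params["source"][0])


if __name__ == "__main__":
    # Pretty-print the decoded query: a has_seal filter plus a query_string for "n95".
    print(json.dumps(decode_source_query(SEARCH_URL), indent=2))
```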

petermr avatar Apr 01 '20 17:04 petermr

~They have a 3.6GB download of the article level metadata, including URLs: https://doaj.org/public-data-dump~ Sorry misread your text.

anjackson avatar Apr 01 '20 20:04 anjackson

I think they're set up for you to use the API, e.g. this searches abstracts for 'n95':

curl -X GET --header "Accept: application/json" "https://doaj.org/api/v1/search/articles/bibjson.abstract%3A%22n95%22"

See https://doaj.org/api/v1/docs#!/Search/get_api_v1_search_articles_search_query
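A minimal Python sketch of the same request, assuming the v1 endpoint in the curl command above. It builds the percent-encoded `field:"term"` query and leaves the network call in a separate function. The names `build_search_url` and `search_articles` are made up here, and nothing is assumed about the shape of the JSON that comes back.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Endpoint taken from the curl command above.
API_ROOT = "https://doaj.org/api/v1/search/articles/"


def build_search_url(field, term):
    """Build a search URL for an exact phrase in one field, e.g. bibjson.abstract:"n95"."""
    query = f'{field}:"{term}"'
    # safe="" makes quote() encode ':' and '"' as %3A and %22.
    return API_ROOT + quote(query, safe="")


def search_articles(field, term):
    """Fetch and parse the JSON response (needs network access)."""
    request = Request(build_search_url(field, term),
                      headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the same URL as the curl example, without contacting the server.
    print(build_search_url("bibjson.abstract", "n95"))
```

Calling `search_articles("bibjson.abstract", "n95")` would perform the actual request.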

anjackson avatar Apr 01 '20 20:04 anjackson

Thanks. Yes, and they've got what may be a fulltext dump of hundreds of gigabytes. Do you know anything about these? I'll probably get the metadata. I bought a 2TB disk before the lockdown.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr avatar Apr 01 '20 20:04 petermr

The DOAJ dump is abstracts only, I think. The CORE dump is much larger and includes full text. I'm downloading it but it'll take days (it's 300GB!).

anjackson avatar Apr 01 '20 20:04 anjackson

YAY! Does it require an unbroken connection?

So... DOAJ indexes the abstracts (and probably the titles). The CORE dump isn't indexed at all, so we need Solr.

petermr avatar Apr 01 '20 20:04 petermr

Ah looks good. Will try tomorrow.

petermr avatar Apr 01 '20 22:04 petermr

I was about to comment that this is an ideal problem for Solr, much more so than running ad-hoc searches and indexing those. I'll investigate and get onto it.

deadlyvices avatar Apr 02 '20 06:04 deadlyvices

Absolutely right. We actually need both. If I understand it correctly, Solr will build a high-volume generic index, and then we apply specific dictionaries. Andy has, I think, run AMI over 8000 thesis abstracts and found 50-100 which have virus terms. We could use Solr to triage down to a few hundred viral papers and then let people use AMI on those.

Andy, will you be able to document what you have done and commit the data?
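As a rough illustration of the dictionary step, here is a toy triage in Python. It is a stand-in only and not how AMI works: it keeps the documents whose abstract contains at least `min_hits` distinct dictionary terms. The function names, the term list and the two sample abstracts are all invented for the example.

```python
import re


def matching_terms(text, terms):
    """Return the set of terms found in text (case-insensitive, whole words only)."""
    found = set()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return found


def triage(abstracts, terms, min_hits=1):
    """Map each selected document id to the sorted list of terms it matched."""
    selected = {}
    for doc_id, text in abstracts.items():
        hits = matching_terms(text, terms)
        if len(hits) >= min_hits:
            selected[doc_id] = sorted(hits)
    return selected


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = {
        "a": "N95 respirators and coronavirus transmission",
        "b": "Crystal structures of zeolites",
    }
    print(triage(sample, ["coronavirus", "n95", "influenza"]))
```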

petermr avatar Apr 02 '20 07:04 petermr

I'm downloading the CORE dump right now onto my Azure VM. Will untar it and see how big it is.

deadlyvices avatar Apr 02 '20 07:04 deadlyvices

Wow! Exciting. Any ETA? Are you able to get a sneak preview of the content? UTF-8? HTML? JSON? PDF? My guess is it's flat text, without style.

petermr avatar Apr 02 '20 09:04 petermr

It's all JSON, which is fine because Solr handles that without any problems. Just tried indexing it, and it appears to have failed because of field errors. I will need to define a schema before we can do that.
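A sketch of what defining the schema might look like, assuming Solr's Schema API (`add-field` commands POSTed as JSON to the collection's `/schema` endpoint). The field names below are hypothetical, since the actual field names in the CORE JSON are not given in this thread. The types `text_general`, `string` and `pint` are assumed to be available, as in Solr's default configset. The code only builds and prints the payloads.

```python
import json


def add_field_command(name, field_type, multi_valued=False):
    """Build one Schema API 'add-field' command for a stored, indexed field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": True,
            "indexed": True,
            "multiValued": multi_valued,
        }
    }


# Hypothetical fields: (name, Solr field type, multi-valued?).
# Replace with the fields actually present in the dump.
HYPOTHETICAL_FIELDS = [
    ("title", "text_general", False),
    ("abstract", "text_general", False),
    ("fullText", "text_general", False),
    ("doi", "string", False),
    ("year", "pint", False),
    ("authors", "text_general", True),
]


def build_schema_commands(fields):
    """Return one add-field command per (name, type, multi_valued) tuple."""
    return [add_field_command(name, ftype, multi) for name, ftype, multi in fields]


if __name__ == "__main__":
    # Each printed object would be POSTed to the collection's /schema endpoint.
    for command in build_schema_commands(HYPOTHETICAL_FIELDS):
        print(json.dumps(command))
```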

deadlyvices avatar Apr 02 '20 09:04 deadlyvices

@petermr The results from the EThOS sample are at #36

anjackson avatar Apr 02 '20 11:04 anjackson