openVirus
`ami-download` and `AMIDownloadTool.runCommands(args)` behave differently
public void testBiorxivSmall() throws Exception {
File target = new File("target/biorxiv1");
if (target.exists()) {FileUtils.deleteDirectory(target);}
MatcherAssert.assertThat(target+" does not exist", !target.exists());
String args =
"-p " + target
+ " --site biorxiv" // the type of site
+ " --query coronavirus" // the query
+ " --pagesize 1" // size of remote pages (may not always work)
+ " --pages 1 1" // number of pages
+ " --fulltext pdf html"
+ " --resultset raw clean"
// + " --limit 500" // total number of downloaded results
;
new AMIDownloadTool().runCommands(args);
Assert.assertTrue("target exists", target.exists());
// check for reserved and non-reserved child files
}
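For reference, the single options String above should split into the same argument list that the shell builds for the `ami-download` command shown further down. Here is a minimal sketch of that check, under the assumption that `runCommands(String)` simply splits its argument on whitespace (the real parsing lives inside ami3):

```java
// Hypothetical check, not part of the test above.
// ASSUMPTION: AMIDownloadTool.runCommands(String) splits its argument on whitespace.
String args = "-p target/biorxiv1 --site biorxiv --query coronavirus"
        + " --pagesize 1 --pages 1 1 --fulltext pdf html --resultset raw clean";
for (String token : args.trim().split("\\s+")) {
    System.out.println(token);
}
// If these tokens match the shell invocation of ami-download, the different
// behaviour must come from inside AMIDownloadTool, not from the options themselves.
```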
This should download one page containing a single result. When run in Eclipse it gives:
Generic values (AMIDownloadTool)
================================
-v to see generic values
oldstyle true
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
project target/biorxiv1
Specific values (AMIDownloadTool)
================================
fulltext [pdf, html]
limit 2
metadata metadata
pages [1, 1]
pagesize 1
query [coronavirus]
resultSetList [raw, clean]
site biorxiv
Query: coronavirus%20sort%3Arelevance-rank%20numresults%3A1
URL https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1
runing curl :https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1 to target/biorxiv1/__metadata/resultSet1.html
wrote resultSet: /Users/pm286/workspace/cmdev/ami3/target/biorxiv1/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
Results 1
[target/biorxiv1/__metadata/resultSet1.clean.html]
download files in resultSet target/biorxiv1/__metadata/resultSet1.clean.html
result set: target/biorxiv1/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.01.30.926477v1]
running batched up curlDownloader for 1 landingPages, takes ca 1-5 sec/page
ran curlDownloader for 1 landingPages
downloaded 1 files
skipped: 10_1101_2020_01_30_926477v1
running [curl, -X, GET, https://www.biorxiv.org/content/10.1101/2020.01.30.926477v1]
.writing to :/Users/pm286/workspace/cmdev/ami3/target/biorxiv1/10_1101_2020_01_30_926477v1/abstract.html
writing to :/Users/pm286/workspace/cmdev/ami3/target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.html
writing to :/Users/pm286/workspace/cmdev/ami3/target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.pdf
target/biorxiv1
target/biorxiv1/10_1101_2020_01_30_926477v1/abstract.html
target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.html
target/biorxiv1/10_1101_2020_01_30_926477v1/fulltext.pdf
target/biorxiv1/10_1101_2020_01_30_926477v1/landingPage.html
target/biorxiv1/10_1101_2020_01_30_926477v1/resultSet.html
target/biorxiv1/10_1101_2020_01_30_926477v1/scrapedMetadata.html
target/biorxiv1/__metadata/resultSet1.clean.html
target/biorxiv1/__metadata/resultSet1.html
and terminates
When run on the command line it gives:
pm286macbook:ami3 pm286$ ami-download -p target/biorxiv --site biorxiv --query coronavirus --pagesize 1 --pages 1 1 --fulltext pdf html --resultset raw clean
Generic values (AMIDownloadTool)
================================
-v to see generic values
oldstyle true
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
0 [main] INFO org.contentmine.ami.tools.AMIDownloadTool - set output to: scraped/
project target/biorxiv
Specific values (AMIDownloadTool)
================================
fulltext [pdf, html]
limit 2
metadata metadata
pages [1, 1]
pagesize 1
query [coronavirus]
resultSetList [raw, clean]
site biorxiv
Query: coronavirus%20sort%3Arelevance-rank%20numresults%3A1
URL https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1
runing curl :https://www.biorxiv.org/search/coronavirus%20sort%3Arelevance-rank%20numresults%3A1 to target/biorxiv/__metadata/resultSet1.html
wrote resultSet: /Users/pm286/workspace/cmdev/ami3/target/biorxiv/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
Results 1
[target/biorxiv/__metadata/resultSet1.clean.html, target/biorxiv/__metadata/resultSet10.clean.html, target/biorxiv/__metadata/resultSet2.clean.html, target/biorxiv/__metadata/resultSet3.clean.html, target/biorxiv/__metadata/resultSet4.clean.html, target/biorxiv/__metadata/resultSet5.clean.html, target/biorxiv/__metadata/resultSet6.clean.html, target/biorxiv/__metadata/resultSet7.clean.html, target/biorxiv/__metadata/resultSet8.clean.html, target/biorxiv/__metadata/resultSet9.clean.html]
download files in resultSet target/biorxiv/__metadata/resultSet1.clean.html
result set: target/biorxiv/__metadata/resultSet1.clean.html
getAuthors(); NYI
metadataEntries 1
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.01.30.926477v1]
running batched up curlDownloader for 1 landingPages, takes ca 1-5 sec/page
ran curlDownloader for 1 landingPages
downloaded 1 files
download files in resultSet target/biorxiv/__metadata/resultSet10.clean.html
result set: target/biorxiv/__metadata/resultSet10.clean.html
getAuthors(); NYI
[the line above is repeated 40 times in total, once per metadata entry]
metadataEntries 40
download with curl to <tree>scrapedMetadata.html[/content/10.1101/581512v2, /content/10.1101/856518v1, /content/10.1101/2020.01.24.919282v1, /content/10.1101/732255v3, /content/10.1101/800300v1, /content/10.1101/2020.02.06.936302v3, /content/10.1101/840090v2, /content/10.1101/353037v1, /content/10.1101/2020.01.12.902452v1, /content/10.1101/606715v1, /content/10.1101/695510v1, /content/10.1101/2020.01.10.901801v1, /content/10.1101/599043v1, /content/10.1101/094623v1, /content/10.1101/271171v2, /content/10.1101/2020.03.09.984393v1, /content/10.1101/2020.03.07.982207v1, /content/10.1101/780841v1, /content/10.1101/326546v1, /content/10.1101/676155v1, /content/10.1101/2019.12.18.880849v1, /content/10.1101/777847v2, /content/10.1101/2019.12.20.885590v1, /content/10.1101/2020.02.26.966143v1, /content/10.1101/402800v1, /content/10.1101/2019.12.16.875872v2, /content/10.1101/2020.02.16.946699v1, /content/10.1101/498998v1, /content/10.1101/2020.02.10.942847v1, /content/10.1101/623819v1, /content/10.1101/485060v1, /content/10.1101/476341v1, /content/10.1101/2020.04.02.020081v1, /content/10.1101/548909v1, /content/10.1101/2020.03.25.007534v1, /content/10.1101/2020.01.09.900555v1, /content/10.1101/812313v1, /content/10.1101/804716v1, /content/10.1101/2019.12.21.885921v2, /content/10.1101/296996v1]
running batched up curlDownloader for 40 landingPages, takes ca 1-5 sec/page
The command-line run ignores the page restrictions (`--pagesize 1 --pages 1 1`) and starts downloading everything.
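A possible way to narrow this down (a suggestion only, not something run here): the Specific values block is the same in both runs apart from the project directory, so the divergence presumably appears after option parsing. Calling the tool from Java with the arguments already split into an array, exactly as the shell passes them to `ami-download`, would show whether the fault is in the page handling or in how the single String is split. This assumes a `runCommands(String[])` overload; treat that signature as an assumption.

```java
// Diagnostic sketch; the runCommands(String[]) overload is an assumption.
String[] argv = {
    "-p", "target/biorxiv", "--site", "biorxiv", "--query", "coronavirus",
    "--pagesize", "1", "--pages", "1", "1",
    "--fulltext", "pdf", "html", "--resultset", "raw", "clean"
};
new AMIDownloadTool().runCommands(argv);
// If this run also walks resultSet1..resultSet10, the --pages restriction is
// lost in the download loop itself; if it stops after one page, the difference
// lies in how the single options String is split into arguments.
```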