Scraper for biorxiv and medrxiv

Open petermr opened this issue 5 years ago • 1 comment

PMR has already written a scraper but it's not optimal and needs cleaning.

More later

petermr Mar 13 '20 23:03

This is organized as a picocli command line (as is almost all of AMI). My current style is to develop new functionality as tests based on the command line, and then add it to the JAR. Here's the first test:

	public void testBiorxivSmall() throws Exception {
		
		File target = new File("target/biorxiv1");
		FileUtils.deleteDirectory(target);
		MatcherAssert.assertThat(target+" does not exist", !target.exists());
		String args = 
				"-p " + target
				+ " --site biorxiv" // the type of site 
				+ " --query coronavirus" // the query
				+ " --pagesize 1" // size of remote pages (may not always work)
				+ " --pages 1 1" // number of pages
				+ " --resultset raw clean"
				+ " --landingpage "
				+ " --fulltext html pdf"
//				+ " --limit 500"  // total number of downloaded results
			;
		new AMIDownloadTool().runCommands(args);
	}

This should translate to the following command line, where <target> is the local directory and pagesize would normally be larger (e.g. 25):

ami-download -p <target> --site biorxiv --query coronavirus --pagesize 25 --pages 1 1 \
  --resultset raw clean --landingpage --fulltext html pdf --limit 500

Please try this, and try some of the others. NOTE: some of the test files may be in my local directory and need transferring to src/test/resource/. This was to save space in the JAR and repo.
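
To make the correspondence between the test's argument string and the command line more concrete, here is a minimal sketch in plain Java. It assumes the arguments are separated by whitespace and contain no quoted spaces; ArgStringSketch and toTokens are hypothetical names for illustration and are not part of AMI, whose runCommands may tokenize differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of AMI: illustrates how the concatenated
// test string corresponds to the tokens typed on the command line.
class ArgStringSketch {

    // Assumption: tokens are separated by whitespace, with no quoted spaces.
    static List<String> toTokens(String args) {
        return Arrays.asList(args.trim().split("\\s+"));
    }

    public static void main(String[] unused) {
        String target = "target/biorxiv1"; // local output directory used in the test
        String args =
                "-p " + target
                + " --site biorxiv"
                + " --query coronavirus"
                + " --pagesize 25"
                + " --pages 1 1"
                + " --resultset raw clean"
                + " --landingpage "
                + " --fulltext html pdf"
                + " --limit 500";
        // Repeated spaces (e.g. after --landingpage) collapse when tokenized.
        List<String> tokens = toTokens(args);
        System.out.println("ami-download " + String.join(" ", tokens));
    }
}
```

The point of the sketch is only that the trailing space after --landingpage and the line-by-line concatenation are harmless: the test string and the shell command carry the same sequence of options.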

petermr Mar 28 '20 13:03