openVirus
Documenting Testing Expanding AMIDownload
AMIDownloadTool is a wrapper for various ways of crawling and scraping sites. The best developed is biorxiv. This is complex (a rough sketch of the flow follows the list):
- A manual search on biorxiv gives a hit list in HTML; we turn this into a single file ("ResultSet")
- this set points to individual landing pages, which we download in HTML
- these then point to individual fulltext.html and fulltext.pdf files
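For orientation, here is a minimal sketch of that flow in Java, using jsoup for the HTML handling. This is illustration only, not how AMIDownloadTool is implemented; the file name, base URL and CSS selector are assumptions.

```java
// Illustration only: NOT the AMIDownloadTool implementation.
// The file name, base URL and CSS selector below are assumptions.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BiorxivFlowSketch {
    public static void main(String[] args) throws IOException {
        // 1. the hit list ("ResultSet") saved from a manual biorxiv search
        Document hitList = Jsoup.parse(
                new File("resultSet1.html"), "UTF-8", "https://www.biorxiv.org");

        // 2. each hit points to a landing page, which we download as HTML
        int serial = 0;
        for (Element hit : hitList.select("a.highwire-cite-linked-title")) { // selector is a guess
            Document landingPage = Jsoup.connect(hit.absUrl("href")).get();
            Files.writeString(Path.of("landingPage" + (++serial) + ".html"),
                    landingPage.outerHtml());
            // 3. the landing page in turn links to fulltext.html / fulltext.pdf
            //    (see the meta-tag sketch further down this thread)
        }
    }
}
```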
Lezan, I created a small test in that class which needs manual checking. Does it give the same files as the manual search-and-click gives? Let's ask @anjackson what the right names are...
I have had a look at AMIDownloadTest; these are the errors I found:
A few seem to have an IllegalThreadStateException, which I am not sure how to go about fixing. testBiorxivSmall() only fails because there is no value for the landingpage argument on line 67, and it is missing a comma between html and pdf on line 68.
testBiorxivClimate could be a false assert statement: it's looking for a folder called metadata and a file called page1.html, but a __metadata folder is created instead, containing files called resultSetX.html. Are those the same thing?
Current Issues in AMIDownloadTest (run through eclipse):
Errors:
testHALSearchResultSet(): picocli.CommandLine$ExecutionException: Error while calling command (org.contentmine.ami.tools.AMIDownloadTool@3a53c76a): java.lang.RuntimeException: nu.xom.ParsingException: The declaration for the entity "HTML.Version" must end with '>'. at line 31, column 3 in http://www.w3.org/TR/html4/loose.dtd (a DTD workaround sketch follows this list)
testCreateUnpopulatedCTreesFromResultSet(): source dir target/biorxiv/climate doesn't exist because the test ran before testBiorxivClimate, which creates it (an ordering sketch follows this list)
testDownloadAndSearchLongIT(): missing argument for pagesize on line 566
Test Failures:
testAMISearch(): not creating the testsearch3 dir
testSections(): same as above
testSearch(): same
testBiorxivClimate(): assertion error for landingPage (doesn't exist)
testRelativeFile(): assertNotNull fails on file
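The loose.dtd / HTML.Version error in testHALSearchResultSet() usually means the XML parser is trying to fetch and parse the HTML 4 DTD referenced by the page's DOCTYPE. A common workaround, sketched below under the assumption that the page is parsed with XOM (the file name is a placeholder), is to give the Builder an XMLReader whose EntityResolver returns an empty stream so external DTDs are never fetched:

```java
// Sketch of a workaround, not a patch to AMIDownloadTool itself;
// "searchResultsPage.html" is a placeholder file name.
import java.io.File;
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import nu.xom.Builder;
import nu.xom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class NoDtdParseSketch {
    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // return an empty stream for any external entity, so the parser never
        // tries to download http://www.w3.org/TR/html4/loose.dtd
        reader.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader("")));
        Builder builder = new Builder(reader);
        Document page = builder.build(new File("searchResultsPage.html"));
        System.out.println("root element: " + page.getRootElement().getLocalName());
    }
}
```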
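For the ordering problem in testCreateUnpopulatedCTreesFromResultSet(), one conventional approach, sketched here only as an option (the better long-term fix is probably for the test to build its own fixture), is to skip cleanly when the output from testBiorxivClimate is missing instead of erroring:

```java
// Sketch only: skip instead of erroring when the prerequisite is missing.
import java.io.File;

import org.junit.Assume;
import org.junit.Test;

public class DownloadOrderingSketchTest {

    @Test
    public void testCreateUnpopulatedCTreesFromResultSet() {
        File sourceDir = new File("target/biorxiv/climate");
        // directory is created by testBiorxivClimate; skip cleanly if it is absent
        Assume.assumeTrue("run testBiorxivClimate first", sourceDir.exists());
        // ... the existing test body would follow here ...
    }
}
```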
Have @Ignore'd this test.
On naming (if I'm not too late/irrelevant), FWIW, this is the way I'd describe the usual flow:
- From a search, we get multiple SearchResultsPages.
- Combining these gets a Search Results Set.
- Each will usually point to a set of Landing Pages (at least that's what we call them at work).
- Each Landing Page should point to the PDF (if it's open access), hopefully using a fairly standard metatag in the HTML headers, e.g. citation_pdf_url or DC.identifier tags (section 2.F); a small extraction sketch follows below.
- We then have to grab the fulltext.pdf and convert to scholarly.html.
Unless the papers are HTML fulltext, in which case there are usually no Landing Pages, I think.
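To make the metatag step concrete, here is a tiny sketch of the lookup (using jsoup for brevity; the DC.identifier fallback and the file name are assumptions, not the AMI implementation):

```java
// Sketch only; the DC.identifier fallback and file name are assumptions.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PdfLinkSketch {
    public static void main(String[] args) throws IOException {
        Document landingPage = Jsoup.parse(new File("landingPage1.html"), "UTF-8");
        // Highwire's citation_pdf_url first, Dublin Core identifier as a fallback
        String pdfUrl = landingPage.select("meta[name=citation_pdf_url]").attr("content");
        if (pdfUrl.isEmpty()) {
            pdfUrl = landingPage.select("meta[name=DC.identifier]").attr("content");
        }
        System.out.println("fulltext.pdf should be at: " + pdfUrl);
    }
}
```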