openVirus
Documenting Testing Expanding AMIDownload
AMIDownloadTool is a wrapper for various ways of crawling and scraping sites. The best developed is biorxiv. This is complex (a rough sketch of the flow follows the list):
- A manual search on biorxiv gives a hit list in HTML; we turn this into a single file ("ResultSet")
- this set points to individual landing pages, which we download in HTML
- these then point to individual fulltext.html and fulltext.pdf files
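For orientation, here is a minimal sketch of that flow in Java, using jsoup for the HTML handling. This is illustration only, not how AMIDownloadTool is implemented; the file name, base URL and CSS selector are assumptions.

```java
// Illustration only: NOT the AMIDownloadTool implementation.
// The file name, base URL and CSS selector below are assumptions.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BiorxivFlowSketch {
    public static void main(String[] args) throws IOException {
        // 1. the hit list ("ResultSet") saved from a manual biorxiv search
        Document hitList = Jsoup.parse(
                new File("resultSet1.html"), "UTF-8", "https://www.biorxiv.org");

        // 2. each hit points to a landing page, which we download as HTML
        int serial = 0;
        for (Element hit : hitList.select("a.highwire-cite-linked-title")) { // selector is a guess
            Document landingPage = Jsoup.connect(hit.absUrl("href")).get();
            Files.writeString(Path.of("landingPage" + (++serial) + ".html"),
                    landingPage.outerHtml());
            // 3. the landing page in turn links to fulltext.html / fulltext.pdf
            //    (see the meta-tag sketch further down this thread)
        }
    }
}
```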
Lezan, I created a small test in that class which needs manual checking. Does it give the same files as the manual search-and-click gives? Let's ask @anjackson what the right names are...
I have had a look at AMIDownloadTest; these are the errors I found:
A few seem to have an IllegalThreadStateException, which I am not sure how to go about fixing. testBiorxivSmall() only fails because there is no value for the landingpage argument on line 67, and it is missing a comma between html and pdf on line 68.
testBiorxivClimate could be a false assert statement: it's looking for a folder called metadata and a file called page1.html, but a __metadata folder is created instead, containing files called resultSetX.html. Are those the same thing?
Current Issues in AMIDownloadTest (run through eclipse):
Errors:
testHALSearchResultSet(): picocli.CommandLine$ExecutionException: Error while calling command (org.contentmine.ami.tools.AMIDownloadTool@3a53c76a): java.lang.RuntimeException: nu.xom.ParsingException: The declaration for the entity "HTML.Version" must end with '>'. at line 31, column 3 in http://www.w3.org/TR/html4/loose.dtd (a DTD workaround sketch follows this list)
testCreateUnpopulatedCTreesFromResultSet(): source dir target/biorxiv/climate doesn't exist because the test ran before testBiorxivClimate, which creates it (an ordering sketch follows this list)
testDownloadAndSearchLongIT(): missing argument for pagesize on line 566
Test Failures:
testAMISearch(): not creating the testsearch3 dir
testSections(): same as above
testSearch(): same
testBiorxivClimate(): assertion error for landingPage (doesn't exist)
testRelativeFile(): assertNotNull fails on file
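The loose.dtd / HTML.Version error in testHALSearchResultSet() usually means the XML parser is trying to fetch and parse the HTML 4 DTD referenced by the page's DOCTYPE. A common workaround, sketched below under the assumption that the page is parsed with XOM (the file name is a placeholder), is to give the Builder an XMLReader whose EntityResolver returns an empty stream so external DTDs are never fetched:

```java
// Sketch of a workaround, not a patch to AMIDownloadTool itself;
// "searchResultsPage.html" is a placeholder file name.
import java.io.File;
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import nu.xom.Builder;
import nu.xom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class NoDtdParseSketch {
    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // return an empty stream for any external entity, so the parser never
        // tries to download http://www.w3.org/TR/html4/loose.dtd
        reader.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader("")));
        Builder builder = new Builder(reader);
        Document page = builder.build(new File("searchResultsPage.html"));
        System.out.println("root element: " + page.getRootElement().getLocalName());
    }
}
```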
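For the ordering problem in testCreateUnpopulatedCTreesFromResultSet(), one conventional approach, sketched here only as an option (the better long-term fix is probably for the test to build its own fixture), is to skip cleanly when the output from testBiorxivClimate is missing instead of erroring:

```java
// Sketch only: skip instead of erroring when the prerequisite is missing.
import java.io.File;

import org.junit.Assume;
import org.junit.Test;

public class DownloadOrderingSketchTest {

    @Test
    public void testCreateUnpopulatedCTreesFromResultSet() {
        File sourceDir = new File("target/biorxiv/climate");
        // directory is created by testBiorxivClimate; skip cleanly if it is absent
        Assume.assumeTrue("run testBiorxivClimate first", sourceDir.exists());
        // ... the existing test body would follow here ...
    }
}
```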
Have @Ignore'd this test.
On naming (if I'm not too late/irrelevant), FWIW, this is the way I'd describe the usual flow:
- From a search, we get multiple SearchResultsPages.
- Combining these gets a Search Results Set.
- Each will usually point to a set of Landing Pages (at least that's what we call them at work).
- Each Landing Page should point to the PDF (if it's open access), hopefully using a fairly standard metatag in the HTML headers, e.g. citation_pdf_url or DC.identifier tags (section 2.F); a small extraction sketch follows below.
- We then have to grab the fulltext.pdf and convert to scholarly.html.
Unless the papers are HTML fulltext, in which case there are usually no Landing Pages, I think.
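To make the metatag step concrete, here is a tiny sketch of the lookup (using jsoup for brevity; the DC.identifier fallback and the file name are assumptions, not the AMI implementation):

```java
// Sketch only; the DC.identifier fallback and file name are assumptions.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PdfLinkSketch {
    public static void main(String[] args) throws IOException {
        Document landingPage = Jsoup.parse(new File("landingPage1.html"), "UTF-8");
        // Highwire's citation_pdf_url first, Dublin Core identifier as a fallback
        String pdfUrl = landingPage.select("meta[name=citation_pdf_url]").attr("content");
        if (pdfUrl.isEmpty()) {
            pdfUrl = landingPage.select("meta[name=DC.identifier]").attr("content");
        }
        System.out.println("fulltext.pdf should be at: " + pdfUrl);
    }
}
```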