
outlierForLayer field provides data for only one layer

Open shawnlaffan opened this issue 6 years ago • 9 comments

This is related to #27.

As an example, the online search for Acacia cangaiensis produces one record that is flagged as an outlier for three layers, Bio15, Bio17 and Bio26.

https://biocache.ala.org.au/occurrences/d97cd2e1-c871-4be5-bd50-2b963f210902

However, the data downloaded via ALA4R give only one layer, el882, which corresponds to Bio15.

Can more information be packed into this field? Or a new field be provided? A comma separated list should work well enough to state which layers a record is an outlier for.

Code to reproduce is below.

Thanks, Shawn.

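library(ALA4R)

search_term = "Acacia cangaiensis"
# Bounding polygon used for the search, as WKT
wkt_text = "POLYGON((154 -43.74,154 -9,112.9 -9,112.9 -43.74,154 -43.74))"

# Download the occurrence records for the taxon within the polygon
ala = occurrences(taxon=search_term, wkt=wkt_text, download_reason_id=7)
# Drop records that lack coordinates
ala$data = ala$data[!(is.na(ala$data$longitude) | is.na(ala$data$latitude)),]
# Show the outlierForLayer value for the record linked above
ala$data[ala$data$id == 'd97cd2e1-c871-4be5-bd50-2b963f210902', 'outlierForLayer']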

shawnlaffan avatar Mar 31 '18 00:03 shawnlaffan

Hi @shawnlaffan,

> Can more information be packed into this field? Or a new field be provided? A comma separated list should work well enough to state which layers a record is an outlier for.

Are you wanting to get the data for the other outlier layers as separate columns or are you simply needing more information about the el882 layer included in that field?

Also, is it the case that the "outlier for layer X" assertion data is NOT coming through in the download, and would having that data be sufficient?

A user story or use case would be helpful to frame the request, as well.

nickdos avatar Apr 04 '18 00:04 nickdos

Hi @nickdos,

My use case is to identify records that are outliers for two or more env layers. Many of the records that are single layer outliers seem to be OK for my purposes (admittedly that's not based on rigorous testing, though).

I had a look at the API pages, and the problem might be at the API level where the table is generated since a direct check also gives only one layer. Of course, now I cannot reproduce that since I forget which search I used. Perhaps it is the csv generation component.

In any case, direct json access contains the three outlier layers. Snippet from https://biocache.ala.org.au/ws/occurrence/d97cd2e1-c871-4be5-bd50-2b963f210902 :

processed |  
-- | --
rowKey | "dr376\|MEL\|MEL0618363A"
uuid | "d97cd2e1-c871-4be5-bd50-2b963f210902"
occurrence |  
basisOfRecord | "PreservedSpecimen"
modified | "2000-12-08"
occurrenceStatus | "present"
recordedBy | "Beauglehole, A.C."
outlierForLayers |  
0 | "el882"
1 | "el889"
2 | "el894"
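
As a rough sketch of how the layers can be read from that JSON in R: the example below parses an inline stand-in for the response with the jsonlite package. The helper name `get_outlier_layers` is hypothetical, and the nesting `processed` -> `occurrence` -> `outlierForLayers` is an assumption based on the snippet above, so it may need adjusting against the real response.

```r
library(jsonlite)

# Hypothetical helper: extract the outlier layer ids from a parsed record.
# ASSUMPTION: the ids sit under processed -> occurrence -> outlierForLayers.
get_outlier_layers <- function(rec) {
  layers <- rec$processed$occurrence$outlierForLayers
  if (is.null(layers)) character(0) else as.character(layers)
}

# Inline stand-in for the web service response (only fields shown in the snippet)
json_text <- '{
  "processed": {
    "uuid": "d97cd2e1-c871-4be5-bd50-2b963f210902",
    "occurrence": {
      "basisOfRecord": "PreservedSpecimen",
      "outlierForLayers": ["el882", "el889", "el894"]
    }
  }
}'

rec <- fromJSON(json_text)
layers <- get_outlier_layers(rec)

# TRUE when the record is an outlier for two or more layers
is_multi_layer_outlier <- length(layers) >= 2
```

`fromJSON()` also accepts a URL, so passing the web service address above in place of `json_text` should work in the same way, provided the endpoint still returns this structure.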

In terms of packing the info into the existing structures in ALA4R, multiple columns would work, but would get unwieldy pretty quickly, hence packing them into a single entry might be good, e.g. "el882;el883;el887". A space or semicolon would actually be a better separator than a comma, as otherwise csv parsing libs come into play.
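
To illustrate how a packed field would serve the use case, here is a minimal sketch that assumes `outlierForLayer` arrives as a semicolon-separated string. That format is only the proposal above, not what ALA4R currently returns, and the data frame and the helper `count_outlier_layers` are made up for the example.

```r
# Hypothetical data: outlierForLayer packed as a semicolon-separated string
recs <- data.frame(
  id = c("a", "b", "c"),
  outlierForLayer = c("el882;el889;el894", "el882", ""),
  stringsAsFactors = FALSE
)

# Count the layers listed for each record.
# Empty strings and NA values are counted as zero layers.
count_outlier_layers <- function(x, sep = ";") {
  x <- as.character(x)
  x[is.na(x)] <- ""
  lengths(strsplit(x, sep, fixed = TRUE))
}

recs$n_outlier_layers <- count_outlier_layers(recs$outlierForLayer)

# Keep only the records that are outliers for two or more layers
multi <- recs[recs$n_outlier_layers >= 2, ]
```

With a space as the separator instead, the same helper could be called with `sep = " "`.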

Hopefully that helps explain things a bit more.

Shawn.

shawnlaffan avatar Apr 04 '18 01:04 shawnlaffan

Just an update.

This is the record returned via the ALA4R::occurrences() call. The outlierForLayer field lists el882, but el889 and el894 are not listed.

"","id","catalogNumber","matchTaxonConceptLsid","scientificNameOriginal","commonName","scientificName","rank","kingdom","phylum","class","order","family","genus","species","subspecies","institutionCode","collectionCode","locality","latitudeOriginal","longitudeOriginal","geodeticDatum","latitude","longitude","coordinateUncertaintyInMetres","country","IBRA7Regions","IMCRA4Regions","state","localGovernmentAreas","minimumElevationInMetres","maximumElevationInMetres","minimumDepthInMeters","maximumDepthInMeters","collector","year","month","eventDate","basisOfRecordOriginal","basisOfRecord","sex","outlierForLayer","taxonIdentificationIssue","locationQuality","altitudeNonNumeric","assumedPresentOccurrenceStatus","badlyFormedBasisOfRecord","coordinatePrecisionMismatch","dataAreGeneralised","decimalLatLongConverted","firstOfMonth","firstOfYear","geodeticDatumAssumedWgs84","incompleteCollectionDate","inferredDuplicateRecord","invalidCollectionDate","occCultivatedEscapee","uncertaintyRangeMismatch","unrecognisedCollectionCode","unrecognisedInstitutionCode","unrecognisedOccurrenceStatus","unrecognizedGeodeticDatum"
"30","d97cd2e1-c871-4be5-bd50-2b963f210902","MEL 0618363A","http://id.biodiversity.org.au/node/apni/2894960","Acacia cangaiensis Tindale & Kodela","","Acacia cangaiensis","species","Plantae","Charophyta","Equisetopsida","Fabales","Fabaceae","Acacia","Acacia cangaiensis","","MEL","MEL","Wannon River Falls Reserve, 19 km WNW of Hamilton Post Office.",-37.6667,141.8333,"",-37.6667,141.8333,10000,"Australia","Victorian Midlands","","Victoria","Southern Grampians (S)",NA,NA,"","","Beauglehole, A.C.",1978,2,"1978-02-06","PreservedSpecimen","PreservedSpecimen","","el882","noIssue",TRUE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE

shawnlaffan avatar Apr 04 '18 02:04 shawnlaffan

Thanks @shawnlaffan. Looks like a bug (or feature) where the SOLR index has a multiValued field type but the download code is only grabbing the first value. I've logged an issue (linked above).

nickdos avatar Apr 04 '18 03:04 nickdos

Thanks @nickdos

shawnlaffan avatar Apr 04 '18 03:04 shawnlaffan

See latest comment on https://github.com/AtlasOfLivingAustralia/biocache-service/issues/195#issuecomment-445593054 for a fix

nickdos avatar Dec 10 '18 00:12 nickdos

Thanks @nickdos

shawnlaffan avatar Dec 10 '18 03:12 shawnlaffan

I'm not very familiar with ALA4R, so I'm not sure whether the suggested fix requires a code change in ALA4R. @peggynewman any ideas?

nickdos avatar Dec 10 '18 04:12 nickdos

Yes @nickdos @shawnlaffan it's an ALA4R code fix. I'll label this a bug so it can go through in the next release.

peggynewman avatar Dec 10 '18 05:12 peggynewman