pygbif
Trigger a download using multiple predicates
I would like to trigger a download based on a query structured as follows (just an example):
basisOfRecord in ['HUMAN_OBSERVATION', 'LITERATURE']
AND
country = 'BE'
AND
year >= 1000
AND
year <= 2019
AND
hasCoordinate = TRUE
If I try something like this:
test_download = occurrences.download(
    ['basisOfRecord = OBSERVATION',
     'basisOfRecord = LITERATURE',
     'basisOfRecord = PRESERVED_SPECIMEN',
     'basisOfRecord = MATERIAL_SAMPLE',
     'basisOfRecord = UNKNOWN',
     'basisOfRecord = HUMAN_OBSERVATION',
     'country = BE',
     'year >= 1000',
     'year <= 2019',
     'hasCoordinate = TRUE'],
    pred_type='and')
I get a valid but empty occurrence.txt file, because a single occurrence cannot have multiple values of basisOfRecord. This is clearly a query with multiple levels of predicates: an OR over the basisOfRecord values nested within a general AND across all query keys.
Via the rgbif R package I can do this easily. Below is an example with taxon keys and countries passed as vectors, where multiple values are comma separated:
rgbif::occ_download(
paste0("taxonKey = ", paste(taxon_keys, collapse = ",")),
paste0("country = ", paste(countries, collapse = ",")),
paste0("hasCoordinate = TRUE")
)
Unfortunately, I cannot pass multiple values in this way to pygbif. I am quite new to pygbif, so I am probably missing something. However, I didn't find any example in the documentation tackling such situations.
Python version:
3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 18:50:55) [MSC v.1915 64 bit (AMD64)]
pygbif version:
>>> print(pygbif.__version__)
0.3.0
Any help is welcome. Thanks.
thanks for your question @damianooldoni
@stijnvanhoey @peterdesmet you two I think did most of the download methods. Any thoughts on the above?
Here's the JSON body that's sent in that example you gave:
{
"creator": "<hidden>",
"notification_address": [
"<hidden>"
],
"send_notification": "true",
"created": 2019,
"predicate": {
"type": "and",
"predicates": [
{
"type": "equals",
"key": "BASIS_OF_RECORD",
"value": "OBSERVATION"
},
{
"type": "equals",
"key": "BASIS_OF_RECORD",
"value": "LITERATURE"
},
{
"type": "equals",
"key": "BASIS_OF_RECORD",
"value": "PRESERVED_SPECIMEN"
},
{
"type": "equals",
"key": "BASIS_OF_RECORD",
"value": "MATERIAL_SAMPLE"
},
{
"type": "equals",
"key": "BASIS_OF_RECORD",
"value": "UNKNOWN"
},
{
"type": "equals",
"key": "BASIS_OF_RECORD",
"value": "HUMAN_OBSERVATION"
},
{
"type": "equals",
"key": "COUNTRY",
"value": "BE"
},
{
"type": "greaterThanOrEquals",
"key": "YEAR",
"value": "1000"
},
{
"type": "lessThanOrEquals",
"key": "YEAR",
"value": "2019"
},
{
"type": "equals",
"key": "HAS_COORDINATE",
"value": "TRUE"
}
]
}
}
does that look as expected?
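For reference, the fix would be to wrap the basisOfRecord clauses in an "or" predicate so they no longer contradict each other inside the top-level "and". A sketch of the corrected body, which could also be posted straight to the GBIF download API, bypassing the shortcut function (creator, notification address, and credentials are placeholders):

```python
import json

# Sketch of the corrected request body: the basisOfRecord clauses are
# grouped under an "or" predicate nested inside the top-level "and".
basis_values = ["OBSERVATION", "LITERATURE", "PRESERVED_SPECIMEN",
                "MATERIAL_SAMPLE", "UNKNOWN", "HUMAN_OBSERVATION"]

body = {
    "creator": "<username>",
    "notification_address": ["<email>"],
    "send_notification": "true",
    "predicate": {
        "type": "and",
        "predicates": [
            {"type": "or", "predicates": [
                {"type": "equals", "key": "BASIS_OF_RECORD", "value": v}
                for v in basis_values
            ]},
            {"type": "equals", "key": "COUNTRY", "value": "BE"},
            {"type": "greaterThanOrEquals", "key": "YEAR", "value": "1000"},
            {"type": "lessThanOrEquals", "key": "YEAR", "value": "2019"},
            {"type": "equals", "key": "HAS_COORDINATE", "value": "TRUE"},
        ],
    },
}
print(json.dumps(body, indent=2))

# With the third-party requests library and a GBIF account, this could
# then be submitted as (untested sketch):
# import requests
# resp = requests.post("https://api.gbif.org/v1/occurrence/download/request",
#                      json=body, auth=("<username>", "<password>"))
```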
any thoughts @stijnvanhoey @peterdesmet ?
The current GbifDownload class provides the building blocks required to handle this case, using the object-oriented approach instead of the occurrences.download() shortcut function:
from pygbif.occurrences.download import GbifDownload

# initiate the download class with your GBIF user name and email
gbif_query = GbifDownload('xxxxxxxx', 'xxxxxxxxx')

# set up the query
gbif_query.add_predicate('COUNTRY', 'BE', predicate_type='equals')
gbif_query.add_predicate('YEAR', 1000, predicate_type='>=')
gbif_query.add_predicate('YEAR', 2019, predicate_type='<=')
gbif_query.add_predicate('hasCoordinate', True, predicate_type='equals')

# add the multiple-values predicate (values are combined with OR):
gbif_query.add_iterative_predicate('basisOfRecord',
    ['LITERATURE', 'OBSERVATION', 'PRESERVED_SPECIMEN',
     'MATERIAL_SAMPLE', 'UNKNOWN', 'HUMAN_OBSERVATION'])

# post the download request with your GBIF user name and password
gbif_query.post_download('xxxxxxxxx', 'xxxxxx')
So, it is more a matter of missing documentation...
The gbif_query.predicates attribute shows the predicates as set up:
> gbif_query.predicates
[{'key': 'COUNTRY', 'type': 'equals', 'value': 'BE'},
{'key': 'YEAR', 'type': 'greaterThanOrEquals', 'value': 1000},
{'key': 'YEAR', 'type': 'lessThanOrEquals', 'value': 2019},
{'key': 'hasCoordinate', 'type': 'equals', 'value': True},
{'predicates': [
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'HUMAN_OBSERVATION'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'UNKNOWN'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'MATERIAL_SAMPLE'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'PRESERVED_SPECIMEN'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'OBSERVATION'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'LITERATURE'}],
'type': 'or'}]
These are combined using the gbif_query.main_pred_type (default 'and') into the gbif_query.payload:
{'created': 2019,
'creator': <hidden>,
'notification_address': <hidden>,
'send_notification': 'true',
'predicate': {'predicates': [
{'key': 'COUNTRY', 'type': 'equals', 'value': 'BE'},
{'key': 'YEAR', 'type': 'greaterThanOrEquals', 'value': 1000},
{'key': 'YEAR', 'type': 'lessThanOrEquals', 'value': 2019},
{'key': 'hasCoordinate', 'type': 'equals', 'value': 'TRUE'},
{'predicates': [
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'HUMAN_OBSERVATION'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'UNKNOWN'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'MATERIAL_SAMPLE'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'PRESERVED_SPECIMEN'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'OBSERVATION'},
{'key': 'basisOfRecord', 'type': 'equals', 'value': 'LITERATURE'}],
'type': 'or'}],
'type': 'and'}
}
@damianooldoni can you have a check if this is correct and similar to the rgbif request?
@damianooldoni any time to take a look at question from @stijnvanhoey ? if not, i'll take a look
Yes, @sckott. It totally slipped my mind. Indeed, the query posted by @stijnvanhoey is similar to the query sent via rgbif. An example in R below:
countries <- c("BE", "NL")
basis_of_record <- c("HUMAN_OBSERVATION", "LITERATURE")
year_begin <- 1990
year_end <- 1991
rgbif::occ_download(
paste0("basisOfRecord = ", paste(basis_of_record, collapse = ",")),
paste0("country = ", paste(countries, collapse = ",")),
paste0("hasCoordinate = TRUE"),
paste0("year >= ", year_begin),
paste0("year <= ", year_end)
)
which results in the following API query:
{
"type": "and",
"predicates": [
{"type": "or", "predicates": [
{"type": "equals", "key": "BASIS_OF_RECORD", "value": "HUMAN_OBSERVATION"},
{"type": "equals", "key": "BASIS_OF_RECORD", "value": "LITERATURE"}
]},
{"type": "or", "predicates": [
{"type": "equals", "key": "COUNTRY", "value": "BE"},
{"type": "equals", "key": "COUNTRY", "value": "NL"}
]},
{"type": "equals", "key": "HAS_COORDINATE", "value": "TRUE"},
{"type": "greaterThanOrEquals", "key": "YEAR", "value": "1990"},
{"type": "lessThanOrEquals", "key": "YEAR", "value": "1991"}]
}
This has the same structure as the query posted by @stijnvanhoey: only the order of the fields changes (type-key-value vs key-type-value), which of course doesn't change anything.
I will double check the solution provided by @stijnvanhoey and if it works this issue can be closed.
As the result is the same, we should improve the pygbif documentation to make sure this use case is explained to other users as well. Or should we improve the documentation by explaining the object-oriented way of using pygbif more generally?
+1 to improving docs/adding examples
I'll test it again to be completely sure. Yes, the documentation should be improved as well. I can give it a try.
I found that this doesn't work:
gbif_query = GbifDownload(xxxxxxx, xxxxxxxxx) # user name and email
gbif_query.add_iterative_predicate('basisOfRecord', ['LITERATURE', 'HUMAN_OBSERVATION'])
gbif_query.add_iterative_predicate('taxonKey', [1898286, 1894840])
gbif_query.add_predicate('hasCoordinate', 'TRUE', predicate_type='equals')
gbif_query.post_download(xxxxxxx, xxxxxxxxx) # user name and pwd
while this works:
gbif_query.add_iterative_predicate('BASIS_OF_RECORD', ['LITERATURE', 'HUMAN_OBSERVATION'])
gbif_query.add_iterative_predicate('TAXON_KEY', [1898286, 1894840])
gbif_query.add_predicate('HAS_COORDINATE', 'TRUE', predicate_type='equals')
gbif_query.post_download(xxxxxxx, xxxxxxxxx) # user name and pwd
This means that the parameters of the shortcut function occurrences.download() are the typical ones (the same as rgbif's), while we have to use the "raw" versions of them when building queries with .add_predicate() and .add_iterative_predicate(). This has to be documented as well or, even better I think, should be changed: converting keys automatically (as occurrences.download() does) lets the user keep the same key style (e.g. hasCoordinate vs HAS_COORDINATE) while writing complex queries.
@stijnvanhoey, @sckott: what do you think?
converting for the user makes sense, what do you think @stijnvanhoey ?
@stijnvanhoey ?
I'm sorry for the delay. I agree that using the Darwin Core terms makes much more sense for the user. I would refactor the input before updating the documentation.
thanks @stijnvanhoey - agree we should refactor. Does one of you have time for this? or should I put it on my to do list?
I won't be able to do it in the coming weeks, so it would rather be November before I can contribute to this. Currently too busy on the remake of the pandas documentation ;-)
ok, thanks @stijnvanhoey - pandas docs sounds fun and impt.
I'll probably take a crack at it, but will make sure you two have a look at it
Thanks @sckott. Just back from two weeks of holidays and I don't see time to do it either. Still, I'm available for review. So, ping me if needed.
+1 for adding this to the docs. I had to search through the issues to find this info.
duplicate of #104