
Trigger a download using multiple predicates

Open damianooldoni opened this issue 6 years ago • 19 comments

I would like to trigger a download based on a query structured as follows (just an example):

basisOfRecord in ['HUMAN_OBSERVATION', 'LITERATURE']
AND
country = BE
AND
year >= 1000
AND
year <= 2019
AND
hasCoordinate = TRUE

If I try something like this:

test_download = occurrences.download(['basisOfRecord = OBSERVATION', 
                            'basisOfRecord = LITERATURE',
                            'basisOfRecord = PRESERVED_SPECIMEN',
                            'basisOfRecord = MATERIAL_SAMPLE',
                            'basisOfRecord = UNKNOWN',
                            'basisOfRecord = HUMAN_OBSERVATION',
                            'country = BE',
                            'year >= 1000',
                            'year <= 2019',
                            'hasCoordinate = TRUE'],
                           pred_type = 'and')

I get a valid but empty occurrence.txt file because an occurrence cannot have multiple basisOfRecord values. This is clearly a query with multiple levels of predicates: an OR within the basisOfRecord values and an overall AND across all query keys.
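
To make the nesting explicit, the predicate structure I'm after would look roughly like this (written as a Python dict mirroring the GBIF download API JSON; just a sketch of the intended query, not working pygbif code):

# Sketch of the intended GBIF download predicate: an OR over the
# basisOfRecord values nested inside the overall AND.
intended_predicate = {
    "type": "and",
    "predicates": [
        {"type": "or", "predicates": [
            {"type": "equals", "key": "BASIS_OF_RECORD", "value": "HUMAN_OBSERVATION"},
            {"type": "equals", "key": "BASIS_OF_RECORD", "value": "LITERATURE"},
        ]},
        {"type": "equals", "key": "COUNTRY", "value": "BE"},
        {"type": "greaterThanOrEquals", "key": "YEAR", "value": "1000"},
        {"type": "lessThanOrEquals", "key": "YEAR", "value": "2019"},
        {"type": "equals", "key": "HAS_COORDINATE", "value": "true"},
    ],
}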

Via the rgbif R package I can do this easily. Below is an example with taxon keys and countries as vectors whose values are collapsed into comma-separated strings:

rgbif::occ_download(
  paste0("taxonKey = ", paste(taxon_keys, collapse = ",")), 
  paste0("country = ", paste(countries, collapse = ",")),
  paste0("hasCoordinate = TRUE")
)

Unfortunately, I cannot pass multiple values in this way to pygbif. I am quite new to pygbif, so I'm probably missing something. However, I didn't find any example in the documentation tackling such situations.

Python version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 18:50:55) [MSC v.1915 64 bit (AMD64)]

pygbif version:

> print(pygbif.__version__)
0.3.0

Any help is welcome. Thanks.

damianooldoni avatar Feb 15 '19 16:02 damianooldoni

thanks for your question @damianooldoni

@stijnvanhoey @peterdesmet you two I think did most of the download methods. Any thoughts on the above?

Here's the JSON body that's sent in that example you gave:

{
  "creator": "<hidden>",
  "notification_address": [
    "<hidden>"
  ],
  "send_notification": "true",
  "created": 2019,
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "equals",
        "key": "BASIS_OF_RECORD",
        "value": "OBSERVATION"
      },
      {
        "type": "equals",
        "key": "BASIS_OF_RECORD",
        "value": "LITERATURE"
      },
      {
        "type": "equals",
        "key": "BASIS_OF_RECORD",
        "value": "PRESERVED_SPECIMEN"
      },
      {
        "type": "equals",
        "key": "BASIS_OF_RECORD",
        "value": "MATERIAL_SAMPLE"
      },
      {
        "type": "equals",
        "key": "BASIS_OF_RECORD",
        "value": "UNKNOWN"
      },
      {
        "type": "equals",
        "key": "BASIS_OF_RECORD",
        "value": "HUMAN_OBSERVATION"
      },
      {
        "type": "equals",
        "key": "COUNTRY",
        "value": "BE"
      },
      {
        "type": "greaterThanOrEquals",
        "key": "YEAR",
        "value": "1000"
      },
      {
        "type": "lessThanOrEquals",
        "key": "YEAR",
        "value": "2019"
      },
      {
        "type": "equals",
        "key": "HAS_COORDINATE",
        "value": "TRUE"
      }
    ]
  }
}

does that look as expected?

sckott avatar Feb 15 '19 20:02 sckott

any thoughts @stijnvanhoey @peterdesmet ?

sckott avatar Feb 22 '19 22:02 sckott

The current GbifDownload class provides the different building blocks required to handle this case by using the object-oriented approach instead of the occurrences.download() shortcut function:

from pygbif.occurrences.download import GbifDownload

# initiate the download class
gbif_query = GbifDownload('xxxxxxxx', 'xxxxxxxxx') # user name and email

# setup the query
gbif_query.add_predicate('COUNTRY', 'BE', predicate_type='equals')
gbif_query.add_predicate('YEAR', 1000, predicate_type='>=')
gbif_query.add_predicate('YEAR', 2019, predicate_type='<=')
gbif_query.add_predicate('hasCoordinate', True, predicate_type='equals')
# add the multiple values predicate:
gbif_query.add_iterative_predicate('basisOfRecord',
                                   ['LITERATURE', 'OBSERVATION', 'PRESERVED_SPECIMEN',
                                    'MATERIAL_SAMPLE', 'UNKNOWN', 'HUMAN_OBSERVATION'])

# post the download request (user name and password)
gbif_query.post_download('xxxxxxxxx', 'xxxxxx')

So, it is more a matter of missing documentation...

stijnvanhoey avatar Feb 23 '19 12:02 stijnvanhoey

the gbif_query.predicates attribute shows the predicates as set up:

> gbif_query.predicates

[{'key': 'COUNTRY', 'type': 'equals', 'value': 'BE'},
 {'key': 'YEAR', 'type': 'greaterThanOrEquals', 'value': 1000},
 {'key': 'YEAR', 'type': 'lessThanOrEquals', 'value': 2019},
 {'key': 'hasCoordinate', 'type': 'equals', 'value': True},
 {'predicates': [
   {'key': 'basisOfRecord', 'type': 'equals', 'value': 'HUMAN_OBSERVATION'},
   {'key': 'basisOfRecord', 'type': 'equals', 'value': 'UNKNOWN'},
   {'key': 'basisOfRecord', 'type': 'equals', 'value': 'MATERIAL_SAMPLE'},
   {'key': 'basisOfRecord', 'type': 'equals', 'value': 'PRESERVED_SPECIMEN'},
   {'key': 'basisOfRecord', 'type': 'equals', 'value': 'OBSERVATION'},
   {'key': 'basisOfRecord', 'type': 'equals', 'value': 'LITERATURE'}],
  'type': 'or'}]

and these are combined with the gbif_query.main_pred_type (default: and) into the gbif_query.payload:

{'created': 2019,
 'creator': <hidden>,
 'notification_address': <hidden>,
 'send_notification': 'true',
 'predicate': {'predicates': [
   {'key': 'COUNTRY', 'type': 'equals', 'value': 'BE'},
   {'key': 'YEAR', 'type': 'greaterThanOrEquals', 'value': 1000},
   {'key': 'YEAR', 'type': 'lessThanOrEquals', 'value': 2019},
   {'key': 'hasCoordinate', 'type': 'equals', 'value': 'TRUE'},
   {'predicates': [
     {'key': 'basisOfRecord', 'type': 'equals', 'value': 'HUMAN_OBSERVATION'},
     {'key': 'basisOfRecord', 'type': 'equals', 'value': 'UNKNOWN'},
     {'key': 'basisOfRecord', 'type': 'equals', 'value': 'MATERIAL_SAMPLE'},
     {'key': 'basisOfRecord', 'type': 'equals', 'value': 'PRESERVED_SPECIMEN'},
     {'key': 'basisOfRecord', 'type': 'equals', 'value': 'OBSERVATION'},
     {'key': 'basisOfRecord', 'type': 'equals', 'value': 'LITERATURE'}],
    'type': 'or'}],
  'type': 'and'}
}
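
For completeness, this payload is what gets POSTed to the GBIF occurrence download API. A minimal sketch of sending such a body directly with requests (assuming a valid GBIF account; the endpoint and basic auth are the standard GBIF ones, this is not pygbif code):

import requests

# Sketch only: post a download request body straight to the GBIF API.
# 'payload' is a dict shaped like the one above (creator, notification_address,
# send_notification, predicate); replace "user"/"pwd" with GBIF credentials.
resp = requests.post(
    "https://api.gbif.org/v1/occurrence/download/request",
    json=payload,
    auth=("user", "pwd"),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)  # on success the response body is the download key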

@damianooldoni can you check whether this is correct and similar to the rgbif request?

stijnvanhoey avatar Feb 23 '19 12:02 stijnvanhoey

@damianooldoni any time to take a look at the question from @stijnvanhoey? if not, i'll take a look

sckott avatar Aug 06 '19 18:08 sckott

yes, @sckott. Actually, it totally slipped my mind. Indeed, the query posted by @stijnvanhoey is similar to the query sent via rgbif. Here's an example in R:

countries <- c("BE", "NL")
basis_of_record <- c("HUMAN_OBSERVATION", "LITERATURE")
year_begin <- 1990
year_end <- 1991
rgbif::occ_download(
  paste0("basisOfRecord = ", paste(basis_of_record, collapse = ",")), 
  paste0("country = ", paste(countries, collapse = ",")),
  paste0("hasCoordinate = TRUE"),
  paste0("year >= ", year_begin),
  paste0("year <= ", year_end)
)

which results in the following API query:

{
  "type": "and",
  "predicates": [
    {"type": "or", "predicates": [
        {"type": "equals", "key": "BASIS_OF_RECORD", "value": "HUMAN_OBSERVATION"},
        {"type": "equals", "key": "BASIS_OF_RECORD", "value": "LITERATURE"}
      ]},
    {"type": "or", "predicates": [
        {"type": "equals", "key": "COUNTRY", "value": "BE"},
        {"type": "equals", "key": "COUNTRY", "value": "NL"}
      ]},
    {"type": "equals", "key": "HAS_COORDINATE", "value": "TRUE"},
    {"type": "greaterThanOrEquals", "key": "YEAR", "value": "1990"},
    {"type": "lessThanOrEquals", "key": "YEAR", "value": "1991"}]
}

This has the same structure as the query posted by @stijnvanhoey: only the field order changes (type-key-value vs key-type-value), which of course doesn't change anything.

I will double check the solution provided by @stijnvanhoey and if it works this issue can be closed.

damianooldoni avatar Aug 07 '19 15:08 damianooldoni

As the result is the same, we should improve the pygbif documentation to make sure this use case is explained to other users as well. Or we could improve the documentation by explaining the object-oriented way of using pygbif more generally?

stijnvanhoey avatar Aug 07 '19 15:08 stijnvanhoey

+1 to improving docs/adding examples

sckott avatar Aug 07 '19 18:08 sckott

I'll test it again to be completely sure. Yes, the documentation should be improved as well. I can give it a try.

damianooldoni avatar Aug 08 '19 07:08 damianooldoni

I found that this doesn't work:

gbif_query = GbifDownload(xxxxxxx, xxxxxxxxx) # user name and pwd
gbif_query.add_iterative_predicate('basisOfRecord', ['LITERATURE', 'HUMAN_OBSERVATION'])
gbif_query.add_iterative_predicate('taxonKey', [1898286, 1894840])
gbif_query.add_predicate('hasCoordinate', 'TRUE', predicate_type='equals')
gbif_query.post_download(xxxxxxx, xxxxxxxxx) # user name and pwd

while this works:

gbif_query.add_iterative_predicate('BASIS_OF_RECORD', ['LITERATURE', 'HUMAN_OBSERVATION'])
gbif_query.add_iterative_predicate('TAXON_KEY', [1898286, 1894840])
gbif_query.add_predicate('HAS_COORDINATE', 'TRUE', predicate_type='equals')
gbif_query.post_download(xxxxxxx, xxxxxxxxx) # user name and pwd

This means that the parameters of the shortcut function occurrences.download() are the user-facing camelCase ones (the same as rgbif's), while we have to use the "raw" API versions of them if we want to build queries with .add_predicate() and .add_iterative_predicate().

This has to be documented as well or, even better I think, changed. Converting keys automatically (as in occurrences.download()) would allow the user to keep the same key style (e.g. hasCoordinate vs HAS_COORDINATE) while writing complex queries. @stijnvanhoey, @sckott: what do you think?
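
Just to illustrate what I mean, the conversion could be as simple as something like this (key_to_api is a hypothetical helper, not part of pygbif; it assumes the camelCase names map one-to-one onto the API's uppercase underscore style):

import re

def key_to_api(key):
    # Hypothetical helper: 'basisOfRecord' -> 'BASIS_OF_RECORD',
    # 'hasCoordinate' -> 'HAS_COORDINATE'; keys already in API style
    # (all uppercase) are returned unchanged.
    if key.isupper():
        return key
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).upper()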

damianooldoni avatar Aug 08 '19 08:08 damianooldoni

converting for the user makes sense, what do you think @stijnvanhoey ?

sckott avatar Aug 08 '19 17:08 sckott

@stijnvanhoey ?

sckott avatar Aug 30 '19 17:08 sckott

I'm sorry for the late reply. I agree that using the Darwin Core terms makes much more sense for the user. I would refactor the input handling before updating the documentation.

stijnvanhoey avatar Aug 31 '19 10:08 stijnvanhoey

thanks @stijnvanhoey - agree we should refactor. Does one of you have time for this? or should I put it on my to do list?

sckott avatar Sep 13 '19 19:09 sckott

I won't be able to do it in the coming weeks, so it will rather be November before I can contribute to this. Currently too busy with the remake of the pandas documentation ;-)

stijnvanhoey avatar Sep 16 '19 06:09 stijnvanhoey

ok, thanks @stijnvanhoey - pandas docs sound fun and important.

I'll probably take a crack at it, but will make sure you two have a look at it

sckott avatar Sep 16 '19 16:09 sckott

Thanks @sckott. Just back from two weeks of holidays and I don't see time to do it either. Still, I'm available for review, so ping me if needed.

damianooldoni avatar Sep 16 '19 21:09 damianooldoni

+1 for adding this to the docs. I had to search through the issues to find this info.

glaroc avatar Mar 02 '22 15:03 glaroc

duplicate of #104

CecSve avatar Feb 17 '23 13:02 CecSve