pyvo icon indicating copy to clipboard operation
pyvo copied to clipboard

iter_datalinks should be smarter about datalink products

Open msdemlei opened this issue 1 year ago • 2 comments

pyvo/dal/adhoc.py (we ought to have a plan to improve that naming, btw) in iter_datalinks has the following code:

  elif row.access_format == DATALINK_MIME_TYPE:
       yield DatalinkResults.from_result_url(row.getdataurl())

This is for the case of data products that in obscore (or similar protocols) are distributed as datalink documents. There is two types of trouble with this:

(1) it does not work in some interesting scenarios because it is trying attribute access, and the attribute is only defined in "typed" records (obscore, siap2) rather than generic TAP results.

(2) it will confuse error messages whenever iter_datalinks is called on something pyvo cannot find datalinks in.

Consider this code:

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
results = svc.run_sync(
 "select top 5 access_url, 'application/x-votable+xml' as access_format"
 " from dasch.plates")
for dl in results.iter_datalinks():
    print(dl.id)

If you look, dasch.plates access_urls really point at datalinks, so this is a situation the access_format code was designed to handle. But if you run the script, you will see:

Traceback (most recent call last):
  File "/home/msdemlei/zw.py", line 7, in <module>
    for dl in results.iter_datalinks():
  File "/home/msdemlei/gavo/src/pyvo/pyvo/dal/adhoc.py", line 222, in iter_datalinks
    elif row.access_format == DATALINK_MIME_TYPE:
         ^^^^^^^^^^^^^^^^^
AttributeError: 'TAPRecord' object has no attribute 'access_format'

– and that is exactly the message you will also get on tables that have nothing to do with datalink at all, which seems rather misleading to me.

So... I'd suggest the following:

(a) only access access_format when the record actually has that attribute

(b) perhaps support utype- or name-based discovery of the access_format, too (name-based this would make my example work but is deeply in the realm of ad-hoc hacks; utype-based would be a lot cleaner and perhaps good enough. Data providers would have to add constant-valued access_format columns then, but that's not very painful).

(c) produce an error message "No datalinks discernible in this table" or something to that effect when we cannot locate an access format column (and neither find datalinks through any of the more reputable means of attaching them).

Opinions? Should I go ahead with that?

msdemlei avatar Apr 29 '24 12:04 msdemlei

We (Data Central) use the standard obscore names and utypes for SSA responses, are there any other values that make sense other than those?

aragilar avatar Jul 01 '24 09:07 aragilar

On Mon, Jul 01, 2024 at 02:48:57AM -0700, James Tocknell wrote:

We (Data Central) use the standard obscore names and utypes for SSA responses, are there any other values that make sense other than those?

Obscore names are fine, as SSA is not concerned about FIELD names (one of the larger defects of its design, if you ask me). Obscore utypes may not be, because SSA defines their own, prefixed with ssa:. Other than the prefix (which clients probably ignore, so you may get way with keeping it obscore:), these should match whatever is in obscore where there are overlaps, but I suspect they don't always.

But then: If SSA validators (I mostly use https://voparis-validation-reports.obspm.fr/) are happy, I would expect no real SSA clients will have trouble with whatever you do.

msdemlei avatar Jul 01 '24 11:07 msdemlei