pyvo icon indicating copy to clipboard operation
pyvo copied to clipboard

Add global discovery

Open msdemlei opened this issue 1 year ago • 30 comments

This PR should eventually add a facility for global dataset discovery to pyVO. Specifically, people should be able to say "give me all images in the VO showing M42 around 42nm on MJD 54125."

Realistically, we ought to allow intervals on input and probably multiple positions (though we should refuse SIA1 for multiple objects).

At the moment, none of this will work; I'm missing two very basic building blocks that I won't have time to locate before I'm going on holiday, both related to the way pyvo treats query results:

  • I need to remove rows from registry query results. That's nasty because RegistryResults reads from the astropy.VOTable instance, and that doesn't support dropping rows so far for all I can see.
  • I need to turn whatever comes back from SIA1 and TAP services to sia2.ObsCoreRecord-s. Again, as implemented so far, this wants a VOTable to pick stuff from.

If there is no better solution for both, we probably need to turn things into astropy tables, manipulate them, and then turn them back to VOTables. It'd be great if there were a less convoluted way to do this. Is there?

Even if that's there, this won't be ready for prime time; we will at least need defined timeouts on all services involved, because there's always some services in the VO hanging clients for almost arbitrary times.

Once we have the image case, a similar modules for spectrum discovery (in SSA and Obscore for now) should be simple.

We probably should talk to the IVOA to make throwing out capabilities leading to the same data possible/more robust; I think we ought to suggest that SIAv2 services over data that's also there in Obscore services should have an IsDerivedFrom set to the respective TAP service (or Obscore resource).

So, there's a metric ton of TODOs – I'm grateful for commits fixing any of them.

Ah – and before you ask: Testing code like this that talks to services all over the internet is of course a particular challenge; I think I have a plan, so don't worry about that particular issue too much at this point.

msdemlei avatar Aug 07 '23 15:08 msdemlei

Codecov Report

Attention: Patch coverage is 37.90323% with 77 lines in your changes are missing coverage. Please review.

Project coverage is 79.13%. Comparing base (befc7bc) to head (bb716b4). Report is 73 commits behind head on main.

:exclamation: Current head bb716b4 differs from pull request most recent head 5acf4b0. Consider uploading reports for the commit 5acf4b0 to get more accurate results

Files Patch % Lines
pyvo/discover/image.py 23.00% 77 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #470      +/-   ##
==========================================
- Coverage   81.24%   79.13%   -2.11%     
==========================================
  Files          69       54      -15     
  Lines        7129     6155     -974     
==========================================
- Hits         5792     4871     -921     
+ Misses       1337     1284      -53     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Aug 07 '23 15:08 codecov[bot]

So... I think this could now be worth a look. See also my blog post on this: https://blog.g-vo.org/global-dataset-discovery-in-pyvo.html.

The CI failures look interesting – I half fear they are related to implementation details changing the request hashes that utils.testing generates. Perhaps we'll have to properly scaffold these things after all? Anyway, I'll look at this some other day; it shouldn't impact discovery as such.

Meanwhile, on the usual codestyle failures I have to say I'm fairly unhappy that I'll now have to say foo / 2 rather than foo/2 – does that really help someone? And do people here in pyVO actually prefer the former?

If we have to have these checks in place: Can someone recommend a formatter/pretty-printer that I can run halfway blindly on the sources? Frankly, I don't think I'd like to get used to typing foo / 2 or 1 + 2, so if this kind of thing can happen in a pre-commit hook, that'd make my life easier...

msdemlei avatar Feb 26 '24 12:02 msdemlei

f we have to have these checks in place: Can someone recommend a formatter/pretty-printer that I can run halfway blindly on the sources? Frankly, I don't think I'd like to get used to typing foo / 2 or 1 + 2, so if this kind of thing can happen in a pre-commit hook, that'd make my life easier...

even old basic tools could fix this, autopep8, or I have flymake set up on my emacs to check on the codestyle (and it should pick up the setting from the repo)

bsipocz avatar Mar 06 '24 16:03 bsipocz

So... I think modulo review points and perhaps a few style matters, this could be merged, with a caveat that the API is probably not stable yet. It does something interesting, and although data providers still can do quite a bit to make this work better, it's a lot more likely they will do that if there is merged code in pyVO.

From my perspective, the main problem is testing. Since there is so much network activity going on here, my plan has been to ship frozen network answers as part of the source (see utils.testing for the general scheme).

It turns out that that is nowhere near as simple as I had hoped. The test failures with old versions of python are almost certainly due to some subtle difference in the serialisation of query parameters; they might be fixable by normalising these a bit more aggressively, but perhaps they are not; I'd have to pull up systems with these old versions to look.

But: Is this the right way to go about this? Perhaps the instrumentation should sit in TAPService, where the parameters are much less dependent on versions than where requests.requests has written its payload? Or should we, especially after the xz disaster, try harder to avoid what looks like binary blobs in the VCS in the first place? But then how do we do the scaffolding for these tests without going insane?

Thoughts and/or ideas?

msdemlei avatar Apr 15 '24 11:04 msdemlei

Hi Markus, With which python version do the tests pass without access to remote data? Maybe there was a change in the way hashs are calculated that we could identify in python's changelogs?

ManonMarchand avatar Apr 16 '24 13:04 ManonMarchand

On Tue, Apr 16, 2024 at 06:46:34AM -0700, Manon Marchand wrote:

With which python version do the tests pass without access to remote data? Maybe there was a change in the way hashs are calculated that we could identify in python changelogs?

I'm rather sure it's not related to hash computations as such; these are fairly stable.

In the current checks, things already fail for python 3.9. But good thing you ask, because it made me realise that for python 3.8, things already fail because there is no functools.cache there. At least that I can fix. Perhaps we can learn something from the emerging pattern.

msdemlei avatar Apr 16 '24 14:04 msdemlei

Ah well... failures up and down. I'll dig a bit deeper tomorrow, but this does look as if we're going to need a giant scaffold. Oh dang.

msdemlei avatar Apr 16 '24 14:04 msdemlei

Oh yikes. I've tried this on two different Debian relases (so, I'm not reproducing github's environment). The first instance for differing hashes is because astropy includes something like "Produced with astropy.io.votable version 5.2.1" in its VOTables (which get uploaded in the course of the tests). Updating the cached files depending on the underlying astropy release obviously is not an option.

I think the testing plan with intercepting and caching https requests and responses and recovering the responses based on hashes of the requests is fundamentally broken for something as complex as the global discovery.

I suppose we'll have to mock at a higher level. If someone has good ideas how to go about this: I'd be most grateful.

msdemlei avatar Apr 18 '24 06:04 msdemlei

I suppose we'll have to mock at a higher level. If someone has good ideas how to go about this: I'd be most grateful.

I don't have any suggestions without diving into this properly, but will try to find time to do so next week.

bsipocz avatar Apr 18 '24 16:04 bsipocz

After some deliberation I have decided that doing local tests properly for many of the features is really hard (and would hide a lot of interesting bugs). Hence, I have, with a solid helping of regret, made the more interesting tests remote. They're still relatively quick.

So... Volunteers for reviewing this? It'd be lovely to have it merged into main for improved visibility, even though there will be quite a bit of refactoring once we also do global spectrum discovery.

The codecov analysis is way off (I suspect they don't do remote tests). In my local pytest --cov, I only have 14 uncovered lines, and most of these are odd-ish (well, the obscore vs. tap-obscore thing should eventually be covered, but that's not easy and I hope this code will be obsolete rather soon anyway).

The doctest failures... I think that's a separate matter independent of this PR. We really should think about dropping a few of them or at least figure out why the doctests take so long in the first place.

Documentation building has some fairly odd messages, some of which may be introduced with the PR. I'd appreciate if someone with more automodapi-fu than me could have a look at that and figure out how to silence/fix these diagnostics.

msdemlei avatar May 15 '24 06:05 msdemlei

So... Volunteers for reviewing this? It'd be lovely to have it merged into main for improved visibility, even though there will be quite a bit of refactoring once we also do global spectrum discovery.

I know that there are some disagreements about this approach, maybe we should clear that up first next week, then I can put this on my list to review.

The codecov analysis is way off (I suspect they don't do remote tests). In my local pytest --cov, I only have 14 uncovered lines, and most of these are odd-ish (well, the obscore vs. tap-obscore thing should eventually be covered, but that's not easy and I hope this code will be obsolete rather soon anyway).

I'll check this once we do the code review, but recently there were some codecov issues elsewhere, too, so it may very well be a red herring.

The doctest failures... I think that's a separate matter independent of this PR. We really should think about dropping a few of them or at least figure out why the doctests take so long in the first place.

separate, I'm opening the fix for it.

Documentation building has some fairly odd messages, some of which may be introduced with the PR. I'd appreciate if someone with more automodapi-fu than me could have a look at that and figure out how to silence/fix these diagnostics.

I volunteer for fixing these, once we go ahead with the code review.

bsipocz avatar May 16 '24 03:05 bsipocz

@trjaffe - Could you, and/or someone from the registry group, have a look at this? I recall there were concerns, but the details (affects on this PR) are not super clear to us maintainers, so it would be great to get some feedback before we jump into the code review ourselves.

bsipocz avatar Jun 14 '24 23:06 bsipocz

On Fri, Jun 14, 2024 at 03:53:12PM -0700, Brigitta Sipőcz wrote:

Some things seems to be unrelated but useful bugfixes. I wonder if they can be done separately, thus we could include them in a release quicker?

They could, but to be very frank I would much rather see the whole thing merged for 1.6 so we have at least something in terms of global dataset discovery in there. This has been lingering far too long (and that after I've procrastinated starting it for much longer even).

If there are serious concerns with the main discovery code that I cannot address quickly, I'd be ok with pulling out the bugfix parts; but they are close enough to the issue of global dataset discovery that I don't think there's a strong logical reason to pull them out.

msdemlei avatar Jun 19 '24 11:06 msdemlei

The NAVO Python working group has been looking at PR 470 and have a few questions. (TJ is just speaking for the team.)

We hope to see it merged soon, though we think it needs a few tweaks like marking the module as a prototype first. A few of us are looking at it, but I also have these questions for you:

  1. Can you expand on how to use the methods that make up discovery.images_globally()? I mean specifically when the user wants to do some kind of detailed registry search and then hand the resulting list of services. I tried the code shown below and it gives me a TypeError, so I probably have something simple wrong.
import pyvo
from astropy import units as u
from astropy import time
from pyvo import discover 
services = pyvo.registry.search(keywords=["heasarc rass"]) 
im_discoverer = discover.image.ImageDiscoverer(space=(274.6880, -13.7920, 0.1))
im_discoverer.set_services(services) 
results, log = im_discoverer.query_services()
  1. The example given in the docs, the very first code cell on https://docs.g-vo.org/temp-pyvo-global-discovery/discover/index.html, returns no results for me. Is this expected? It runs and the log I get says:
['Skipping ivo://org.gavo.dc/__system__/siap2/sitewide because it is served by ivo://org.gavo.dc/__system__/obscore/obscore',
 'Obscore GAVO Data Center Obscore Table: 0 records']

We should make sure the examples return something useful.

  1. Just to confirm, I think the other issue I ran into is something you mentioned in your blog post that is currently unsolved: doing a registry search, making some selection among the records, and then handing that edited list to the discoverer. I also show what I want to be able to do in the example code below. Let me know if you know of a way to do this. It’s part of a broader awkwardness handling these PyVO objects that are like lists but perhaps not enough like lists. This probably needs a separate issue and PR to address.
records = []
for s in services:
  # Doesn't matter what this is, just make a selection of a subset of services:  
  if "rosat" in s.res_description.lower() and "rass" in s.res_description.lower():
    records.append(s)
# How to create a RegistryResults object from a list of RegistryResource objects? 
selected_services = pyvo.registry.regtap.RegistryResults(records)
im_discoverer.set_services(selected_services)

trjaffe avatar Jul 16 '24 13:07 trjaffe

On Tue, Jul 16, 2024 at 06:48:33AM -0700, trjaffe wrote:

We hope to see it merged soon, though we think it needs a few tweaks like marking the module as a prototype first. A few of us are looking at it, but I

Hm... are you talking about pyvo.utils.prototype? I have to admit I'm not so crazy about that, for one, because this is not what prototype was written for ("PyVO may implement features that are not yet part of an approved standard"; this PR by and large only assembles various approved standards (well, except for the registered obscore tables, but that's just a minor detail). And I'm not too wild about having to tell users "Before you use this, say:

from pyvo.utils.prototype import activate_features activate_features("global discovery")"

or so; I give you it's nice to tell people they are using an experimental API that may well change without advance notice, but I doubt that it helps doing this by telling them to add some boilerplate code into their programs. I won't quarrel, though, I just wanted to make sure that's what you're asking for before putting it in.

  1. Can you expand on how to use the methods that make up discovery.images_globally()? I mean specifically when the user wants to do some kind of detailed registry search and then hand the resulting list of services. I tried the code shown below and it gives me a TypeError, so I probably have something simple wrong.
im_discoverer.set_services(services)
results, log = im_discoverer.query_services()

This fails in the last line because the method has no result, and I feel it should not have one, but that's mainly because I have a semi-religious conviction that ideal python functions either have side effects (like query_services) or return something (like images_globally) and otherwise don't change the program state.

I will happily admit that's not a strong excuse for a perhaps non-intuitive interface. If you think query_services should have the same return value as images_globally, I don't see a practical problem with that.

Anyway, with the current code the right way to view the results when actually constructing a Discoverer manually at thie point is

print(im_discoverer.log_messages) print(im_discoverer.results)

I have added a working example in the documentation.

  1. The example given in the docs, the very first code cell on https://docs.g-vo.org/temp-pyvo-global-discovery/discover/index.html, returns no results for me. Is this expected? It runs and the log I

No... This certainly should return something, but it also shouldn't hit too many services or return too many datasets given that, presumably, it will be run quite a number of times. Optimising this clearly led me astray (and 500 nm isn't a spectral constraint that will suggest few matching resources, either). I have updated the example.

  1. Just to confirm, I think the other issue I ran into is something you mentioned in your blog post that is currently unsolved: doing a registry search, making some selection among the records, and then handing that edited list to the discoverer. I also show what I want to be able to do in the example code below. Let me know if you know of a way to do this. It’s part of a broader awkwardness handling these PyVO objects that are like lists but perhaps not enough like lists. This probably needs a separate issue and PR to address.
records = []
for s in services:
  # Doesn't matter what this is, just make a selection of a subset of services:
  if "rosat" in s.res_description.lower() and "rass" in s.res_description.lower():
    records.append(s)
# How to create a RegistryResults object from a list of RegistryResource objects?
selected_services = pyvo.registry.regtap.RegistryResults(records)
im_discoverer.set_services(selected_services)

Yes, that's a more general problem with pyvo.dal.query.DALResults and derived classes: they expect to be constructed with VOTables, and given they need or want all the metadata, I don't think there is much we can do about that.

However, "random collection of {result records from SIA|SSA|a registry query}" or something like that does sound like a good type to have, one that I could use well in the global discovery, too. However, that's probably not going to work at the level of DALQuery, as TAPResults also derives from it, and I don't think "random collection of rows fetched via TAP" is a meaningful data structure. So, yes, I'd like to keep that question out of this PR.

Thanks for your review and comments!

msdemlei avatar Jul 19 '24 09:07 msdemlei

  print(im_discoverer.log_messages)
  print(im_discoverer.results)

Aha, now I see how you're doing it. This is fine, I just didn't realize what you meant. I like things spelled out, as you've done now.

In re the prototype, yes, that was the suggestion, though not because the standards themselves are not approved yet. It's more that this is a complicated workflow and we want to spend some time testing it and see how it goes reworking some of our tutorials based on it before it becomes 'official'. And realistically speaking, that will take us some more time.

trjaffe avatar Jul 19 '24 16:07 trjaffe

On Fri, Jul 19, 2024 at 09:26:20AM -0700, trjaffe wrote:

In re the prototype, yes, that was the suggestion, though not because the standards themselves are not approved yet. It's more that this is a complicated workflow and we want to spend some time testing it and see how it goes reworking some of our tutorials based on it before it becomes 'official'. And realistically speaking, that will take us some more time.

Hm. I have tried for another 10 minutes to like pyvo.utils.prototype and didn't manage. Frankly, I probably don't even like it for what it was designed for (enabling code relying on remote features that may or may not be present), but that may be because it solve a problem I have never had.

But for what we have here, just telling users "if you use this, be aware that you may have to adapt your program later because we're still designing the API", I don't think it works at all: when people type

pyvo.utils.activate_features("global-discovery")

because they're seeing it in some example, they will fairly certainly not suspect that means `when you type "from pyvo import discover", you're using an unstable API'. When they type it because they read the documentation and know what it means, they've already read that the API may be unstable and don't need to spoil their code with function calls that will appear fairly cryptic later.

So... I have fairly bad aches when contemplating the use prototype here. On my pain level scale https://blog.g-vo.org/building-consensus.html#scale, I'd give it an 7 going towards 8.

Hm: perhaps I can accomodate your concerns if I just emit a warning ("Preprecation", if you will -- I'm surprised that apparently almost no one else has come up with that term yet) to the effect that this API is unstable on importing discover? At least for now, "import pyvo" doesn't pull in discover, and so you would exactly hit the people trying to use the new API, and you could exactly say why you're warning.

Deal?

msdemlei avatar Jul 22 '24 14:07 msdemlei

Hm: perhaps I can accomodate your concerns if I just emit a warning ("Preprecation", if you will -- I'm surprised that apparently almost no one else has come up with that term yet) to the effect that this API is unstable on importing discover? At least for now, "import pyvo" doesn't pull in discover, and so you would exactly hit the people trying to use the new API, and you could exactly say why you're warning.

Deal?

I'm inclined to go along with this. I don't have the same issues with the prototype construct, and would be comfortable using the prototype feature in this circumstance, but I do think the key issue is to somehow make a user aware that this API may evolve rather quickly without necessarily undergoing normal deprecation cycles.

Indeed I think it is important to get this merged to get more people using it and contributing feedback. My gut feeling is that there will be aspects that work well and those that don't and the more usage we can encourage the more improvement can be made. Moving forward I hope we can dramatically reduce the amount of time and overhead in getting changes merged as that will both facilitate and encourage faster evolution of code base.

@msdemlei would you be able to take a pass at resolving the conflicts, fixing the tests, and introducing a warning message of your choice either in the code or documentation? Let us know if we can help with the mechanics of any of that. Then we can do a last Python-focused review and get this merged.

tomdonaldson avatar Aug 02 '24 17:08 tomdonaldson

On Fri, Aug 02, 2024 at 10:41:52AM -0700, Tom wrote:

I'm inclined to go along with this. I don't have the same issues with the prototype construct, and would be comfortable using the prototype feature in this circumstance, but I do think the key issue is to somehow make a user aware that this API may evolve rather quickly without necessarily undergoing normal deprecation cycles.

Aw, if only it were so simple.

The trouble is that if I raise the warning in discover's init.py (which I think makes the most sense and has the advantage of being dead simple), pytest's run fails during test collection, like this:

collected 413 items / 2 errors

==================================== ERRORS ==================================== __________________ ERROR collecting pyvo/discover/init.py __________________ pyvo/discover/init.py:10: in warnings.warn("pyvo.discover's API is still under design in pyVO 1.6 and" E pyvo.utils.prototype.PrototypeWarning: pyvo.discover's API is still under design in pyVO 1.6 and may change without prior notice. Feedback to the authors is most welcome. ___________________ ERROR collecting pyvo/discover/image.py ____________________ [about the same thing] =========================== short test summary info ============================ ERROR pyvo/discover/init.py - pyvo.utils.prototype.PrototypeWarning: pyvo.discover's API is still under d... ERROR pyvo/discover/image.py - pyvo.utils.prototype.PrototypeWarning: pyvo.discover's API is still under d... !!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!! ============================== 2 errors in 1.48s ===============================

I guess the solution would be to configure pytest to ignore PrototypeWarning in pyvo.discover; I'm already doing that in python code for text execution, but, again, the problem here is test collection, and there, the warnings are raised before I get to execute any pytest-specific code (or so I think).

However, at that point my dislike for thick tooling stroke again, as I couldn't find a pytest.ini in my checkout and now suspect that something (tox?) builds it on the fly. So... can anyone give me a hint where the pytest configuration would go?

Meanwhile: Is everyone happy with the warning message's text?

msdemlei avatar Aug 08 '24 08:08 msdemlei

Aw, if only it were so simple.

I'm happy to make the warning work with the CI system if that is the decision we go ahead with instead of the prototype route.

bsipocz avatar Aug 08 '24 17:08 bsipocz

(fyi: we don't use pytest.ini at all, all the pyetst configs are in setup.cfg, but again, to rehash, I'm happy to make the warning in a way that works with our testing etc, once it's clear what type of warning/messaging is decided upon).

bsipocz avatar Aug 08 '24 17:08 bsipocz

I'm very comfortable with the proposed text of the warning (pyvo.discover's API is still under design in pyVO 1.6 and may change without prior notice. Feedback to the authors is most welcome.).

I'm not familiar enough with the possibilities to weigh in on the best place to trigger the warning, but would be happy with anything @bsipocz proposes.

tomdonaldson avatar Aug 08 '24 19:08 tomdonaldson

I haven't thought about this before, but another alternative to the pesky warning is to do a pre-release in pypi for this feature. It's easy to install for whomever wants to try it. The two main consequences with this is that 1. this should be limited in time to a few months up to a year and 2. users need to be informed to try the feature through other means - they will not just stumble on it. That being said, I'm fine with the warning too.

andamian avatar Aug 08 '24 19:08 andamian

Re pypi pre-release

I expect this to be released much sooner, and the API to be still open to changes after 1.6 is out. (But if anyone is expected to run into installation issues doing a pip install from github, then we can certainly push a pre-release to pypi).

bsipocz avatar Aug 08 '24 21:08 bsipocz

The trouble is that if I raise the warning in discover's init.py (which I think makes the most sense and has the advantage of being dead simple), pytest's run fails during test collection

Could explicitly request pytest to check for the warning work here? I'm thinking something like https://docs.pytest.org/en/7.1.x/how-to/assert.html#assertions-about-expected-exceptions

import pytest


def test_zero_division():
    with pytest.raises(ZeroDivisionError):
        1 / 0

but instead, check for the warning you're raising here. If the warning is coming up on import, you might need to wrap the import in the with pytest.raises call.

(PS Hi! I'm a dev working under TJ; I'm doing a quick deep dive to see if I can help with the table dropping/appending issue)

duytnguyendtn avatar Aug 29 '24 15:08 duytnguyendtn

@duytnguyendtn - I'm dealing with the warning, just caught up in the middle of sorting out conflicts and new test failures before pushing back changes here.

bsipocz avatar Aug 29 '24 15:08 bsipocz

Hi @msdemlei!

I need to remove rows from registry query results. That's nasty because RegistryResults reads from the astropy.VOTable instance, and that doesn't support dropping rows so far for all I can see.

I was playing around with the DALResultsTable and I think I might have found a proof of concept that shows deletion is at least possible:

image

Right now, it's quite crude, but after a little of refining and some convenience methods, I think it could allow what you're asking for. Let me know your thoughts! Also @bsipocz your opinion would be appreciated here too, before I get too deep into implementation!

duytnguyendtn avatar Aug 29 '24 20:08 duytnguyendtn

On Thu, Aug 29, 2024 at 01:41:37PM -0700, Duy Tuong Nguyen wrote:

Right now, it's quite crude, but after a little of refining and some convenience methods, I think it could allow what you're asking for. Let me know your thoughts! Also @bsipocz your opinion would be appreciated here too, before I get too deep into implementation!

Well... for the present PR I don't think it would help a lot; I'm quite happy to wrap the services up in Queriable-s, and these wouldn't fit into RegistryResults anyway.

The more general thing, letting people delete from DALResults... Well, I'm sure there are situations when people would be grateful if they were able to do that. Which means I'm mildly in favour of investigating what it would take. On the other hand, mutability is a curse, and we'd have to have a close look at the various classes inheriting from DALResults; if any of them kept indexes, that would be a severe blow to the concept of mutating .array from underneath the VOTable.

Also: Do we know the runtime properties of get_first_table().array? If this does anything non-trivial, then we definitely should not implement things in this way; once we offer deletion, people will, I think, expect O(1) runtime in the typical case (an occasional O(n) to pack things is ok, but you don't want an individual deletion to take measurable time).

In short: I'd step carefully :-)

msdemlei avatar Sep 04 '24 13:09 msdemlei

Thanks @msdemlei!

I wouldn't look too closely at the implementation, as that was just a proof a concept; it was mainly a shortcut to "get a table". I was definitely more interested in the philosophy of what effects enabling deletion would have downstream. I proposed a PR to astropy which implements deletion directly at the TableElement level, so a user would need to specifically select which table they are deleting from. This operates directly on the array itself, so the largest operation would be making a copy of the new array, an unfortunate side-effect of the immutability of numpy ndarrays.

As for the downstream effects, I went ahead and PR'ed to astropy because I think VOTables ought to have this capability in the generic case, and this would also force downstream packages to explicitly declare whether their tables are meant to mutable or not. In those cases, they can override the method to disable the user's ability to delete from the TableElement. I think this exercise of explicitly declaring mutability or immutability for downstream classes would ultimately be healthier for the ecosystem, rather than packages relying silently on the lack of deletion due to lack of implementation.

I'd appreciate a lookover in https://github.com/astropy/astropy/pull/16914, and if there are any other concerns I should be sensitive to, I'd greatly appreciate hearing about them!

duytnguyendtn avatar Sep 04 '24 14:09 duytnguyendtn

On Wed, Sep 04, 2024 at 07:08:20AM -0700, Duy Tuong Nguyen wrote:

largest operation would be making a copy of the new array, an unfortunate side-effect of the immutability of numpy ndarrays.

Yes, and as I commented over on the astropy PR, I think that's by and large a killer; there's a reason why you cannot just pop rows out of numpy ndarrays, and I think that reason applies to us, too.

generic case, and this would also force downstream packages to explicitly declare whether their tables are meant to mutable or not. In those cases, they can override the method to disable the user's ability to delete from the TableElement. I think this exercise of explicitly declaring mutability or immutability for

I agree that being explicit here would be a great thing. And also to provide clear guidance on "Ok, so you have a VOTable, and you want to create a VOTable with more or fewer rows. Here's what to do." Having this for the one-table case first would already be a good thing, paying attention to keeping any additional RESOURCE elements, though (e.g., datalink, perhaps mivot).

But thanks a lot for investigating this. It's always good to get a reality check on one's wishes :-).

msdemlei avatar Sep 05 '24 05:09 msdemlei