astropy icon indicating copy to clipboard operation
astropy copied to clipboard

ENH: adding parquet.votable

Open bsipocz opened this issue 1 year ago • 14 comments

This PR adds a new format parquet.votable to be able to read parquet files that include a votable metadata.

I open this as a draft for the first round of comments, and while adding some narrative documentation.

cc @tomdonaldson @gpdf @afaisst

  • [x] By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

bsipocz avatar May 02 '24 20:05 bsipocz

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • [ ] Do the proposed changes actually accomplish desired goals?
  • [ ] Do the proposed changes follow the Astropy coding guidelines?
  • [ ] Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • [ ] Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • [ ] Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • [ ] Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • [ ] Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • [ ] Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • [ ] At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

github-actions[bot] avatar May 02 '24 20:05 github-actions[bot]

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

github-actions[bot] avatar May 02 '24 20:05 github-actions[bot]

Fixed up the codestyle advisories, but two remained and are related to XML. How do we ended up standing with defusedxml? Should I just ignore these for now?

bsipocz avatar May 02 '24 20:05 bsipocz

@pllim - WIW, this again run into some coverage shenanigans, something else than the previous upload fail.

bsipocz avatar May 22 '24 23:05 bsipocz

You have to rebase anyway, I think, to pick up fix for RTD. Also maybe address the pre-commit checks? I don't know why there are two different codecov/project statuses here. Hopefully it would make more sense once you rebase. 🤞

pllim avatar May 23 '24 15:05 pllim

Thank you for your work, this PR would be really useful for hipscat. The standard describes column data types, would this PR or future work cover that?

hombit avatar Jun 07 '24 17:06 hombit

cc @fxpineau as I believe you have an interest in the VOTable metadata in parquet files topics

bsipocz avatar Aug 02 '24 21:08 bsipocz

Thank you @bsipocz, I added the 4 key=value pairs:

  • IVOA.VOTable-Parquet.type = Parquet-local-XML
  • IVOA.VOTable-Parquet.encoding = utf-8
  • IVOA.VOTable-Parquet.content = the XML VOTable without the DATA tag
  • IVOA.VOTable-Parquet.version = 1.0

in the arrow schema of parquet files generated on-the-fly by this service prototype generating LSDB/HiPSCat like products.

fxpineau avatar Aug 09 '24 19:08 fxpineau

@bsipocz, I'm interested in annotating parquet files with rich metadata as well, so I'm pleased to see this under development in Astropy. I hope that whatever approach is taken can be standardised across different software so that such files can be written/read interoperably.

I basically agree with this approach, and I've written Java code, that should eventually make its way into STIL/STILTS/TOPCAT, that interoperates with what you've done here. That means that column metadata such as units and description, encoded in the VOTable annotation, can be written by my java and read by your python, and vice versa. If you're interested in trying it out, let me know and I can make a suitable version of stilts/topcat available.

I have comments on a few details though. The parquet metadata written here, as noted by @fxpineau above, writes to the parquet file extra metadata key-value items with the keys IVOA.VOTable-Parquet.content, IVOA.VOTable-Parquet.type, IVOA.VOTable-Parquet.version and IVOA.VOTable-Parquet.encoding.

  • There is also an ARROW:schema key whose value is a base64-encoded string that appears to duplicate all of the above key/value pairs (and may contain some additional arrow-related metadata). While that doesn't seem to break anything, it would be cleaner if the VOTable metadata were not repeated there.
  • The number of keys could be reduced with the aim of simplicity. Although the python code as it stands writes four items, it only reads one of them (IVOA.VOTable-Parquet.content), which is sufficient to detect whether this convention is present and to read the metadata. At least the encoding item could go; we could just mandate UTF-8 to reduce the burden on readers. It would also be possible to remove the type item since it's implied by the presence of the content item.
  • The VOTable contains a DATA element like <ns0:DATA><ns0:PARQUET type="Parquet-local-XML" /></ns0:DATA></ns0:TABLE>. That means that the embedded XML is not a valid VOTable document, since the PARQUET element is not a legal child of the DATA element (it's not defined in the VOTable schema at all). It's not essential that this document is a legal VOTable, but it's tidier if it is, and it would make validation easier. So I'd suggest just removing the DATA element altogether, with the understanding that the first/only DATA-less TABLE element is the one that annotates the parquet data. This is what @fxpineau's implementation currently does, and corresponds to the way that the similar FITS-plus convention has worked in TOPCAT for a long time. Again, the existing python input code does not rely on this DATA element or check for its presence.

mbtaylor avatar Sep 09 '24 09:09 mbtaylor

I should also comment that both the astropy and java readers successfully pick up the VOTable metadata in parquet files written by the CatHAGrid service mentioned by @fxpineau above.

mbtaylor avatar Sep 10 '24 08:09 mbtaylor

Hi humans :wave: - this pull request hasn't had any new commits for approximately 4 months. I plan to close this in 30 days if the pull request doesn't have any new commits by then.

In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary.

If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock.

If you believe I commented on this pull request incorrectly, please report this here.

github-actions[bot] avatar Oct 18 '24 05:10 github-actions[bot]

@pllim when has the bot changed to 4 months, could it be changed back to 6? (I mean this is not the only PR that got prematurely bot commented...)

bsipocz avatar Oct 18 '24 06:10 bsipocz

Actually, it never changed (still about 5 months) but given that each month is not same days, the humanize.naturaldelta function sometimes says 5, sometimes 4. The default is still "5 months" assuming each month is 30 days. So this time, it must have hit an edge case.

See also:

  • https://github.com/pllim/action-astropy-stalebot
  • https://github.com/pllim/action-astropy-stalebot/blob/aed4b8aa0b1cbdb4ef09d476e46ebe0768e75d4b/stale_pull_requests.py#L217

pllim avatar Oct 18 '24 13:10 pllim

So the total time from open to close is still 6 months. At 5 months or so, it issues an warning and then close another month later.

pllim avatar Oct 18 '24 13:10 pllim

Context that may not be clear for those not following the IVOA lists: A lot of people are interested in this and in the coming month we will iterate a lot on the generic picture. So delaying coming up with the metadata handling defaults and behaviour made a lot of sense here.

And docs is to come next week when I'll be in the same room as Andreas.

bsipocz avatar Oct 26 '24 02:10 bsipocz