ome-types icon indicating copy to clipboard operation
ome-types copied to clipboard

Support for retrieving original XMLAnnotation value as string

Open jmuhlich opened this issue 1 year ago • 8 comments

For XMLAnnotations, we know from the schema that "The contents of this is not processed as OME XML but should still be well-formed XML". Inside the OMERO code and the Postgres database, the value is indeed just a plain string and no XML processing at all is done. In fact I've noticed that Glencoe's PathViewer app even stores raw JSON in XMLAnnotations (no wrapping root element or anything)!

I'm trying to add support for XMLAnnotations to omero-cli-transfer, but the way ome-types represents the XMLAnnotation Value as a parsed DOM-like AnyElement structure is making it hard to ship XMLAnnotation content faithfully from OMERO A -> ome-xml -> OMERO B. In order to recreate an annotation in the receiving OMERO instance, I need to provide the Value as a string, but without the opening and closing <Value> tag. I tried calling to_xml() on the annotation or the Value, then doing some DOM munging and re-serializing, but since Value can contain mixed XML content it's actually a little tricky and I'm not confident I can reliably obtain the exact original string. I'm currently leaning toward constructing ome_types XMLAnnotation.Value objects explicitly as Value(any_elements=[value_string_from_omero]) so any actual XML content ends up getting xml-escaped and not parsed, but this seems hacky. Do you have any other thoughts? Could ome-types be modified to retain the original string value in addition to the parsed AnyElement thing? Or maybe the AnyElement representation could be produced on-demand from the stored string rather than serve as the internal representation.

jmuhlich avatar Jan 03 '25 00:01 jmuhlich

Could ome-types be modified to retain the original string value in addition to the parsed AnyElement thing? Or maybe the AnyElement representation could be produced on-demand from the stored string rather than serve as the internal representation.

both of these could perhaps work... I know i should probably be able to glean it from your issue here, but do you happen to have a quick example of an XMLAnnotation (perhaps already in the tests?) and a small test or bit of code demonstrating what you wish you could achieve? that would make it easier to implement

tlambert03 avatar Jan 03 '25 00:01 tlambert03

Sure! The test is to encode the 3 strings below as XMLAnnotation Values, write them to an ome-xml document (inside a larger OME structure of course), then read that ome-xml document back in and recover the original strings exactly. The problem arises as soon as the XMLAnnotations are constructed, before the xml round-trip:

from ome_types.model import OME, XMLAnnotation

values = [
    '{"foo":1,"bar:"other"}',
    '  <outer xmlns="http://example.com/">  <inner>ABC</inner>  </outer>  ',
    'hello <mis> </matched> world',
]

ome = OME()
for v in values:
    ome.structured_annotations.append(XMLAnnotation(value=v))

def get_xmlannotation_value(a):
    # What goes here?
    ...

values2 = [get_xmlannotation_value(a) for a in ome.structured_annotations]
for v, v2 in zip(values, values2):
     print("before:", repr(v))
     print("after: ", repr(v2))
     print()
assert values2 == values, "value mismatch"

I tried a number of different things without success. For example this still leaves a bunch of extra namespace declarations in values 2 and 3, and it seems like a lot of work go back and forth between text and DOM multiple times:

import lxml.etree
def get_xmlannotation_value(a):
     e = lxml.etree.fromstring(a.value.to_xml())
     return e.text + ''.join(lxml.etree.tostring(c, encoding=str) for c in e)

Tangentially, I learned that passing certain invalid XML breaks our to_xml() without raising an actual exception:

>>> print(XMLAnnotation(value='before</mismatched>after').to_xml())
<XMLAnnotation xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openmicroscopy.org/Schemas/OME/2016-06 http://www.openmicroscopy.org/Schemas/OME/2016-06/ome.xsd" ID="Annotation:38">
  <Value>before</Value>after</XMLAnnotation>

Note that the text "after" is outside <Value></Value> which violates the ome-xml schema.

Maybe we could handle this like the _quantity fields -- .value would contain the raw text and .value_element (or some other name) would contain a parsed/structured representation. In the constructor or when setting .value, perhaps the caller would need to pass an actual DOM node (e.g. lxml.etree.Element) if they want actual XML content. If they pass a plain string then it would be taken as a literal string and we wouldn't implicitly parse it. My omero-cli-transfer code would then need to attempt to parse the values I get from OMERO to decide which path to take, but I think that's OK.

jmuhlich avatar Jan 03 '25 03:01 jmuhlich

I guess there is also the small issue that actual XML content won't have whitespace preserved (like in my second example value) but if it's semantically equivalent that should be fine I guess. Here are 2 examples of actual XML content in XMLAnnotations I found on our OMERO server, for reference:

 <SegmentDefinition>
           <Name>CCND1-high</Name>
           <CollectionOrder>1</CollectionOrder>
           <DisplayColor>#83e8f5</DisplayColor>
           <BlueSelection>3</BlueSelection>
           <GreenSelection>3</GreenSelection>
           <YellowSelection>3</YellowSelection>
           <RedSelection>1</RedSelection>
           <Erode>1</Erode>
           <Dilate>2</Dilate>
           <HoleSize>160</HoleSize>
           <ParticleSize>50</ParticleSize>
         </SegmentDefinition>
 <ChannelIntensity>
           <Name>Blue</Name>
           <Minintensity>10</Minintensity>
           <Maxintensity>3594</Maxintensity>
         </ChannelIntensity>

jmuhlich avatar Jan 03 '25 05:01 jmuhlich

A quick but big 👍🏽 from my side for having strong support of XMLAnnotations in ome-types. This might be something that needs to be bubbled upstream as well so that we can achieve well-defined semantics. That's obviously a bigger project, but since the XMLAnnotation element says:

The contents of this is not processed as OME XML but should still be well-formed XML.

I could see wrapping any text like JSON with a minimal XML block. (We could go so far as to define a well-known element like <EmbeddedJSON/>...) This would then make the decoding of symbols like & which are tricky in XML less guess work.

cc: @chris-allan @sbesson @erickmartins

joshmoore avatar Jan 03 '25 10:01 joshmoore

This is off-topic but since the OME team has been summoned :grin: , why does PathViewer store raw JSON in XMLAnnotations in the first place rather than something like CommentAnnotation? Was it just to hide it from end users (since the interactive clients don't expose XMLAnnotations)?

jmuhlich avatar Jan 03 '25 16:01 jmuhlich

Quick update on this thread and specifically https://github.com/tlambert03/ome-types/issues/273#issuecomment-2568661667. I was reminded of its existence @joshmoore as part of the discussions in https://github.com/ome/bioformats/pull/4358. Some recent testing indicates newer version of ome-types have stricter validation rules on write. In particular, storing an XMLAnnotation with a value attribute containing unescaped angle brackets is now failing but values with escaped characters are working:

>>> from ome_types.model import XMLAnnotation
>>> print(XMLAnnotation(value='before</mismatched>after').to_xml())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/tmp/venv/lib/python3.12/site-packages/ome_types/_mixins/_kinded.py", line 16, in __init__
    return super().__init__(**data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/tmp/venv/lib/python3.12/site-packages/ome_types/_mixins/_base_type.py", line 89, in __init__
    super().__init__(**data)
  File "/private/tmp/venv/lib/python3.12/site-packages/pydantic/main.py", line 253, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for XMLAnnotation
value
  Value error, mismatched tag: line 1, column 88 [type=value_error, input_value='before</mismatched>after', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
>>> print(XMLAnnotation(value='before&lt;/mismatched&gt;after').to_xml())
<XMLAnnotation xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openmicroscopy.org/Schemas/OME/2016-06 http://www.openmicroscopy.org/Schemas/OME/2016-06/ome.xsd" ID="Annotation:1">
  <Value>before&lt;/mismatched&gt;after</Value>
</XMLAnnotation>

sbesson avatar Sep 08 '25 14:09 sbesson

thanks @sbesson!

realistically, I probably won't be able to dig into this specific topic in the near future (swamped currently). So if you and/or @jmuhlich already have a good sense for the problem that needs to be solved here and have time, i would greatly appreciate a PR

if not, that's fine too! Just didn't want to leave this unacknowledged :)

tlambert03 avatar Sep 11 '25 18:09 tlambert03

From my side, the primary issue was that ome-types was able to write invalid XML annotations and this seems to have been fixed in recent versions. There might still be some remaining questions around the round-tripping or annotations as per the original description but this is not something we have immediate drivers or use cases. I'll leave Jeremy to comment if this remains applicable

sbesson avatar Sep 12 '25 07:09 sbesson