
Discussion about how to improve UCUM

Open cmungall opened this issue 1 year ago • 7 comments

I mentioned this briefly in #460.

For more background on UCUM see

  • https://ucum.org/
  • https://en.wikipedia.org/wiki/Unified_Code_for_Units_of_Measure

This is not a straightforward one, but it's very important to get this right.

First, as far as I can tell there is no official resolver, though NLM does offer some services:

https://ucum.nlm.nih.gov/

Second, the syntax of a UCUM code does not necessarily conform to the syntax for CURIEs. This matters less for using Bioregistry as a web resolver, but it matters if we want to standardize how UCUM codes are written as CURIEs. Bioregistry is the best hope of achieving consensus on this. At the moment, some groups are starting to simply write pseudo-CURIEs that ignore W3C syntax, e.g. http://phenopackets.org/phenopacket-tools/constants.html#unit

There is a group in OBO (see units channel, cc @jamesaoverton) who have worked for some time to develop a standard way of writing units as URIs, see https://units-of-measurement.org/

Example: https://units-of-measurement.org/dL.g-1

There is no bioregistry entry for this system, but it is registered with w3id as uom, thus: https://w3id.org/uom/dL.g-1

There is a separate issue for this group to spec out the rules for encoding UCUM codes as CURIEs/URIs:

https://github.com/units-of-measurement/units-of-measurement/issues/45

cmungall avatar Nov 08 '22 04:11 cmungall

Not sure where the confusion was, but this prefix already has been registered: http://bioregistry.io/registry/ucum

cthoyt avatar Nov 08 '22 07:11 cthoyt

Oh wow, I strongly recommend you mark this as experimental or something. There are a lot of issues here: this confuses authority with resolvers, and the majority of CURIEs don't resolve, see encoding issues above

cmungall avatar Nov 08 '22 15:11 cmungall

this confuses authority with resolvers

Right now the Bioregistry doesn't explicitly keep track of whether providers are first-party, but if you think this would give records more context then we can start tracking that.

and the majority of CURIEs don't resolve, see encoding issues above

Yup can confirm. Several of them don't resolve on the units-of-measures but I'm not sure if this means that they're invalid within the nomenclature itself. I'll keep up with the discussion here and on Slack and try to support whatever solution comes out of it as well as I can.

cthoyt avatar Nov 08 '22 16:11 cthoyt

Yup can confirm. Several of them don't resolve on the units-of-measures but I'm not sure if this means that they're invalid within the nomenclature itself.

The examples are all valid UCUM codes, but they are not all valid UOM CURIEs (and not valid CURIEs at all). The unofficial u-o-m resolver expects these to be percent-encoded (and normalized to exponent form)
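For illustration, the percent-encoding step can be sketched in Python with the standard library's `urllib.parse.quote`. The helper name `uom_iri` is hypothetical, and the base URL is the w3id one mentioned earlier in this thread. This only encodes; it does not normalize a code to exponent form, so it is not a complete picture of what the resolver expects.

```python
from urllib.parse import quote

# w3id base for UOM, as given earlier in this thread
UOM_BASE = "https://w3id.org/uom/"


def uom_iri(code: str) -> str:
    """Percent-encode a code and append it to the UOM base (hypothetical helper)."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default
    return UOM_BASE + quote(code, safe="")


print(uom_iri("[diop]"))  # https://w3id.org/uom/%5Bdiop%5D
print(uom_iri("dL.g-1"))  # https://w3id.org/uom/dL.g-1
```

Letters, digits, `.` and `-` pass through unchanged, so codes that are already in exponent form without special characters are unaffected.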

cmungall avatar Nov 08 '22 17:11 cmungall

Alright, then I'll be happy to accept specific suggestions on improvements to this record!

cthoyt avatar Nov 09 '22 15:11 cthoyt

The unofficial u-o-m resolver expects these to be percent-encoded (and normalized to exponent form)

What you are referring to are the final canonical UOM IRIs/CURIEs. Our software/server lets you generate them: put in any UCUM code and it will create the normalized exponent form for you. https://github.com/units-of-measurement/units-of-measurement/pull/48, now merged into UOM, clarifies this well enough, I think. As for UOM resolving, it's not completely finished, so not all cases work, but it works for most units. E.g. m/d/s becomes https://units-of-measurement.org/m.s-1.d-1. In the future this will resolve for all UCUM codes. Hope that helps.
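To make "normalized exponent form" concrete, here is a minimal Python sketch that rewrites flat division as negative exponents. It is not the UOM implementation: the function name and term pattern are hypothetical, it assumes `.` and `/` are applied left to right, it only handles letter-only unit atoms with an optional integer exponent (no brackets, parentheses, or leading `/`), and it does not reproduce UOM's canonical term ordering (the example above yields `m.s-1.d-1`, whereas this sketch keeps the input order).

```python
import re

# Hypothetical, simplified term pattern: a letter-only unit atom with an
# optional signed integer exponent. Real UCUM syntax is much richer.
_TERM = re.compile(r"^([A-Za-z]+)([+-]?\d+)?$")


def to_exponent_form(code: str) -> str:
    """Rewrite flat '/' division as negative exponents (sketch only)."""
    # Split into terms while keeping the "." and "/" operators as tokens.
    tokens = re.split(r"([./])", code)
    out = []
    sign = 1  # exponent sign for the next term; -1 right after a "/"
    for tok in tokens:
        if tok == ".":
            sign = 1
        elif tok == "/":
            sign = -1
        else:
            m = _TERM.match(tok)
            if m is None:
                raise ValueError(f"unsupported term: {tok!r}")
            atom = m.group(1)
            exp = sign * int(m.group(2) or 1)
            out.append(atom if exp == 1 else f"{atom}{exp}")
    return ".".join(out)


print(to_exponent_form("dL/g"))  # dL.g-1
```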

As for how you treat things on Bioregistry, that's another story. I can't comment other than to say that UOM isn't officially endorsed by UCUM, but UOM is allowed to use UCUM.

kaiiam avatar Nov 09 '22 19:11 kaiiam

Looks like the majority of these still don't work

https://bioregistry.io/registry/ucum

Anything with a slash results in a server error: https://bioregistry.io/reference/ucum:dL/g (note that dL/g is not a canonical UOM serialization but it is valid UCUM)

This resolves https://bioregistry.io/reference/ucum:%25

But when we try to resolve with the default provider, we get a 404: https://units-of-measurement.org/%

Surprisingly this works: https://bioregistry.io/reference/ucum:[diop], despite ucum:[diop] not being a valid CURIE (https://github.com/biopragmatics/curies/issues/103). This redirects to https://units-of-measurement.org/[diop], which works, but the actual URL that should be used is the encoded one: https://w3id.org/uom/%5Bdiop%5D

The more correct https://bioregistry.io/reference/ucum:%5Bdiop%5D also works

There is also the issue that the "default provider" for UCUM is UOM. This is a little problematic. I think the current prefix should be UOM, not UCUM, and the regex should forbid [diop] as a local/reference id. You may still want to have a separate entry for UCUM that resolves to official UCUM URLs, but these are not precisely the same.
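As a sketch of what such a regex could look like, the Python snippet below accepts only unreserved URL characters and percent-encoded octets. The pattern and the function name are hypothetical; this is not the pattern in the current Bioregistry record, just one possible stricter choice.

```python
import re

# Hypothetical pattern: unreserved characters plus %XX escapes only,
# so raw "[", "]", "/" and a bare "%" are all rejected.
LOCAL_ID = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})+")


def is_acceptable_local_id(local_id: str) -> bool:
    """Return True if the whole local id matches the stricter pattern."""
    return LOCAL_ID.fullmatch(local_id) is not None


for candidate in ["[diop]", "%5Bdiop%5D", "dL/g", "dL.g-1", "%", "%25"]:
    print(candidate, is_acceptable_local_id(candidate))
```

With this pattern, `[diop]`, `dL/g`, and a bare `%` are rejected, while `%5Bdiop%5D`, `dL.g-1`, and `%25` pass.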

cmungall avatar Feb 26 '24 22:02 cmungall