mixs

measurement value ranges

Open wdduncan opened this issue 4 years ago • 20 comments

For the NMDC, we have a number of samples for which the depth is given as a range. We should discuss how best to handle this. For example, one approach is to adopt the Darwin Core (DwC) min depth, max depth, and verbatim depth terms. But we need to discuss other approaches as well.

Similar range needs also apply to fields like temperature and elevation.

cc @ramonawalls

wdduncan avatar Jun 14 '21 15:06 wdduncan

The need for a depth range comes about when collecting soil samples. The soil being analyzed is taken from a homogenized mix of the soil between two depths of the core sample. Note that in other samples (such as water), you can specify a single depth.

Here are some options:

  1. Have 3 terms for depth: depth, minimum_depth, maximum_depth.
  2. Just have two terms for depth: minimum_depth, maximum_depth. If the depth is a single value, then minimum_depth is set to be the same as maximum_depth. This is how time ranges (e.g., start and end date) are sometimes handled in other systems.
  3. Since we are migrating towards managing MIxS in LinkML, we can create specialized min value and max value properties to modify a depth value. In JSON, it would look something like this:
"depth": 
{ 
  "min value": 1,
  "max value": 5
}

Of the three options, I'm a bit partial to 2, although it comes with some headaches. E.g., what do we do with the current depth term? Do we keep it around (which in effect amounts to using option 1)? Do we drop it and provide guidance on how to migrate?

3 is interesting, but this may be problematic for folks used to filling out spreadsheets.

cc @only1chunts @cmungall @dehays @lschriml

wdduncan avatar Jul 15 '21 21:07 wdduncan

Another note:

Darwin Core also has min/max depth terms that we may be able to make use of: minimumDepthInMeters, maximumDepthInMeters

wdduncan avatar Jul 15 '21 21:07 wdduncan

@wdduncan - great summary

Regarding your option 3: regardless of how the schema is implemented, the majority of instance data is in spreadsheets, not JSON. So I would frame your example in terms of what the string serialization would be, which would then be modeled in a normalized database in the appropriate way.

So I would state your options as

  1. 3 terms
  2. 2 terms: min and max
  3. keep the existing field and allow/force ranges
    • 3a. Syntax "NUMBER-NUMBER UNIT"
    • 3b. Syntax "NUMBER[-NUMBER] UNIT"
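
To make the 3a/3b distinction concrete, here is a minimal sketch of a parser for the 3b syntax. It is not part of MIxS; the names `RANGE_3B`, `Measurement`, and `parse_3b` are hypothetical, and it assumes non-negative decimal numbers and exactly one space before the unit.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for option 3b: "NUMBER[-NUMBER] UNIT".
# Assumption: non-negative decimals, one space before the unit.
RANGE_3B = re.compile(
    r"^(?P<low>\d+(?:\.\d+)?)"        # first number
    r"(?:-(?P<high>\d+(?:\.\d+)?))?"  # optional "-NUMBER" (required in 3a)
    r" (?P<unit>\S+)$"                # unit, e.g. "cm" or "m"
)


class Measurement(NamedTuple):
    low: float
    high: float
    unit: str


def parse_3b(text: str) -> Optional[Measurement]:
    """Return a Measurement, or None if the text does not follow 3b."""
    m = RANGE_3B.match(text.strip())
    if m is None:
        return None
    low = float(m.group("low"))
    # A single value is treated as a range whose ends coincide.
    high = float(m.group("high")) if m.group("high") is not None else low
    return Measurement(low, high, m.group("unit"))
```

Under this sketch, a single value such as "10 cm" parses with low equal to high, which is one possible answer to the depth=min=max question for existing data.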

cmungall avatar Jul 15 '21 21:07 cmungall

The proposal also has to address backwards and forwards compatibility.

If 2 or 3a is chosen, what do we do with existing data that is a single value? Do we create depth=min=max?

cmungall avatar Jul 15 '21 21:07 cmungall

It may be best to go with option 3b "NUMBER[-NUMBER] UNIT" and let the vendors implement the field as they see fit, e.g., by having min and max fields in the sample database table.

wdduncan avatar Jul 16 '21 01:07 wdduncan

There is already a great deal of variation in the usage of the term depth. I just did a quick and dirty search: out of ~125k soil samples with a depth field in BioSamples, ~62k include a hyphen "-" within the value, suggesting there is already fairly widespread usage of "NUMBER-[NUMBER]" type values. So, for backward compatibility, I think option 3b looks most reasonable.

only1chunts avatar Jul 20 '21 10:07 only1chunts

OK, seems like we are in agreement on 3b.

Based on discussion with some of our scientists, I would also like language that ranges are preferred over unitary values. I would use ISO language here, e.g.

"range SHOULD be specified as a range delimited by a hyphen. However, in cases where the range is not known, this MAY be specified as a unitary value"

cmungall avatar Aug 06 '21 16:08 cmungall

I am on board with option 3b as well. Let's finalize on Monday.

@raissameyer you should be aware of this as it may impact your mapping.

ramonawalls avatar Aug 06 '21 16:08 ramonawalls

FWIW, here are the top values for depth in INSDC

count value
61393 0
14776 not applicable
9890 missing
9572 0.1
8397 0.01
5501 0-10 cm
5066 surface
4280 0-20 cm
4151 10 cm
4107 5
3985 0-10cm
3920 NA
3900 not collected
3725 0-20cm
3509 0.0
3323 20 cm
3201 0-15cm
3193 1
3079 10
2944 1m
2877 0 m
2632 0-5 cm
2631 1-10cm
2601 15cm
2443 0.05
2307 5-1000m
2238 0.2
2180 5cm
2163 20
2097 0.5
2096 5 cm
2037 0.1 m
2022 10cm
1981 0.05m
1901 20cm
1883 0-0.1
1867 0m
1666 Unknown
1506 0.3
1486 0.5m
1439 1 m
1433 0.025
1412 50fsw
1395 2 m
1352 3
1342 15 cm
1336 0.01 m
1318 0-15 cm
1141 [0m-40m]
1114 5m
1084 30
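
As a rough illustration of how mixed these values are, the following sketch buckets a handful of rows from the table above into missing, range, single, and other. The `classify` function, its patterns, and the `MISSING` set are assumptions made for this example, not an agreed normalization; the counts are copied from the table.

```python
import re
from collections import Counter

# Assumed set of "no value" markers, taken from values seen in the table.
MISSING = {"not applicable", "missing", "na", "not collected", "unknown"}

NUM = r"\d+(?:\.\d+)?"
# Lenient patterns: the unit is optional and may or may not follow a space.
SINGLE = re.compile(rf"^{NUM}\s*[a-z]*$")
RANGE = re.compile(rf"^{NUM}\s*[a-z]*\s*-\s*{NUM}\s*[a-z]*$")


def classify(value: str) -> str:
    """Bucket a raw depth string; anything unrecognized is 'other'."""
    v = value.strip().lower()
    if v in MISSING:
        return "missing"
    if RANGE.match(v):
        return "range"
    if SINGLE.match(v):
        return "single"
    return "other"


# A few (value, count) rows copied from the table above.
sample = {
    "0": 61393,
    "not applicable": 14776,
    "0-10 cm": 5501,
    "surface": 5066,
    "0-10cm": 3985,
    "1m": 2944,
    "50fsw": 1412,
    "[0m-40m]": 1141,
}

totals = Counter()
for value, count in sample.items():
    totals[classify(value)] += count

for bucket, count in totals.most_common():
    print(bucket, count)
```

Values such as "surface" and "[0m-40m]" fall into "other" here, which shows that a strict "NUMBER[-NUMBER] UNIT" rule would need a migration story for them.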

cmungall avatar Aug 06 '21 16:08 cmungall

Discussed on call on Aug. 9 and agreed on 3B

ramonawalls avatar Aug 09 '21 15:08 ramonawalls

Update other similar terms.

ramonawalls avatar Aug 09 '21 15:08 ramonawalls

TODO: change syntax to "NUMBER[-NUMBER] UNIT" for depth.

Leave this issue open for MIxS 7. Some fields should have a range, whereas some should have errors.

ramonawalls avatar Aug 23 '21 15:08 ramonawalls

We need to consider cases in which negative numbers are used (e.g., temps below freezing).

Use cases I can think of:

  • A single negative number (e.g., -10 C). The - needs to be interpreted correctly.
  • Two negative numbers (e.g., -10 to -20 C). Should we require the second number to be in parens (e.g., -10-(-20) C)?
  • A positive and a negative number (e.g., 5 to -10 C or -10 to 5 C). Using parens, the first would look like 5-(-10) C. The second would simply be -10-5 C.

Are these examples clear? Are the parens too confusing?
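
One way to test whether the parenthesized form is workable is to write the pattern down. The sketch below is an assumption-laden illustration, not a proposed MIxS rule: `parse_signed_range` is a hypothetical name, the first number may carry a sign, and a negative second number must be wrapped in parens.

```python
import re
from typing import Optional, Tuple

SIGNED = r"[-+]?\d+(?:\.\d+)?"
UNSIGNED = r"\d+(?:\.\d+)?"

# Assumed syntax: FIRST[-SECOND] UNIT, where SECOND is either an unsigned
# number or a signed number in parens, e.g. "-10-(-20) C" or "-10-5 C".
PATTERN = re.compile(
    rf"^(?P<first>{SIGNED})"
    rf"(?:-(?:\((?P<paren>{SIGNED})\)|(?P<plain>{UNSIGNED})))?"
    r" (?P<unit>\S+)$"
)


def parse_signed_range(text: str) -> Optional[Tuple[float, float, str]]:
    """Return (first, second, unit) in the order written, or None."""
    m = PATTERN.match(text.strip())
    if m is None:
        return None
    first = float(m.group("first"))
    second_text = m.group("paren") or m.group("plain")
    # A single value is returned with both ends equal.
    second = float(second_text) if second_text is not None else first
    return (first, second, m.group("unit"))
```

With this pattern, "-10--20 C" is rejected rather than guessed at, so the parens are what remove the ambiguity between the range delimiter and the minus sign.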

wdduncan avatar Sep 07 '21 21:09 wdduncan

all great stuff, but too much to implement in v6, so I am removing this ticket from the v6 project and labelling with v7 discussion label.

only1chunts avatar Oct 04 '21 15:10 only1chunts

We need to consider cases in which negative numbers are used (e.g., temps below freezing).

Use cases I can think of:

  • A single negative number (e.g., -10 C). The - needs to be interpreted correctly.
  • Two negative numbers (e.g., -10 to -20 C). Should we require the second number to be in parens (e.g., -10-(-20) C)?
  • A positive and a negative number (e.g., 5 to -10 C or -10 to 5 C). Using parens, the first would look like 5-(-10) C. The second would simply be -10-5 C.

Are these examples clear? Are the parens too confusing?

To add some context to Bill's recommendation for negative values: one use case happens in peatland. In this ecosystem there's undulation, with "lower" sections called hollows and "raised" sections called hummocks. When sampling soil, "distance from the surface" isn't always measured relative to the same surface. So, 0-10cm from the surface of the hollow is at the same level as 10-20cm from the surface of the hummock. In the research I've been involved in, to work around this "like depth, different location" issue, we used -0-10 for "distance below the surface of the hollow" and +0+10 for distance above the surface of the hollow and into the hummock. This also keeps all subsequent depths aligned. Here's an image to hopefully help illustrate this: https://drive.google.com/file/d/1Tbwadh1hvLQqtGEFKOVZAY1iZESPXtQx/view?usp=sharing

Also, sometimes, even if not relevant or needed, people will include -0-10 vs 0-10, even if it's the same thing.


mslarae13 avatar Oct 20 '21 19:10 mslarae13

Proposal on 2022-07-26: Break up the value into atomic fields, e.g., one field each for:

  • depth
  • start begin
  • depth end
  • unit

How does this affect user experience and tools to parse data?

We need to consider if it is best to simply have start and end fields, with those being equal for single point cases.

wdduncan avatar Jul 26 '22 15:07 wdduncan

I'm not sure I understand "start begin" & "depth end". Are you saying to separate the depth values when there are 2 (soil, sediment) and use begin and end, and to use depth when there's only 1 (water)?

User experience, it's another column in an already wide sheet. BUT might bring their attention to "this should be a range" & make validation easier.

Note for NMDC, unit isn't needed. We will require meters.

mslarae13 avatar Jul 26 '22 16:07 mslarae13

start begin

Sorry, that was a typo. The approach advocated by @pbuttigieg would be to have generic fields such as:

  • range start
  • range end
  • unit

For non-range measurements, the range start and range end values would be the same.
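
A minimal sketch of how such atomic fields could look in code, assuming the three generic fields listed above. The class name `RangeMeasurement` and its helpers are hypothetical and only illustrate the "start equals end for a point" convention.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class RangeMeasurement:
    """Hypothetical atomic representation: range start, range end, unit."""
    range_start: float
    range_end: float
    unit: str

    @classmethod
    def point(cls, value: float, unit: str) -> "RangeMeasurement":
        # A non-range measurement stores the same value in both fields.
        return cls(value, value, unit)

    @property
    def is_point(self) -> bool:
        return self.range_start == self.range_end

    def to_row(self) -> Dict[str, Union[float, str]]:
        # One spreadsheet column per field; no string parsing is needed.
        return {
            "range start": self.range_start,
            "range end": self.range_end,
            "unit": self.unit,
        }


soil = RangeMeasurement(0.0, 0.1, "m")    # homogenized core section
water = RangeMeasurement.point(5.0, "m")  # single sampling depth
```

The trade-off is the one raised earlier in the thread: each value costs extra columns in the sheet, but a consumer never has to parse a hyphenated string.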

wdduncan avatar Jul 26 '22 19:07 wdduncan

Thanks @wdduncan

Recalling the overall goal is to avoid having to write custom code to parse syntax in a data standard (values should be as simple as possible):

I'm actually impartial to whether there are range fields alone or accompanied by a point measurement field. The concern that this would be confusing for some prompted the suggestion of using only range fields and instructing users to enter identical begin/end values.

DwC's verbatim fields are handy for legacy data or data gathered in non-machine-friendly ways (scrawlings in a field notebook, "...the creature was retrieved from about half an arm's length deep")

Further:

As discussed in previous CIG calls on atomisation and improved actionability, as well as at the last board meeting, I would leave out the "unit" field, instead requiring standard units (e.g. meters) in each field.

There is too much variation in the units used, no validation of what's entered, and no stable way to autoconvert between units.

pbuttigieg avatar Jul 27 '22 16:07 pbuttigieg

In some of the software systems I've worked with, the software would automatically set the range end value equal to the range start in cases where only a single value was required. I don't think this is a major impetus to having both range start/end fields, but the guidance for how to use them needs to be clear.

I think it is reasonable to have the unit field. Not everyone works in units of meters. We may require that the unit come from a standardized source, though.

wdduncan avatar Jul 28 '22 15:07 wdduncan