
update definition of season_temp MIXS:0000643

Open only1chunts opened this issue 4 years ago • 1 comments

Current term details Please supply the current details of the term that you would like to update:

Term name - mean seasonal temperature
Term ID - MIXS:0000643
Structured comment name - season_temp
Definition - Mean seasonal temperature
Expected value - measurement value
Value syntax - {float} {unit}
Example - 18 degree Celsius
Preferred unit - degree Celsius
Package(s) - multiple

Suggested update(s)

Definition - 

Additional context: During CIG (27 Sept 2021) there was a discussion that the definition needs to be refined so that everyone knows what "season" means. Is it a season such as spring, summer, etc., or is it the entire "growing season" of the crop, or something else?

only1chunts avatar Sep 28 '21 16:09 only1chunts

This term came up recently with an EMSL user trying to fill out metadata. They had similar questions to what was described above.

  • Which season is this referring to? The season in which sampling was performed? Spring? A specific three month period?
  • How is this meant to be calculated? They ended up calculating standard deviation of the average monthly temperatures, which I'm not sure is the goal of this term.

Similar confusion when they tried to fill out season_precpt (https://github.com/GenomicsStandardsConsortium/mixs/issues/253)

bmeluch avatar May 25 '24 00:05 bmeluch

Good issue. Here's a table of the values that have been used in this field for Biosamples retrieved from NCBI on 2025-04-25 (out of 45 Million Biosamples)

season_temp     Biosample count
--------------  ---------------
20 °C           368
19.4 °C         362
27.8 °C         274
NA              168
not collected   167
20.7 °C         120
10.85           110
-0.50C          103
14.14C          73
10.21           69
2.07C           63
15,8C           60
9.84            51
28.0 °C         48
21.90C          42
not applicable  28
missing         27
21.1 °C         25
28.3 °C         24
9.02C           24
-0.022C         18
>22 ºC          12
18.2-20.5 °C    10
30 c            3
18-20 C         1
not provided    1

turbomam avatar May 16 '25 13:05 turbomam

Additional sausage making with ChatGPT 4.1

I can run this indexed query in the biosamples_attributes collection of my ncbi_metadata MongoDB.

{"harmonized_name":"season_temp"}

And quickly get about 2 thousand documents with an accession like this

{
  "_id": {
    "$oid": "681b6155636417620ad96c9f"
  },
  "biosample_id": "5862400",
  "accession": "SAMEA4460605",
  "attribute_name": "season_temp",
  "harmonized_name": "season_temp",
  "display_name": "mean seasonal temperature",
  "content": "10.85"
}

I can run this indexed query in the biosamples collection of the same database

{"accession":"SAMEA4460605"}

and quickly get one of the documents that has a season_temp value

how can i fetch all biosamples that have a season_temp value? show a bare mongodb aggregation query. optimize the query to take advantage of the indices I described.

If you want only the biosample documents (not the joined attributes):

db.biosamples_attributes.aggregate([
  { $match: { harmonized_name: "season_temp" } },
  { $lookup: {
      from: "biosamples",
      localField: "accession",
      foreignField: "accession",
      as: "biosample"
  }},
  { $unwind: "$biosample" }, // in case of multiple matches, but usually one
  // Optionally project only the biosample document:
  { $replaceRoot: { newRoot: "$biosample" } }
])

turbomam avatar May 16 '25 13:05 turbomam

I will modify that slightly to return pre-flattened Biosamples, as opposed to nested Biosamples:

[
  { $match: { harmonized_name: "season_temp" } },
  { $lookup: {
      from: "biosamples_flattened",
      localField: "accession",
      foreignField: "accession",
      as: "biosample"
  }},
  { $unwind: "$biosample" }, // in case of multiple matches, but usually one
  // Optionally project only the biosample document:
  { $replaceRoot: { newRoot: "$biosample" } }
]

turbomam avatar May 16 '25 13:05 turbomam

Dropping empty columns and useless rows:

python -c "import pandas as pd, sys; bad = ['NA', 'not collected', 'not applicable', 'missing', 'not provided']; df = pd.read_csv(sys.argv[1]); print('Before:', df.shape); df = df[~df['season_temp'].astype(str).str.strip().isin(bad)] if 'season_temp' in df.columns else df; print('After filter:', df.shape); df = df.dropna(axis=1, how='all'); print('After dropna:', df.shape); df.to_csv(sys.argv[2], index=False)" ncbi-biosamples-with-season_temp.csv ncbi-biosamples-with-season_temp-useful-columns.csv

Before: (2251, 210)
After filter: (2029, 210)
After dropna: (2029, 200)

turbomam avatar May 16 '25 13:05 turbomam

Next prompt, including the output of the filtering above

My colleagues and I don't understand the implications of the MIxS season_temp field that can be used by INSDC Biosamples. The only remotely helpful annotation in the schema is the description of "Mean seasonal temperature"

We have a lot of questions like "how do we know what season is in scope? summer, fall, winter, spring? a growing season?"

I have attached all biosample records that have a useful season_temp, with almost all fields (except antibiogram tables and detailed information about the person or organization associated with the Biosample). season_temp values are not considered useful if they are null, an empty string, or values like

['NA', 'not collected', 'not applicable', 'missing', 'not provided']

looking at this data, can you infer any more semantics about season_temp? what guidance would you give us about how it could be better documented? like should the values consist of the temperature and a qualifier like "spring: 20 C"?

turbomam avatar May 16 '25 13:05 turbomam

What we can see in the records you provided

observation | count | comment
total rows | 2 029 | 100 % of the file
rows that have a numeric temperature we could parse | 1 860 | everything else is free-text like “18-20 °C”, “>22 ºC”, “Not applicable”, etc.
rows that also carry a populated season field (winter / spring / summer / fall) | 323 (≈ 16 %) | the rest say “Not applicable” or are empty
value range after parsing numbers | -0.5 °C … 30 °C | median ≈ 19.4 °C, IQR 14.1 – 20.7 °C

Typical payloads look like:

season          season_temp
--------------  -----------
winter          2.07C
spring          14.14C        (sometimes written “15,8C” with a comma)
—not given—     27.8 °C       (single value with unit)
—not given—     18-20 C       (range encoded in one field)
—not given—     >22 ºC        (inequality encoded in one field)

What those patterns imply about season_temp

  • It is always a temperature (°C) but the “season” is only sometimes given.
    Rows that do state the season use the familiar four-season vocabulary, not “growing season” or “rainy season”.

  • When season is missing it does not correlate with month of collection (e.g., samples taken in July may list 10 °C, 20 °C, or be blank). This suggests submitters were asked for a climatological mean rather than an instantaneous measurement.

  • Values cluster around plausible local climate normals (e.g., ~28 °C for Taiwan, ~-0.5 °C for Arctic soils). That supports the reading “mean temperature of the sampling season at the sampling site”.

  • Free-text formatting is common—ranges, inequalities, embedded commas, lower-case “c”, etc.—so consumers have to parse and normalise aggressively.
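To make the formatting variety concrete, here is a minimal Python sketch that sorts raw season_temp strings into the shapes listed above (single value, range, inequality, missing). The function name `classify` and the pattern set are assumptions made for illustration; they are not part of MIxS or of any INSDC validator, and they only cover the spellings seen in this thread.

```python
import re

# Missing-value spellings seen in the data, compared case-insensitively.
MISSING = {"na", "not collected", "not applicable", "missing", "not provided"}

# A number with either "." or "," as the decimal separator (e.g. "15,8").
NUM = r"-?\d+(?:[.,]\d+)?"
# Optional Celsius marker: "°C", "ºC", "C", "c", with optional spaces.
UNIT = r"\s*(?:[°º]\s*)?[cC]?"

# Order matters: try the more specific shapes before the plain single value.
PATTERNS = [
    ("range", re.compile(rf"^({NUM})\s*-\s*({NUM}){UNIT}$")),
    ("inequality", re.compile(rf"^[<>]=?\s*({NUM}){UNIT}$")),
    ("single", re.compile(rf"^({NUM}){UNIT}$")),
]


def classify(raw: str) -> str:
    """Return 'missing', 'range', 'inequality', 'single', or 'other'."""
    text = raw.strip()
    if not text or text.lower() in MISSING:
        return "missing"
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    return "other"
```

Counting the labels over the whole column would show how many values need more than a simple float conversion before any schema change is discussed.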

Recommended documentation tweaks for MIxS

Item | Why it matters | Proposal
Explicit season | 84 % of current records omit it, leaving the temperature uninterpretable. | Make season required when season_temp is present. Constrain to an enumerated list: winter, spring, summer, fall, plus (if desired) growing_season or a free-text “other” with a companion field that explains it.
Definition | Current text (“Mean seasonal temperature”) does not say of what season. | “Mean air temperature (°C) for the declared season at the sampling site, calculated over climatological normals (≥ 10-year average). It is not the temperature on the sampling date.”
Allowed value type | Machines cannot reliably parse “18-20 °C” or “>22 ºC”. | Constrain to single numeric value in °C (float). Add separate slots if the study needs a min/max range (season_temp_min, season_temp_max).
Unit handling | Nearly every record already implies Celsius, but syntax is inconsistent. | Make the unit implicit (°C) and forbid embedding it in the value string.
Examples | Submitters need a pattern to copy. | season: summer, season_temp: 27.8; season: winter, season_temp: -0.5
Validation hint | Helps repositories like INSDC give immediate feedback. | Regex such as ^-?\\d+(\\.\\d+)?$ after trimming whitespace; reject if it contains letters, range separators, or inequality signs.
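As a rough illustration of the validation hint, the check could look like the Python sketch below. The helper name `is_valid_season_temp` is hypothetical, and the pattern is written with single backslashes because it sits in a Python raw string. This is only a sketch of the proposed rule, not something MIxS or INSDC currently enforces.

```python
import re

# Proposed strict syntax: optional minus sign, digits, optional decimal part.
STRICT_TEMP = re.compile(r"-?\d+(\.\d+)?")


def is_valid_season_temp(value: str) -> bool:
    """True if the value is a bare number after trimming whitespace."""
    # fullmatch anchors at both ends, so units, ranges, and
    # inequality signs all cause a rejection.
    return STRICT_TEMP.fullmatch(value.strip()) is not None
```

Under this rule "27.8" and "-0.5" pass, while values from the existing data such as "18-20 °C", ">22 ºC", and "15,8C" are rejected.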

Migration guidance for existing data

  1. Parse what you can: one pass with a tolerant regex (“find the first number”) recovers ≈ 92 % of the values you sent. Flag anything that still contains non-numeric characters for manual review.
  2. Impute the season if collection date is present:
    Northern hemisphere: Dec–Feb = winter, Mar–May = spring, Jun–Aug = summer, Sep–Nov = fall.
    Southern hemisphere: shift by six months.
    Record that the season was inferred so it is clear to downstream users.
  3. Store cleaned data in separate fields (season, season_temp) and keep the original free-text in a season_temp_raw column for provenance.
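The three migration steps could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the function names, the `latitude` argument used to pick the hemisphere, the `season_inferred` and `needs_review` keys, and the choice to treat a leftover Celsius marker as harmless rather than flagging it.

```python
import re
from datetime import date

FIRST_NUMBER = re.compile(r"-?\d+(?:[.,]\d+)?")
# Leftover text that is only an optional Celsius marker is not flagged.
UNIT_ONLY = re.compile(r"\s*(?:[°º]\s*)?[cC]?\s*")

NORTHERN = {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer",
            9: "fall", 10: "fall", 11: "fall"}


def impute_season(collection_date: date, latitude: float) -> str:
    """Meteorological season; southern hemisphere is shifted by six months."""
    month = collection_date.month
    if latitude < 0:
        month = (month - 1 + 6) % 12 + 1
    return NORTHERN[month]


def clean_record(raw, season=None, collection_date=None, latitude=None):
    """Step 1: parse the first number. Step 2: impute season. Step 3: keep raw."""
    match = FIRST_NUMBER.search(raw)
    value = float(match.group().replace(",", ".")) if match else None
    leftover = raw[:match.start()] + raw[match.end():] if match else raw
    # Flag ranges, inequalities, and anything else beyond a unit marker.
    review = match is None or UNIT_ONLY.fullmatch(leftover) is None
    inferred = False
    if season is None and collection_date is not None and latitude is not None:
        season = impute_season(collection_date, latitude)
        inferred = True  # record that the season was not submitter-provided
    return {"season": season, "season_temp": value, "season_temp_raw": raw,
            "season_inferred": inferred, "needs_review": review}
```

For example, `clean_record("18-20 C")` keeps 18.0 as the parsed value but sets `needs_review`, since taking the first number of a range silently discards the upper bound.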

One-line “style guide” for submitters

season_temp – give a single mean temperature for the season you specify in season, in Celsius, no units or symbols:
Example season = summer, season_temp = 27.8

turbomam avatar May 16 '25 14:05 turbomam

All of the Biosamples that have a season_temp also have a qiita-based sample name. Maybe they are all from EMP500?!

turbomam avatar May 16 '25 14:05 turbomam

BTW: As a reminder, "Not applicable" is one of the INSDC missing-value terms that are used for mandatory field_name values. We all know this is a frustration when trying to make the input cleaner, but that is what is currently implemented at NCBI, ENA, DDBJ, etc. season=chilli also sprang to mind, sorry...

Woolly-at-EBI avatar May 19 '25 08:05 Woolly-at-EBI

I appreciate your review @Woolly-at-EBI. Quoting https://www.insdc.org/technical-specifications/missing-value-reporting/ verbatim

Please use the above standardised missing value vocabulary only if a true value of an expected format for a mandatory field is missing. If a true value is missing for a recommended or an optional field, then these fields should not be used for reporting at all.

So we don't need to make any allowance for the missing value indicators in this slot.

Having said that, I do think it's time we add an INSDC missing value enumeration to the schema. Would you please review that?

  • https://github.com/GenomicsStandardsConsortium/mixs/pull/955

turbomam avatar May 19 '25 23:05 turbomam