update definition of season_temp MIXS:0000643
Current term details Please supply the current details of the term that you would like to update:
Term name - mean seasonal temperature
Term ID - MIXS:0000643
Structured comment name - season_temp
Definition - Mean seasonal temperature
Expected value - measurement value
Value syntax - {float} {unit}
Example - 18 degree Celsius
Preferred unit - degree Celsius
Package(s) - multiple
Suggested update(s)
Definition -
Additional context There was a discussion during CIG on 27 Sept 2021 that the definition needs to be refined so that everyone knows what "season" means. Is it a season such as spring or summer, the entire "growing season" of a crop, or something else?
This term came up recently with an EMSL user trying to fill out metadata. They had similar questions to what was described above.
- Which season is this referring to? The season in which sampling was performed? Spring? A specific three month period?
- How is this meant to be calculated? They ended up calculating standard deviation of the average monthly temperatures, which I'm not sure is the goal of this term.
Similar confusion when they tried to fill out season_precpt (https://github.com/GenomicsStandardsConsortium/mixs/issues/253)
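To make the calculation question concrete, here is a minimal sketch of one possible reading: season_temp as the unweighted mean of the monthly mean temperatures within the season. The monthly values are made up for illustration, the function name `seasonal_mean_temp` is hypothetical, and nothing in the current MIxS definition confirms that this is the intended calculation. The standard deviation is computed only to show how different it is from the mean.

```python
from statistics import mean, pstdev

def seasonal_mean_temp(monthly_means_c):
    """Unweighted mean of monthly mean temperatures (degrees Celsius).

    Assumption: every month counts equally; differing month lengths are ignored.
    """
    return mean(monthly_means_c)

# Hypothetical monthly mean temperatures for a June-August season.
summer_monthly_means_c = [17.0, 19.0, 18.0]

season_temp = seasonal_mean_temp(summer_monthly_means_c)  # a central value
spread = pstdev(summer_monthly_means_c)                   # a measure of variability

print(f"mean: {season_temp:.1f}, standard deviation: {spread:.2f}")
```

The two numbers answer different questions: the mean describes how warm the season was, while the standard deviation describes how much the months differed from each other.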
Good issue. Here's a table of the values that have been used in this field for Biosamples retrieved from NCBI on 2025-04-25 (out of 45 Million Biosamples)
| season_temp | Biosample count |
|---|---|
| 20 °C | 368 |
| 19.4 °C | 362 |
| 27.8 °C | 274 |
| NA | 168 |
| not collected | 167 |
| 20.7 °C | 120 |
| 10.85 | 110 |
| -0.50C | 103 |
| 14.14C | 73 |
| 10.21 | 69 |
| 2.07C | 63 |
| 15,8C | 60 |
| 9.84 | 51 |
| 28.0 °C | 48 |
| 21.90C | 42 |
| not applicable | 28 |
| missing | 27 |
| 21.1 °C | 25 |
| 28.3 °C | 24 |
| 9.02C | 24 |
| -0.022C | 18 |
| >22 ºC | 12 |
| 18.2-20.5 °C | 10 |
| 30 c | 3 |
| 18-20 C | 1 |
| not provided | 1 |
Additional sausage making with ChatGPT 4.1
I can run this indexed query in the biosamples_attributes collection of my ncbi_metadata MongoDB.
{"harmonized_name":"season_temp"}
And quickly get about 2 thousand documents with an accession like this
{
"_id": {
"$oid": "681b6155636417620ad96c9f"
},
"biosample_id": "5862400",
"accession": "SAMEA4460605",
"attribute_name": "season_temp",
"harmonized_name": "season_temp",
"display_name": "mean seasonal temperature",
"content": "10.85"
}
I can run this indexed query in the biosamples collection of the same database
{"accession":"SAMEA4460605"}
and quickly get one of the documents that has a season_temp value
how can i fetch all biosamples that have a season_temp value? show a bare mongodb aggregation query. optimize the query to take advantage of the indices I described.
If you want only the biosample documents (not the joined attributes):
db.biosamples_attributes.aggregate([
{ $match: { harmonized_name: "season_temp" } },
{ $lookup: {
from: "biosamples",
localField: "accession",
foreignField: "accession",
as: "biosample"
}},
{ $unwind: "$biosample" }, // in case of multiple matches, but usually one
// Optionally project only the biosample document:
{ $replaceRoot: { newRoot: "$biosample" } }
])
I will modify that slightly to return pre-flattened Biosamples, as opposed to nested Biosamples:
[
{ $match: { harmonized_name: "season_temp" } },
{ $lookup: {
from: "biosamples_flattened",
localField: "accession",
foreignField: "accession",
as: "biosample"
}},
{ $unwind: "$biosample" }, // in case of multiple matches, but usually one
// Optionally project only the biosample document:
{ $replaceRoot: { newRoot: "$biosample" } }
]
Dropping empty columns and useless rows:
python -c "import pandas as pd, sys; bad = ['NA', 'not collected', 'not applicable', 'missing', 'not provided']; df = pd.read_csv(sys.argv[1]); print('Before:', df.shape); df = df[~df['season_temp'].astype(str).str.strip().isin(bad)] if 'season_temp' in df.columns else df; print('After filter:', df.shape); df = df.dropna(axis=1, how='all'); print('After dropna:', df.shape); df.to_csv(sys.argv[2], index=False)" ncbi-biosamples-with-season_temp.csv ncbi-biosamples-with-season_temp-useful-columns.csv
Before: (2251, 210) After filter: (2029, 210) After dropna: (2029, 200)
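For readers who find the one-liner hard to follow, here is a rough standard-library sketch of the same two steps (drop rows without a useful season_temp, then drop columns that are empty in every remaining row). The helper names `is_useful` and `clean_rows` and the sample data are hypothetical. It deliberately differs from the one-liner in two ways, both assumptions on my part: matching is case-insensitive, and empty strings are treated as missing. Note also that the `csv` module keeps the literal string `NA` as text, whereas `pandas.read_csv` converts it to a missing value by default, so row counts from the two approaches may not agree.

```python
import csv
import io

# Missing-value markers from the list above, lower-cased for comparison.
MISSING_MARKERS = {"na", "not collected", "not applicable", "missing", "not provided"}

def is_useful(value):
    """True if the value is non-empty and not a known missing-value marker."""
    text = (value or "").strip()
    return text != "" and text.casefold() not in MISSING_MARKERS

def clean_rows(rows):
    """Keep rows with a useful season_temp, then drop all-empty columns."""
    kept = [row for row in rows if is_useful(row.get("season_temp"))]
    if not kept:
        return []
    used = [col for col in kept[0]
            if any((row.get(col) or "").strip() for row in kept)]
    return [{col: row[col] for col in used} for row in kept]

# Hypothetical four-row sample standing in for the exported CSV.
sample_csv = (
    "accession,season_temp,host\n"
    "S1,10.85,\n"
    "S2,NA,\n"
    "S3,Not applicable,\n"
    "S4, 20 °C ,\n"
)
cleaned = clean_rows(list(csv.DictReader(io.StringIO(sample_csv))))
print(len(cleaned), "rows,", len(cleaned[0]), "columns")
```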
Next prompt, including the output of the filtering above
My colleagues and I don't understand the implications of the MIxS season_temp field that can be used by INSDC Biosamples. The only remotely helpful annotation in the schema is the description of "Mean seasonal temperature"
We have a lot of questions like "how do we know what season is in scope? Summer, fall, winter, spring? A growing season?"
I have attached all biosample records that have a useful season_temp, with almost all fields (except antibiogram tables and detailed information about the person or organization associated with the Biosample). season_temp values are not considered useful if they are null, an empty string, or values like
['NA', 'not collected', 'not applicable', 'missing', 'not provided']
looking at this data, can you infer any more semantics about season_temp? what guidance would you give us about how it could be better documented? like should the values consist of the temperature and a qualifier like "spring: 20 C"?
What we can see in the records you provided
| observation | count | comment |
|---|---|---|
| total rows | 2 029 | 100 % of the file |
| rows that have a numeric temperature we could parse | 1 860 | everything else is free-text like “18-20 °C”, “>22 ºC”, “Not applicable”, etc. |
| rows that also carry a populated season field (winter / spring / summer / fall) | 323 (≈ 16 %) | the rest say “Not applicable” or are empty |
| value range after parsing numbers | -0.5 °C … 30 °C | median ≈ 19.4 °C, IQR 14.1 – 20.7 °C |
Typical payloads look like:
| season | season_temp | note |
|---|---|---|
| winter | 2.07C | |
| spring | 14.14C | sometimes written “15,8C” with a comma |
| not given | 27.8 °C | single value with unit |
| not given | 18-20 C | range encoded in one field |
| not given | >22 ºC | inequality encoded in one field |
What those patterns imply about season_temp
- It is always a temperature (°C) but the “season” is only sometimes given. Rows that do state the season use the familiar four-season vocabulary, not “growing season” or “rainy season”.
- When season is missing it does not correlate with month of collection (e.g., samples taken in July may list 10 °C, 20 °C, or be blank). This suggests submitters were asked for a climatological mean rather than an instantaneous measurement.
- Values cluster around plausible local climate normals (e.g., ~28 °C for Taiwan, ~-0.5 °C for Arctic soils). That supports the reading “mean temperature of the sampling season at the sampling site”.
- Free-text formatting is common (ranges, inequalities, embedded commas, lower-case “c”, etc.), so consumers have to parse and normalise aggressively.
Recommended documentation tweaks for MIxS
| Item | Why it matters | Proposal |
|---|---|---|
| Explicit season | 84 % of current records omit it, leaving the temperature uninterpretable. | Make season required when season_temp is present. Constrain to an enumerated list: winter, spring, summer, fall, plus (if desired) growing_season or a free-text “other” with a companion field that explains it. |
| Definition | Current text (“Mean seasonal temperature”) does not say of what season. | “Mean air temperature (°C) for the declared season at the sampling site, calculated over climatological normals (≥ 10-year average). It is not the temperature on the sampling date.” |
| Allowed value type | Machines cannot reliably parse “18-20 °C” or “>22 ºC”. | Constrain to single numeric value in °C (float). Add separate slots if the study needs a min/max range (season_temp_min, season_temp_max). |
| Unit handling | Nearly every record already implies Celsius, but syntax is inconsistent. | Make the unit implicit (°C) and forbid embedding it in the value string. |
| Examples | Submitters need a pattern to copy. | `season: summer`, `season_temp: 27.8`; `season: winter`, `season_temp: -0.5` |
| Validation hint | Helps repositories like INSDC give immediate feedback. | Regex such as `^-?\d+(\.\d+)?$` after trimming whitespace; reject if it contains letters, range separators, or inequality signs. |
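The proposals in the table can be sketched as a small validator. This is only an illustration under stated assumptions: the function name `validate_season_temp` is hypothetical, the allowed seasons are limited to the four-season list (the table leaves growing_season and “other” as optional extensions), and the regex is the one suggested in the validation hint.

```python
import re
from typing import List, Optional

# Assumption: only the four-season vocabulary is accepted here.
ALLOWED_SEASONS = {"winter", "spring", "summer", "fall"}

# Regex from the validation hint: a single plain number, optional sign and decimals.
PLAIN_NUMBER = re.compile(r"^-?\d+(\.\d+)?$")

def validate_season_temp(season: Optional[str], season_temp: str) -> List[str]:
    """Return a list of problems; an empty list means the pair is acceptable."""
    problems = []
    if season is None or season.strip().lower() not in ALLOWED_SEASONS:
        problems.append("season is required and must be one of: "
                        + ", ".join(sorted(ALLOWED_SEASONS)))
    if not PLAIN_NUMBER.match(season_temp.strip()):
        problems.append("season_temp must be a single number in Celsius, "
                        "without units, ranges, or inequality signs")
    return problems

print(validate_season_temp("summer", "27.8"))
print(validate_season_temp(None, "18-20 °C"))
```

Returning a list of messages rather than a boolean lets a submission portal show every problem at once.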
Migration guidance for existing data
- Parse what you can: one pass with a tolerant regex (“find the first number”) recovers ≈ 92 % of the values you sent. Flag anything that still contains non-numeric characters for manual review.
- Impute the season if collection date is present:
  - Northern hemisphere: Dec–Feb = winter, Mar–May = spring, Jun–Aug = summer, Sep–Nov = fall.
  - Southern hemisphere: shift by six months.
  - Record that the season was inferred so it is clear to downstream users.
- Store cleaned data in separate fields (season, season_temp) and keep the original free-text in a season_temp_raw column for provenance.
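The migration steps above could look roughly like the following sketch. All names (`parse_season_temp`, `infer_season`, `ParsedTemp`) are hypothetical. Two assumptions go beyond the text: a decimal comma such as “15,8C” is read as 15.8, and a value is flagged for review only when something other than the first number and a Celsius unit remains (so “2.07C” passes, while “18-20 C” and “>22 ºC” are flagged).

```python
import re
from typing import NamedTuple, Optional

# "Find the first number": optional sign, digits, optional decimal point or comma.
FIRST_NUMBER = re.compile(r"-?\d+(?:[.,]\d+)?")
# Leftover text that is nothing more than an optional Celsius unit.
CELSIUS_UNIT = re.compile(r"^\s*(?:[°º]\s*|degrees?\s+)?(?:c|celsius)?\s*$", re.IGNORECASE)

class ParsedTemp(NamedTuple):
    value: Optional[float]  # first number found, or None
    needs_review: bool      # True if the raw text held more than number + unit

def parse_season_temp(raw: str) -> ParsedTemp:
    text = raw.strip()
    match = FIRST_NUMBER.search(text)
    if match is None:
        return ParsedTemp(None, True)
    value = float(match.group().replace(",", "."))
    leftover = text[:match.start()] + text[match.end():]
    return ParsedTemp(value, CELSIUS_UNIT.match(leftover) is None)

SEASONS = ["winter", "spring", "summer", "fall"]

def infer_season(month: int, southern_hemisphere: bool = False) -> str:
    """Map a collection month (1-12) to a season; shift six months in the south."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if southern_hemisphere:
        month = (month + 6) % 12
    return SEASONS[(month % 12) // 3]

for raw in ["-0.50C", "15,8C", "18-20 C", ">22 ºC", "not collected"]:
    print(raw, parse_season_temp(raw))
```

Keeping `needs_review` alongside the parsed value preserves the distinction between a clean number and a best-effort guess, which matters when the cleaned column is stored next to the raw one.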
One-line “style guide” for submitters
season_temp: give a single mean temperature for the season you specify in season, in Celsius, no units or symbols.
Example: season = summer, season_temp = 27.8
All of the Biosamples that have a season_temp also have a qiita-based sample name. Maybe they are all from EMP500?!
BTW: As a reminder "Not applicable" is one of the INSDC missing terms, that are used for mandatory field_name values. And we all know that this is a frustration for trying to make the input cleaner, but that is what is currently implemented at the NCBI, ENA, DDBJ etc.. season=chilli also sprang to mind, sorry...
I appreciate your review @Woolly-at-EBI. Quoting https://www.insdc.org/technical-specifications/missing-value-reporting/ verbatim
Please use the above standardised missing value vocabulary only if a true value of an expected format for a mandatory field is missing. If a true value is missing for a recommended or an optional field, then these fields should not be used for reporting at all.
So we don't need to make any allowance for the missing value indicators in this slot.
Having said that, I do think it's time we add an INSDC missing value enumeration to the schema. Would you please review that?
- https://github.com/GenomicsStandardsConsortium/mixs/pull/955