`cur_land_use` was an enum before LinkML

Open mslarae13 opened this issue 7 months ago • 9 comments

Something that I think was missed during the migration to LinkML

cur_land_use (https://genomicsstandardsconsortium.github.io/mixs/0001080/) has Range: [String](https://genomicsstandardsconsortium.github.io/mixs/String/)

However, it also has

string_serialization: '[cities|farmstead|industrial areas|roads/railroads|rock|sand|gravel|mudflats|salt flats|badlands|permanent snow or ice|saline seeps|mines/quarries|oil waste areas|small grains|row crops|vegetable crops|horticultural plants (e.g. tulips)|marshlands (grass,sedges,rushes)|tundra (mosses,lichens)|rangeland|pastureland (grasslands used for livestock grazing)|hayland|meadows (grasses,alfalfa,fescue,bromegrass,timothy)|shrub land (e.g. mesquite,sage-brush,creosote bush,shrub oak,eucalyptus)|successional shrub land (tree saplings,hazels,sumacs,chokecherry,shrub dogwoods,blackberries)|shrub crops (blueberries,nursery ornamentals,filberts)|vine crops (grapes)|conifers (e.g. pine,spruce,fir,cypress)|hardwoods (e.g. oak,hickory,elm,aspen)|intermixed hardwood and conifers|tropical (e.g. mangrove,palms)|rainforest (evergreen forest receiving >406 cm annual rainfall)|swamp (permanent or semi-permanent water body dominated by woody plants)|crop trees (nuts,fruit,christmas trees,nursery trees)]'

I suspect this should've been an enum. But an enum couldn't be made directly from this list, as you shouldn't include examples in the enum permissible values.

See NMDC's update: https://microbiomedata.github.io/nmdc-schema/CurLandUseEnum/

MIxS should also provide this enum, and provide the examples and other information given in parentheses in an attribute of the enum.
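A minimal sketch of how the pipe-separated `string_serialization` could be split into bare values plus their parenthetical text. The function name `parse_serialization` and the `note` key are hypothetical, and which LinkML slot the parenthetical text should end up in is deliberately left open. The sample string is a shortened excerpt of the list above.

```python
import re

# Shortened excerpt of the cur_land_use string_serialization, for illustration.
SAMPLE = (
    "[cities|roads/railroads|horticultural plants (e.g. tulips)"
    "|marshlands (grass,sedges,rushes)]"
)

# A value name, optionally followed by one parenthetical at the end.
ITEM_PATTERN = re.compile(r"\s*([^()]+?)\s*(?:\((.*)\))?\s*")


def parse_serialization(serialization: str) -> list[dict]:
    """Split a '[a|b (x,y)|c]' string into dicts with 'text' and 'note'."""
    body = serialization.strip().strip("[]")
    parsed = []
    for item in body.split("|"):
        match = ITEM_PATTERN.fullmatch(item)
        if match is None:
            # Unexpected shape (e.g. nested parentheses): keep the raw item.
            parsed.append({"text": item.strip(), "note": None})
            continue
        parsed.append({"text": match.group(1), "note": match.group(2)})
    return parsed


if __name__ == "__main__":
    for value in parse_serialization(SAMPLE):
        print(value)
```

The output still needs manual review: some parentheticals are examples ("e.g. tulips") while others read more like definitions, so they may not all belong in the same attribute.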

FYI @turbomam

mslarae13 avatar May 26 '25 21:05 mslarae13

related to https://github.com/GenomicsStandardsConsortium/mixs/issues/905

https://github.com/GenomicsStandardsConsortium/mixs/issues/905#issuecomment-2821326758

mslarae13 avatar May 26 '25 21:05 mslarae13

Yes, the dreaded string_serialization comes from the 'Value syntax' in https://github.com/GenomicsStandardsConsortium/mixs6.2_release_candidate/blob/main/GSC-excel-harmonized-repaired/mixs_v6.xlsx.harmonized.tsv

I agree that it should be an enumeration with examples and will work on that now.

turbomam avatar May 27 '25 13:05 turbomam

See also

  • #905
  • #373
  • #333

turbomam avatar May 27 '25 13:05 turbomam

PS the description isn't very good either

Present state of sample site

There are many states that a site could be in beyond the way in which the land is being used.

turbomam avatar May 27 '25 13:05 turbomam

And is this supposed to be aligned with any other system?

  • https://www.fao.org/geospatial/resources/detail/en/c/1024744/?utm_source=chatgpt.com
  • https://land.copernicus.eu/content/corine-land-cover-nomenclature-guidelines/html/?utm_source=chatgpt.com
  • https://land.copernicus.eu/en/products/corine-land-cover?utm_source=chatgpt.com
  • https://www.fao.org/4/x0596e/x0596e01f.htm?utm_source=chatgpt.com
  • https://www.fao.org/land-water/land/land-governance/land-resources-planning-toolbox/category/details/en/c/1036361/?utm_source=chatgpt.com
  • https://www.nrcs.usda.gov/sites/default/files/2022-09/EQIP_Land_Eligibility_and_NPPH_Land_Use_Chart.pdf?utm_source=chatgpt.com
  • https://www.nrcs.usda.gov/conservation-basics/natural-resource-concerns/land-use?utm_source=chatgpt.com

turbomam avatar May 27 '25 13:05 turbomam

Is this aligned with EnvO?

turbomam avatar May 27 '25 14:05 turbomam

To what degree should MIxS capture the fact that https://en.wikipedia.org/wiki/Pinus_pinaster aka http://purl.obolibrary.org/obo/NCBITaxon_71647 is a "conifer" ? Is that word synonymous with https://en.wikipedia.org/wiki/Pinales? or https://en.wikipedia.org/wiki/Pinidae ?

See also https://en.wikipedia.org/wiki/Conifer

turbomam avatar May 27 '25 14:05 turbomam

Table of frequently used values from INSDC Biosamples:

ncbi_metadata.biosamples_attributes.cur_land_use.csv

turbomam avatar May 27 '25 14:05 turbomam

From ChatGPT

Comparison of INSDC/NCBI Biosample cur_land_use Values with Pipe-Separated List and LinkML Enumeration

Observations:

1. Direct Matches:

Some of the values in the INSDC biosample data are direct matches to categories from both the pipe-separated list and LinkML enumeration. Examples include:

  • row crops, pastureland (grasslands used for livestock grazing), mines/quarries, sand, crop trees (nuts, fruit, christmas trees, nursery trees), conifers (e.g. pine, spruce, fir, cypress), and marshlands (grass, sedges, rushes).

2. Category Variations:

Some categories in the INSDC biosample data have slight variations in terminology but align with categories in the enumeration. Examples include:

  • agriculture, agricultural, and crop which could be aligned with row crops, vegetable crops, or crop trees.
  • grass/herbaceous cover and grassland could relate to meadows or pastureland.
  • agricultural experiment and arable field might fit with row crops or small grains.

3. Unaccounted Values:

There are several land use categories in the biosample data that do not appear directly in either the pipe-separated list or the LinkML enumeration. These may represent specific or unique land uses. For example:

  • abandoned grassland, arable cropland for long-term experimentation, no-till system, temperate coniferous forest, National Park, fertilized meadow, and unfertilized pasture grazed by cattle.

4. Overlapping Terms:

Some terms have overlapping meaning but differ slightly in phrasing:

  • shrub land (e.g. mesquite, sage-brush, creosote bush, shrub oak, eucalyptus) and shrub crops (blueberries, nursery ornamentals, filberts) align with shrub land and shrub crops in the enumeration, though with additional specificity in the biosample data.

Summary of Key Comparisons:

High Match:

Terms like row crops, industrial areas, conifers, hardwoods, pastureland, marshlands, vine crops, and shrub land match directly or closely with both the pipe-separated list and LinkML enumeration.

Moderate Match:

Terms such as agriculture, agricultural, and crop are commonly found in the biosample data but appear more generally in the enumeration (e.g., under row crops, vegetable crops, crop trees).

Missing/Unique:

The biosample data contains specific land use terms not covered by either list (e.g., abandoned grassland, temperate coniferous forest, fertilized meadow, no-till system, National Park).


Suggestions:

To better align with the biosample data:

  1. Extend the LinkML enumeration to include terms like agriculture, arable field, fertilized meadow, unfertilized pasture, and National Park that appear in the biosample counts.
  2. Adjust terminology: Consider consolidating overlapping terms (e.g., grassland, grass/herbaceous cover, and meadows) to reduce ambiguity and improve consistency across data sources.
  3. Handle variations: Allow for aliases or synonyms in the LinkML model to capture terms like agricultural experiment or temperate woodland that are used less commonly.
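The split between direct matches and unaccounted values described in the observations above can be sketched roughly as follows. This is an illustration, not the method used to produce the comparison: `normalize` and `classify` are hypothetical helpers, and treating case and comma spacing as insignificant is an assumption. The sample values are taken from the lists in this thread.

```python
import re


def normalize(value: str) -> str:
    """Lowercase, collapse whitespace, and drop spaces around commas."""
    collapsed = re.sub(r"\s+", " ", value.strip().lower())
    return re.sub(r"\s*,\s*", ",", collapsed)


def classify(observed: list[str], permissible: list[str]):
    """Return (matched, unmatched): matched maps observed -> permissible."""
    lookup = {normalize(p): p for p in permissible}
    matched = {}
    unmatched = []
    for raw in observed:
        key = normalize(raw)
        if key in lookup:
            matched[raw] = lookup[key]
        else:
            unmatched.append(raw)
    return matched, unmatched


if __name__ == "__main__":
    # Permissible values as written in the pipe-separated list (no spaces
    # after commas); observed values as written in the comparison above.
    permissible = [
        "row crops",
        "sand",
        "crop trees (nuts,fruit,christmas trees,nursery trees)",
    ]
    observed = [
        "row crops",
        "crop trees (nuts, fruit, christmas trees, nursery trees)",
        "abandoned grassland",
        "National Park",
    ]
    matched, unmatched = classify(observed, permissible)
    print("matched:", matched)
    print("unmatched:", unmatched)
```

Exact matching after normalization only covers the "direct match" bucket. Looser cases such as agriculture or grassland would need a curated alias table, which is a modeling decision rather than something code can infer.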

turbomam avatar May 27 '25 15:05 turbomam