
encoding factors in fetch* functions

Open dylanbeaudette opened this issue 2 years ago • 1 comments

A couple of thoughts:

  • there will only be an expectation / possibility of encoding factors in high-level functions such as fetchNASIS, fetchSDA, etc.
  • split functionality for further / better customization
  • uncode() performs the de-coding of values in NASIS, using the latest version of the metadata table from the local database (if possible)
  • a new function or suite of functions would convert specific variables to factors, set the desired levels, and upgrade to ordered factors when appropriate. these functions will include an argument for dropping unused levels
  • behavior of this "second-pass" over the uncoded data can be controlled via an argument to fetchNASIS or a global preference set with options()
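A minimal sketch of how an argument could fall back to a global preference. The option name `soilDB.encodeFactors` and the helper `resolve_encodeFactors()` are hypothetical and only illustrate the idea; they are not part of soilDB.

```r
# Hypothetical helper and option name, for illustration only.
resolve_encodeFactors <- function(encodeFactors = NULL) {
  # an explicit argument wins; otherwise use the global option,
  # and fall back to "none" when the option is unset
  if (is.null(encodeFactors)) {
    encodeFactors <- getOption("soilDB.encodeFactors", default = "none")
  }
  # validate against the three modes discussed in this issue
  match.arg(encodeFactors, choices = c("all", "some", "none"))
}

# explicit argument
resolve_encodeFactors("some")

# global preference, set once per session
options(soilDB.encodeFactors = "all")
resolve_encodeFactors()
```

With this arrangement a user can set the preference once, while individual calls can still override it.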

Chain of functionality:

  1. read data as text
  2. uncode() all coded columns, optionally converting to factors using metadata (encodeFactors='all')
  3. selective encoding (encodeFactors='some') or none (encodeFactors='none')
  4. ???
# in all functions that get data from NASIS
x <- query()
y <- uncode(x, encodeFactors)
return(y)

# high-level functions like fetchNASIS()
x <- getXXX_from_NASIS(encodeFactors)
if (encodeFactors == 'some') {
  .setupNASIS_factors(...)
}
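
The decode-then-optionally-encode step can be illustrated with a toy example. This is only a sketch: `decode_column()` is a hypothetical, simplified stand-in and not the actual uncode() implementation, and the metadata rows below are made-up values rather than real NASIS codes.

```r
# Toy stand-in for the NASIS metadata table; codes and names are illustrative.
md <- data.frame(
  ColumnPhysicalName = "drainagecl",
  ChoiceValue = 1:3,
  ChoiceName = c("excessively", "well", "poorly"),
  stringsAsFactors = FALSE
)

# Hypothetical simplified decoder, not soilDB::uncode()
decode_column <- function(x, column, metadata, encodeFactors = "none") {
  m <- metadata[metadata$ColumnPhysicalName == column, ]
  # replace stored codes with their names
  decoded <- m$ChoiceName[match(x[[column]], m$ChoiceValue)]
  # optionally convert to a factor, with levels taken from the metadata
  if (encodeFactors == "all") {
    decoded <- factor(decoded, levels = m$ChoiceName)
  }
  x[[column]] <- decoded
  x
}

raw <- data.frame(peiid = 1:3, drainagecl = c(2L, 2L, 3L))
str(decode_column(raw, "drainagecl", md, encodeFactors = "none"))
str(decode_column(raw, "drainagecl", md, encodeFactors = "all"))
```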

Apart from the compatibility issue with a pending version of R, there are no reasons why we can't all get what we want out of NASIS. The factor-conversion code can be written to look for NASIS column names, and encode levels according to either the metadata or a manually-specified vector. An invert argument can be added to reverse factor levels, which is sometimes handy. That said, I don't think that we should attempt to convert all character data to factors (e.g. parent material origin) by default, just those that are most commonly used as factors (texture class, hillslope position, drainage class, etc.).

The new function / functions will likely be internal to soilDB, and will "know" how to exclude IDs.

# x: data.frame
# all: encode all character data, or just those manually defined in the function
# invert: invert factor levels / ordering
# drop: drop unused levels
.setupNASIS_factors <- function(x, all = FALSE, invert = FALSE, drop = TRUE) {
  # start from a copy of the input; the steps below would modify it
  res <- x

  # all = TRUE
  # use NASIS metadata

  # all = FALSE
  # use column-specific rules as follows
  # ...
  
  # drop = TRUE
  # drop unused levels, no matter the encoding strategy above

  # modified data.frame is returned
  return(res)
}
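
To make the skeleton above more concrete, here is one possible sketch of the manually-specified-levels path, including the invert and drop arguments. The function name `setup_factors_sketch()`, the `rules` list, and the example data are assumptions made for illustration; the real function may be organized quite differently.

```r
# Hypothetical rule table: column name -> levels, plus whether they are ordered.
rules <- list(
  hillslopeprof = list(
    levels = c("summit", "shoulder", "backslope", "footslope", "toeslope"),
    ordered = TRUE
  )
)

# Sketch only: encode the columns named in `rules`, leave everything else alone
setup_factors_sketch <- function(x, rules, invert = FALSE, drop = TRUE) {
  res <- x
  for (v in intersect(names(rules), names(res))) {
    lv <- rules[[v]]$levels
    # optionally reverse the level ordering
    if (invert) lv <- rev(lv)
    f <- factor(res[[v]], levels = lv, ordered = rules[[v]]$ordered)
    # optionally drop levels that do not occur in the data
    if (drop) f <- droplevels(f)
    res[[v]] <- f
  }
  res
}

s <- data.frame(
  siteiid = c(10, 11, 12),
  hillslopeprof = c("backslope", "summit", "backslope"),
  stringsAsFactors = FALSE
)

# unused levels dropped, order preserved
levels(setup_factors_sketch(s, rules)$hillslopeprof)

# full set of levels, reversed
levels(setup_factors_sketch(s, rules, invert = TRUE, drop = FALSE)$hillslopeprof)
```

Columns without a rule (such as the ID column here) pass through untouched, which is one simple way for the function to "know" how to exclude IDs.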

Finally, I suggest that fetchNASIS() should default to:

  • convert most of the commonly used nominal / ordinal data to factors / ordered factors
  • this would exclude such things as IDs, date/time, names, taxonomic information, or cases with >n unique values
  • factor levels should be set manually in the to-be-written function whenever possible
  • unused levels should be dropped
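
The exclusion rules in this list could be approximated with a simple screening step. The sketch below assumes a hypothetical helper `factor_candidates()`, a naming convention in which ID columns end in "id", and made-up example data; the threshold n is left to the caller.

```r
# Hypothetical helper: which character columns are reasonable factor candidates?
factor_candidates <- function(x, n = 10, id.pattern = "id$") {
  # only character columns are considered
  is.chr <- vapply(x, is.character, logical(1))
  # skip columns with more than n unique values
  few <- vapply(x, function(v) length(unique(v)) <= n, logical(1))
  # skip columns whose names look like IDs (assumed naming convention)
  not.id <- !grepl(id.pattern, names(x))
  names(x)[is.chr & few & not.id]
}

# made-up example data
d <- data.frame(
  pedon_id = c("A1", "A2", "A3", "A4"),
  texcl = c("sl", "l", "sl", "cl"),
  taxonname = c("Alpha", "Beta", "Gamma", "Delta"),
  stringsAsFactors = FALSE
)

factor_candidates(d, n = 3)
```

A screen like this would only be a fallback; explicit, manually-set levels remain preferable wherever they are known.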

dylanbeaudette, Mar 08 '22 19:03

I added two new domain attributes to the query used by uncode() (in .get_NASIS_metadata()) for use in future functions.

MetadataDomainMaster.DomainRanked

  • Of the 439 domains in NASIS metadata, 143 are "ranked", where MetadataDomainDetail.ChoiceSequence denotes the order. Note that the uncode() query orders results by ChoiceValue, not ChoiceSequence, by default.
capability_class, corrosion_concrete, corrosion_uncoated_steel, flooding_duration_class, flooding_ponding_month, potential_frost_action, soil_erodibility_factor, wind_erodibility_index, drainage_class, excavation_difficulty_class, soil_slippage_potential, ponding_duration_class, pore_continuity_vertical, rupture_resist_block_cem, wildlife_rating, mapunit_hel_class, flooding_frequency_class, ponding_frequency_class, date_time_interval_qualifier, erosion_class, fl_soil_leaching_potential, fl_soil_runoff_potential, runoff, taxonomic_family_c_e_act_class, va_soil_management_group, va_soil_productivity_group, bedrock_fracture_interval_class, boundary_distinctness, color_chroma, color_value, concen_redox_boundary, effervescence_class, concen_rmf_mottle_contrast, penetration_resistance, permeability_class, plasticity, pore_root_size, pvsf_distinctness, rupture_resist_block_dry, rupture_resist_block_moist, rupture_resist_plate, stickiness, structure_grade, structure_size, toughness_class, weathering, dmu_investigation_intensity, soil_taxonomy_edition, ia_subsoil_k, ia_subsoil_p, nj_farmland_assessment, Datetime Precision (NASIS 6 Metadata), sat_hyd_conductivity_class, soil_odor_intensity, texture_structure_category, crust_development_class, carbonate_dev_stage_cf, carbonate_dev_stage_fe, pore_quantity_class, abundance_class, canopy_cover_class, cryptogam_cover_class_legacy, cultivation_extent, current_year_precip, damage_degree, daubenmire_canopy_cover_class, decadent_plant_abundance, disturbance_impact, forest_stand_quality, ground_cover_class, ground_cover_extent, growing_season_rating, gully_rill_presence, invading_plants, pci_concentration_areas, pci_desirable_plants, pci_ground_cover_residue, pci_gully_erosion, pci_legume_pct_class, pci_plant_cover, pci_plant_diversity, pci_plant_vigor, pci_sheet_rill_erosion, pci_soil_compaction, pci_standing_dead_forage, pci_stream_shore_erosion, pci_use_uniformity, pci_wind_erosion, plant_density_class, reference_yield_rank, 
reproduction_abundance_class, rhi_annual_production, rhi_bare_ground, rhi_compaction_layer, rhi_erosion_resistance, rhi_functional_struct_groups, rhi_gullies, rhi_infiltration_runoff, rhi_invasive_plants, rhi_litter_amount, rhi_litter_movement, rhi_pedestals_terracettes, rhi_plant_mortality, rhi_reproductive_capability, rhi_rills, rhi_soil_surf_degradation, rhi_summary, rhi_water_flow_patterns, rhi_wind_scour_areas, salinity_class, sampling_intensity, seedling_abundance, sociability_class, soil_compaction, soil_crusting, soil_degradation, soil_surface_erosion, stocking_rate, suppression_degree, tree_condition, vigor_class, ak_ecological_site_status, ak_stratum_cover_class, ak_functional_group, ak_crown_class, ak_grazing_plant_group, rosgen_stream_subclass, ak_grazing_impact, observation_intensity, von_post_humification_scale, osd_text_kind, burn_intensity, crop_arrangement, dominant_vegetation, growth_status, harvest_skidding_method, type_of_burn, years_in, yrs_since_harvest, yrs_since_last_burn, burn_frequemcy, fertility_tests_done, dsp_site_type

See, for example, ponding frequency class, where the ChoiceSequence is not the same as the ChoiceValue. Notably, the ordering includes the obsolete values. In this case, the obsolete class "Common" has a value (5) that does not match its sequence position (4) in the set.

image
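
The practical effect of ordering by ChoiceSequence rather than ChoiceValue can be shown with a toy table. Only the "common" row (value 5, sequence 4) follows the description above; the remaining rows are made-up placeholders, not the actual NASIS domain contents.

```r
# Toy metadata for a ranked domain; rows other than "common" are illustrative.
md <- data.frame(
  ChoiceValue = c(1, 2, 3, 4, 5),
  ChoiceSequence = c(1, 2, 3, 5, 4),
  ChoiceName = c("none", "rare", "occasional", "frequent", "common"),
  stringsAsFactors = FALSE
)

# level order as returned when sorting by ChoiceValue
by.value <- md$ChoiceName[order(md$ChoiceValue)]

# level order implied by ChoiceSequence
by.sequence <- md$ChoiceName[order(md$ChoiceSequence)]

x <- c("frequent", "rare", "common")
f <- factor(x, levels = by.sequence, ordered = TRUE)

# with sequence-based levels, "common" ranks below "frequent"
f[3] < f[1]
```

For ranked domains, an ordered factor built from `by.value` would place "common" last, which is why the sort column matters.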


MetadataDomainMaster.DisplayLabel

  • 30 of the 439 domains have a DisplayLabel value of 1, which means that the ChoiceLabel could/should be used rather than the ChoiceName; the two generally differ only in capitalization.
hydric_condition, nasis_site_office_type, farmland_classification, state_fips_code_alpha, texture_class, texture_modifier, unified_soil_classification, terms_used_in_lieu_of_texture, mapunit_hel_class, erosion_class, nh_important_forest_soil_group, logical_data_type_nasis, sort_type, site_index_curves, legend_suitability_for_use, mou_agency_responsible, ecological_site_mlra, mapunit_text_kind, legend_certification_status, dmu_certification_status, export_certification_status, hydric_soil_indicator, farmland_class_secondary, mapunit_type, cardinality_nasis, column_alignment, default_type, saf_cover_type, sort_direction, soil_type_conversion
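
A future function could use DisplayLabel to pick which column supplies the displayed value. The sketch below uses a made-up metadata excerpt (the specific names and labels are illustrative, not taken from NASIS) and a hypothetical `display` column.

```r
# Made-up metadata excerpt; values are for illustration only.
md <- data.frame(
  DomainName = c("texture_class", "texture_class", "drainage_class"),
  DisplayLabel = c(1, 1, 0),
  ChoiceName = c("sil", "cos", "well"),
  ChoiceLabel = c("SIL", "COS", "Well drained"),
  stringsAsFactors = FALSE
)

# use ChoiceLabel where DisplayLabel is 1, otherwise ChoiceName
md$display <- ifelse(md$DisplayLabel == 1, md$ChoiceLabel, md$ChoiceName)

md[, c("DomainName", "display")]
```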

brownag, Mar 10 '22 01:03