soilDB
encoding factors in fetch* functions
A couple of thoughts:
- there will only be an expectation / possibility of encoding factors in high-level functions such as fetchNASIS, fetchSDA, etc.
- split functionality for further / better customization
  - uncode() performs the de-coding of values in NASIS, using the latest version of the metadata table from the local database (if possible)
  - a new function or suite of functions would convert specific variables to factors, set the desired levels, and upgrade to ordered factors when appropriate. These functions will include an argument for dropping unused levels
- behavior of this "second-pass" over the uncoded data can be controlled via an argument to fetchNASIS or a global preference set with options()
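To make the "argument or global preference" idea concrete, here is a minimal sketch. The option name soilDB.encodeFactors and the function fetch_sketch() are hypothetical, made up for illustration; nothing here is existing soilDB API.

```r
# Hypothetical: the argument default is read from a global option,
# falling back to "some" when the option has not been set.
fetch_sketch <- function(encodeFactors = getOption("soilDB.encodeFactors", default = "some")) {
  # validate against the three modes discussed in this thread
  encodeFactors <- match.arg(encodeFactors, choices = c("none", "some", "all"))
  encodeFactors
}

# a per-call argument overrides the global preference
fetch_sketch("all")

# set the preference globally, then restore the previous state
old <- options(soilDB.encodeFactors = "none")
fetch_sketch()
options(old)
```

An explicit argument always wins over the option, so scripts that need reproducible behavior can ignore whatever a user has set globally.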
Chain of functionality:
- read data as text
- uncode() all coded columns, optionally converting to factors using metadata (encodeFactors='all')
- selective encoding (encodeFactors='some') or none (encodeFactors='none')
- ???
# in all functions that get data from NASIS
x <- query()
y <- uncode(x, encodeFactors)
return(y)
# high-level functions like fetchNASIS()
x <- getXXX_from_NASIS(encodeFactors)
if (encodeFactors == 'some') {
  .setupNASIS_factors(...)
}
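A toy version of the first pass, to show where the encodeFactors modes would branch. The lookup table md, its column names, its codes, and uncode_sketch() are all made up for illustration; the real uncode() reads NASIS metadata and may look quite different.

```r
# Toy stand-in for the metadata table (hypothetical layout, codes, and labels)
md <- data.frame(
  column = "drainagecl",
  code   = c(1L, 2L, 3L),
  label  = c("poorly", "moderately well", "well"),
  stringsAsFactors = FALSE
)

uncode_sketch <- function(x, encodeFactors = c("none", "some", "all")) {
  encodeFactors <- match.arg(encodeFactors)
  # only touch columns that have entries in the metadata table
  for (nm in intersect(names(x), unique(md$column))) {
    m <- md[md$column == nm, ]
    decoded <- m$label[match(x[[nm]], m$code)]
    # 'all': factors straight from metadata; 'none' and 'some': leave as text
    # ('some' is handled later by the second pass in the high-level function)
    x[[nm]] <- if (encodeFactors == "all") factor(decoded, levels = m$label) else decoded
  }
  x
}

x <- data.frame(peiid = c("a", "b"), drainagecl = c(3L, 1L), stringsAsFactors = FALSE)
str(uncode_sketch(x, "none"))
str(uncode_sketch(x, "all"))
```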
Apart from the compatibility issue with a pending version of R, there are no reasons why we can't all get what we want out of NASIS. The factor-conversion code can be written to look for NASIS column names, and encode levels according to either the metadata or a manually-specified vector. An invert argument can be added to reverse factor levels, which is sometimes handy. That said, I don't think that we should attempt to convert all character data → factors (e.g. parent material origin) by default, just those that are most commonly used as factors (texture class, hillslope position, drainage class, etc.).
The new function / functions will likely be internal to soilDB, and will "know" how to exclude IDs.
# x: data.frame
# all: encode all character data, or just those manually defined in the function
# invert: invert factor levels / ordering
# drop: drop unused levels
.setupNASIS_factors <- function(x, all = FALSE, invert = FALSE, drop = TRUE) {
  # all = TRUE
  #   use NASIS metadata
  # all = FALSE
  #   use column-specific rules as follows
  #   ...
  # drop = TRUE
  #   drop unused levels, no matter the encoding strategy above
  # modified data.frame is returned
  return(res)
}
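A rough sketch of how the all = FALSE branch of the skeleton above might be filled in. The rules list, the level order shown, and the name setup_factors_sketch() are assumptions for illustration only; the metadata-driven all = TRUE branch is left out.

```r
# Hypothetical column-specific rules: column name -> levels, lowest to highest
.rules <- list(
  hillslopeprof = c("toeslope", "footslope", "backslope", "shoulder", "summit")
)

setup_factors_sketch <- function(x, rules = .rules, invert = FALSE, drop = TRUE) {
  # columns not named in the rules (IDs, dates, etc.) are never touched
  for (nm in intersect(names(x), names(rules))) {
    lv <- rules[[nm]]
    if (invert) lv <- rev(lv)          # reverse the level ordering on request
    f <- factor(x[[nm]], levels = lv, ordered = TRUE)
    if (drop) f <- droplevels(f)       # drop levels that do not occur in the data
    x[[nm]] <- f
  }
  x
}

d <- data.frame(
  peiid = 1:3,
  hillslopeprof = c("summit", "backslope", "summit"),
  stringsAsFactors = FALSE
)
levels(setup_factors_sketch(d)$hillslopeprof)
levels(setup_factors_sketch(d, invert = TRUE, drop = FALSE)$hillslopeprof)
```

Keying the rules by column name is what lets the function "know" which columns to skip without a separate exclusion list.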
Finally, I suggest that fetchNASIS() should default to:
- convert most of the commonly used nominal / ordinal data to factors / ordered factors
  - this would exclude such things as IDs, date/time, names, taxonomic information, or cases with >n unique values
- factor levels should be set manually in the to-be-written function whenever possible
- unused levels should be dropped
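One possible way to express the exclusion rules as a column filter. The function name candidate_columns(), the name pattern, and the cutoff n = 20 are placeholders I picked for the sketch, not proposed values.

```r
# Return names of character columns that look like reasonable factor candidates.
# exclude_pattern and n are hypothetical defaults for illustration.
candidate_columns <- function(x, n = 20, exclude_pattern = "id$|date|name|^tax") {
  is_chr <- vapply(x, is.character, logical(1))
  # skip columns with more than n unique values
  few <- vapply(x, function(v) length(unique(v)) <= n, logical(1))
  # skip IDs, dates, names, and taxonomic columns by name
  excluded <- grepl(exclude_pattern, names(x), ignore.case = TRUE)
  names(x)[is_chr & few & !excluded]
}

d <- data.frame(
  peiid      = c("1", "2"),
  taxonname  = c("a", "b"),
  obs_date   = c("2019-01-01", "2019-01-02"),
  drainagecl = c("well", "poorly"),
  stringsAsFactors = FALSE
)
candidate_columns(d)
```

Name-based matching is crude; manually specified rules per column would still take precedence wherever they exist.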
I added two new domain attributes to the query used by uncode() (in .get_NASIS_metadata()) for use in future functions.
MetadataDomainMaster.DomainRanked
- Of the 439 domains in NASIS metadata, 143 are "ranked", where MetadataDomainDetail.ChoiceSequence denotes the order. Note that the uncode() query orders results by ChoiceValue, not ChoiceSequence, by default.
capability_class, corrosion_concrete, corrosion_uncoated_steel, flooding_duration_class, flooding_ponding_month, potential_frost_action, soil_erodibility_factor, wind_erodibility_index, drainage_class, excavation_difficulty_class, soil_slippage_potential, ponding_duration_class, pore_continuity_vertical, rupture_resist_block_cem, wildlife_rating, mapunit_hel_class, flooding_frequency_class, ponding_frequency_class, date_time_interval_qualifier, erosion_class, fl_soil_leaching_potential, fl_soil_runoff_potential, runoff, taxonomic_family_c_e_act_class, va_soil_management_group, va_soil_productivity_group, bedrock_fracture_interval_class, boundary_distinctness, color_chroma, color_value, concen_redox_boundary, effervescence_class, concen_rmf_mottle_contrast, penetration_resistance, permeability_class, plasticity, pore_root_size, pvsf_distinctness, rupture_resist_block_dry, rupture_resist_block_moist, rupture_resist_plate, stickiness, structure_grade, structure_size, toughness_class, weathering, dmu_investigation_intensity, soil_taxonomy_edition, ia_subsoil_k, ia_subsoil_p, nj_farmland_assessment, Datetime Precision (NASIS 6 Metadata), sat_hyd_conductivity_class, soil_odor_intensity, texture_structure_category, crust_development_class, carbonate_dev_stage_cf, carbonate_dev_stage_fe, pore_quantity_class, abundance_class, canopy_cover_class, cryptogam_cover_class_legacy, cultivation_extent, current_year_precip, damage_degree, daubenmire_canopy_cover_class, decadent_plant_abundance, disturbance_impact, forest_stand_quality, ground_cover_class, ground_cover_extent, growing_season_rating, gully_rill_presence, invading_plants, pci_concentration_areas, pci_desirable_plants, pci_ground_cover_residue, pci_gully_erosion, pci_legume_pct_class, pci_plant_cover, pci_plant_diversity, pci_plant_vigor, pci_sheet_rill_erosion, pci_soil_compaction, pci_standing_dead_forage, pci_stream_shore_erosion, pci_use_uniformity, pci_wind_erosion, plant_density_class, reference_yield_rank, 
reproduction_abundance_class, rhi_annual_production, rhi_bare_ground, rhi_compaction_layer, rhi_erosion_resistance, rhi_functional_struct_groups, rhi_gullies, rhi_infiltration_runoff, rhi_invasive_plants, rhi_litter_amount, rhi_litter_movement, rhi_pedestals_terracettes, rhi_plant_mortality, rhi_reproductive_capability, rhi_rills, rhi_soil_surf_degradation, rhi_summary, rhi_water_flow_patterns, rhi_wind_scour_areas, salinity_class, sampling_intensity, seedling_abundance, sociability_class, soil_compaction, soil_crusting, soil_degradation, soil_surface_erosion, stocking_rate, suppression_degree, tree_condition, vigor_class, ak_ecological_site_status, ak_stratum_cover_class, ak_functional_group, ak_crown_class, ak_grazing_plant_group, rosgen_stream_subclass, ak_grazing_impact, observation_intensity, von_post_humification_scale, osd_text_kind, burn_intensity, crop_arrangement, dominant_vegetation, growth_status, harvest_skidding_method, type_of_burn, years_in, yrs_since_harvest, yrs_since_last_burn, burn_frequemcy, fertility_tests_done, dsp_site_type
See, for example, ponding frequency class: the ChoiceSequence is not the same as the ChoiceValue. Notably, the ordering includes the obsolete values. In this case the obsolete class "Common" has a value (5) that does not match its sequence position (4) in the set.
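A small sketch of why the sort key matters when building ordered factors. Only the "common" row (value 5, sequence 4) reflects what is described above; the other rows are invented to fill out the toy table, and ordered_from_domain() is a hypothetical helper.

```r
# Toy domain table: only the 'common' row matches the case described above,
# the remaining rows are made up for illustration
dom <- data.frame(
  ChoiceValue    = c(1L, 2L, 3L, 4L, 5L),
  ChoiceSequence = c(1L, 2L, 3L, 5L, 4L),
  ChoiceName     = c("none", "rare", "occasional", "frequent", "common"),
  stringsAsFactors = FALSE
)

# build an ordered factor with levels sorted by ChoiceSequence, not ChoiceValue
ordered_from_domain <- function(x, dom) {
  lv <- dom$ChoiceName[order(dom$ChoiceSequence)]
  factor(x, levels = lv, ordered = TRUE)
}

f <- ordered_from_domain(c("frequent", "common"), dom)
levels(f)
f[2] < f[1]   # "common" sorts below "frequent" when ChoiceSequence is used
```

Sorting the same table by ChoiceValue would put "common" last, so any comparison or plot relying on level order would change.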
MetadataDomainMaster.DisplayLabel
- 30 of the 439 domains have a DisplayLabel value of 1, which means that the ChoiceLabel could/should be used rather than the ChoiceName; the two generally differ only in capitalization.
hydric_condition, nasis_site_office_type, farmland_classification, state_fips_code_alpha, texture_class, texture_modifier, unified_soil_classification, terms_used_in_lieu_of_texture, mapunit_hel_class, erosion_class, nh_important_forest_soil_group, logical_data_type_nasis, sort_type, site_index_curves, legend_suitability_for_use, mou_agency_responsible, ecological_site_mlra, mapunit_text_kind, legend_certification_status, dmu_certification_status, export_certification_status, hydric_soil_indicator, farmland_class_secondary, mapunit_type, cardinality_nasis, column_alignment, default_type, saf_cover_type, sort_direction, soil_type_conversion
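A sketch of how a future function might pick between ChoiceLabel and ChoiceName using the DisplayLabel flag. The table below is a toy: the DomainName column and all of the values are assumptions, not pulled from NASIS.

```r
# Toy metadata rows (hypothetical values); DisplayLabel = 1 means "use ChoiceLabel"
dom <- data.frame(
  DomainName   = c("texture_class", "texture_class", "drainage_class"),
  DisplayLabel = c(1L, 1L, 0L),
  ChoiceName   = c("sil", "cl", "well"),
  ChoiceLabel  = c("SIL", "CL", "Well drained"),
  stringsAsFactors = FALSE
)

# pick the text to show for each choice, row by row
dom$display <- ifelse(dom$DisplayLabel == 1L, dom$ChoiceLabel, dom$ChoiceName)
dom[, c("DomainName", "display")]
```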