Include dimension names in variable specifications
Related to #868 and #893
I'm wondering if we can start checking the dimension names of variables, using warnings so that the checks aren't too heavy-handed. This could potentially avoid manual comparison of array shapes and avoid issues like #868. It could also make the variable documentation a bit more useful.
I'd suggest that we add dimension names as part of each variable's specification. Outer dimensions could be optional, and a wildcard could be used when a dimension is not strictly defined. Dimension names could then be checked within the validate_variables function and a DimensionWarning issued if there is a mismatch.
I also noticed that some dimension constants were initially created here, but have mostly been specified using raw strings in other files.
Here's a rough idea of what the dimension checking code could look like:
import warnings
class DimensionWarning(UserWarning):
    """Warning about dimension mismatches."""
DIM_OPTION = None
DIM_WILDCARD = "*"
DIM_VARIANT = "variants"
DIM_WINDOWS = "windows"
DIM_CONTIG = "contigs"
DIM_SAMPLE = "samples"
DIM_COHORT = "cohorts"
DIM_PLOIDY = "ploidy"
DIM_ALLELE = "alleles"
DIM_GENOTYPE = "genotypes"
DIM_FILTER = "filters"
DIMS_POPULATION = {DIM_COHORT, DIM_SAMPLE}
DIMS_GENOMIC = {DIM_CONTIG, DIM_WINDOWS, DIM_VARIANT}
def dimension_match(dim, spec):
    # A spec is either a single dimension name or a set of allowed names.
    if isinstance(spec, set):
        return (dim in spec) or (DIM_WILDCARD in spec)
    else:
        return (dim == spec) or (DIM_WILDCARD == spec)
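def validate_dims(dims, spec):
    message = "Dimensions {} do not match {}".format(dims, spec)
    n_obs, n_exp = len(dims), len(spec)
    if n_obs > n_exp:
        # More dimensions than the spec allows: warn and stop, since the
        # remaining checks assume n_obs <= n_exp.
        warnings.warn(message, DimensionWarning)
        return
    # Any leading dimensions absent from dims must be marked optional.
    diff = n_exp - n_obs
    for i in range(diff):
        # Only a set can contain DIM_OPTION; a plain string is never optional.
        if not (isinstance(spec[i], set) and DIM_OPTION in spec[i]):
            warnings.warn(message, DimensionWarning)
            return
    for dim, exp in zip(dims, spec[diff:]):
        if not dimension_match(dim, exp):
            warnings.warn(message, DimensionWarning)
            return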
# no warning
validate_dims(("windows", "samples", "ploidy"), (DIMS_GENOMIC, DIM_SAMPLE, DIM_PLOIDY))
# no warning due to wildcard
validate_dims(("unknown", "samples", "ploidy"), (DIM_WILDCARD, DIM_SAMPLE, DIM_PLOIDY))
# no warning due to option
validate_dims(("samples", "ploidy"), (DIMS_GENOMIC | {DIM_OPTION}, DIM_SAMPLE, DIM_PLOIDY))
# warning due to first dimension mismatch
validate_dims(("unknown", "samples", "ploidy"), (DIMS_GENOMIC, DIM_SAMPLE, DIM_PLOIDY))
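To illustrate how the check might hang off the variable specifications themselves, here is a minimal self-contained sketch. It is not the project's actual API: `VariableSpec`, `dims_ok`, `check_variable_dims`, the `observed` mapping and the variable names `call_genotype` and `stat` are all hypothetical stand-ins, and a real version would presumably read the dimension names from the dataset inside validate_variables.

```python
import warnings
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple, Union

# One entry per dimension: a name, a set of allowed names, or "*" for any.
# None inside a set marks a leading dimension as optional.
DimSpec = Union[str, Set[Optional[str]]]


class DimensionWarning(UserWarning):
    """Warning about dimension mismatches."""


@dataclass
class VariableSpec:
    """Hypothetical variable specification that carries dimension names."""

    name: str
    dims: Tuple[DimSpec, ...]


def _matches(dim: str, spec: DimSpec) -> bool:
    allowed = spec if isinstance(spec, set) else {spec}
    return dim in allowed or "*" in allowed


def _is_optional(spec: DimSpec) -> bool:
    return isinstance(spec, set) and None in spec


def dims_ok(dims: Tuple[str, ...], spec: Tuple[DimSpec, ...]) -> bool:
    n_missing = len(spec) - len(dims)
    if n_missing < 0:
        return False
    # Only leading (outer) dimensions may be absent, and only if optional.
    if not all(_is_optional(s) for s in spec[:n_missing]):
        return False
    return all(_matches(d, s) for d, s in zip(dims, spec[n_missing:]))


def check_variable_dims(observed: Dict[str, Tuple[str, ...]], specs) -> None:
    """Warn once per variable whose dimension names do not match its spec."""
    for spec in specs:
        dims = observed.get(spec.name)
        if dims is not None and not dims_ok(dims, spec.dims):
            warnings.warn(
                f"{spec.name}: dimensions {dims} do not match {spec.dims}",
                DimensionWarning,
            )


# Illustrative specs; the names are assumptions, not taken from the codebase.
call_genotype = VariableSpec("call_genotype", ("variants", "samples", "ploidy"))
stat = VariableSpec("stat", ({"variants", "windows", None}, "samples"))

observed = {
    "call_genotype": ("variants", "samples", "alleles"),  # last name is wrong
    "stat": ("samples",),  # optional outer dimension omitted
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_variable_dims(observed, [call_genotype, stat])

mismatched = [str(w.message).split(":")[0] for w in caught]
print(mismatched)  # ['call_genotype']
```

Keeping the match logic in a function that returns a bool, and warning in one place, would also make it easy to switch from warnings to errors later if the checks prove reliable.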
I like this idea.
> I also noticed that some dimension constants were initially created here, but have mostly been specified using raw strings in other files.
I think we decided raw strings are a bit more readable and any typos would cause an error quickly enough.