
Be able to specify the maximum number of unique values for an enum

Open spullara opened this issue 3 years ago • 2 comments

I am getting text search instead of an enum by default for a column that has 117 unique values (out of the 18k or so samples provided).

spullara avatar Jan 25 '22 02:01 spullara

hi @spullara. You should definitely be able to configure the max unique values so that your column with 117 unique values would be an enum column. Currently, the only way to do that is to pass a config file with the column name, type, and a list of all of the variants. There are two potential implementations that would achieve what you want:

  1. In the config file, allow passing a JSON object that includes the CSV infer options:
#[derive(Clone)]
pub struct FromCsvOptions<'a> {
	pub column_types: Option<BTreeMap<String, TableColumnType>>,
	pub infer_options: InferOptions,
	pub invalid_values: &'a [&'a str],
}

impl<'a> Default for FromCsvOptions<'a> {
	fn default() -> FromCsvOptions<'a> {
		FromCsvOptions {
			column_types: None,
			infer_options: InferOptions::default(),
			invalid_values: DEFAULT_INVALID_VALUES,
		}
	}
}

#[derive(Clone, Debug)]
pub struct InferOptions {
	pub enum_max_unique_values: usize,
}

impl Default for InferOptions {
	fn default() -> InferOptions {
		InferOptions {
			enum_max_unique_values: 100,
		}
	}
}
  2. Allow passing the column name and type without forcing the user to list every unique variant.

I think option 2 is probably closer to the interface you might be looking for. This way you get to configure the type per column but don't have to pass all of the variants, which is cumbersome for enums with a large number of options.
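To make the two options concrete, here is a minimal sketch of how they could interact during inference. It is not the actual ModelFox code: `InferredType`, `ColumnOverride`, and `infer_column` are hypothetical names made up for illustration, and treating the threshold as inclusive (`<=`) is an assumption. Only the `enum_max_unique_values` name and its default of 100 come from the snippet above.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical result of inference for a single column.
#[derive(Debug, PartialEq)]
enum InferredType {
    Enum { variants: Vec<String> },
    Text,
}

// Hypothetical per-column override from the config file (option 2):
// the user names the type but does not list the variants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ColumnOverride {
    Enum,
    Text,
}

// Decide the type of one column.
// An explicit override wins; otherwise the column is an enum only if its
// number of unique values is within `enum_max_unique_values` (option 1).
fn infer_column(
    name: &str,
    values: &[&str],
    enum_max_unique_values: usize,
    overrides: &BTreeMap<String, ColumnOverride>,
) -> InferredType {
    // Collect the distinct values seen in the column, in sorted order.
    let unique: BTreeSet<&str> = values.iter().copied().collect();
    let as_enum = match overrides.get(name) {
        Some(ColumnOverride::Enum) => true,
        Some(ColumnOverride::Text) => false,
        // Assumption: the threshold is inclusive.
        None => unique.len() <= enum_max_unique_values,
    };
    if as_enum {
        // The variants are discovered from the data, not supplied by the user.
        InferredType::Enum {
            variants: unique.into_iter().map(String::from).collect(),
        }
    } else {
        InferredType::Text
    }
}
```

With a threshold of 100 and no override, a column with 117 unique values comes back as `Text` in this sketch, which matches the behavior reported above. Raising the threshold to at least 117, or adding a `ColumnOverride::Enum` entry for that column, yields an `Enum` whose variants are collected from the data.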

isabella avatar Jan 26 '22 16:01 isabella

I think just labelling the column an enum without having to list the values would be great.

spullara avatar Jan 27 '22 23:01 spullara