Add decimal place constraint to number fields

Open · ezwelty opened this issue 6 years ago · 5 comments

I am rewriting and publishing an existing dataset as a Data Package (https://gitlab.com/ezwelty/glathida), and it includes decimal place limits on several numeric fields. For now, I have enforced this using a pattern constraint:

"pattern": "^\\-?[0-9]*(\\.[0-9]{0,7})?$"

Unfortunately, this practice violates the schema, which currently insists that pattern apply only to the post-cast values of string fields (https://github.com/frictionlessdata/specs/issues/428). I understand the complexity that this decision avoids, but I regret the specificity that is lost. I wish pattern were applied to field values as they are stored in the text file (CSV, JSON, or otherwise).

Otherwise, I see no option other than adding a dedicated decimal place constraint for number fields.

ezwelty avatar Sep 06 '19 23:09 ezwelty

@ezwelty thanks for reporting. If I understand correctly, are you trying to constrain the raw data structure or the resulting number?

rufuspollock avatar Sep 09 '19 16:09 rufuspollock

@rufuspollock Fundamentally, the number as it is stored in the text file. Once it is read in, the concept of decimal places may be lost (for example, "1.10" could become 1.1 with no knowledge that it was parsed from "1.10"). The original dataset designers wanted to ensure that even if data (e.g. GPS coordinates) were submitted by contributors with absurd numbers of decimal places, they were published rounded to a reasonable number of decimal places.
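
For illustration, this is easy to see in Python: float parsing drops the trailing zero, while the decimal module happens to retain it:

from decimal import Decimal

print(float("1.10"))    # 1.1 -- the trailing zero is gone after casting
print(Decimal("1.10"))  # 1.10 -- the exponent still records two decimal places
print(Decimal("1.10").as_tuple().exponent)  # -2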

ezwelty avatar Sep 09 '19 17:09 ezwelty

OK, it's clear you want constraints applied before casting. Hmm, that seems like it would need something new, right? Do you have a generic suggestion for this?

rufuspollock avatar Sep 10 '19 12:09 rufuspollock

The simplest option I can think of is to allow the pattern constraint on all field types with string representations (so everything except JSON data stored as JSON, rather than as a string parsable as JSON). Each field is read as a string, and all values except those in missingValues are tested against pattern before casting to the target type.
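
A minimal Python sketch of that order of operations (the function and its arguments are placeholders, not a proposed API):

import re

def check_then_cast(raw, pattern, missing_values, cast):
    # Values listed in missingValues are never pattern-tested.
    if raw in missing_values:
        return None
    # pattern applies to the raw string, before casting.
    if not re.fullmatch(pattern, raw):
        raise ValueError(f"{raw!r} does not match {pattern!r}")
    return cast(raw)

# e.g. a number field with missingValues [""] and a 7-decimal-place limit:
print(check_then_cast("1.1234567", r"\-?[0-9]*(\.[0-9]{0,7})?", [""], float))  # 1.1234567
print(check_then_cast("", r"\-?[0-9]*(\.[0-9]{0,7})?", [""], float))           # None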

So for example, percentages with up to one decimal place (e.g. "95.2%" and "95%"):

{
  "type": "number",
  "decimalChar": ".",
  "bareNumber": false,
  "constraints": {
    "minimum": 0
  }
}

could be more specifically constrained by:

{
  "type": "number",
  "constraints": {
    "pattern": "^[0-9]+(\\.[0-9]{1})?%$"
  }
}
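
Assuming the pattern is tested against the raw string, this accepts "95.2%" and "95%" but rejects "95.25%"; a quick check in Python:

import re

pct = re.compile(r"^[0-9]+(\.[0-9]{1})?%$")
for raw in ["95.2%", "95%", "95.25%"]:
    print(raw, bool(pct.match(raw)))  # True, True, False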

and integer geopoints in the eastern hemisphere stored as a string parsable as a JSON array (e.g. "[90, 45]"):

{
  "type": "geopoint",
  "format": "array"
}

could be more specifically constrained by:

{
  "type": "geopoint",
  "format": "array",
  "constraints": {
    "pattern": "^\\[[0-9]{1,3}, \\-?[0-9]{1,2}\\]$"
  }
}

The one caveat I can think of – the one brought up by @pwalsh in https://github.com/frictionlessdata/specs/issues/428 – is how to deal with JSON data stored as JSON. I presume that JSON values, arrays, and objects could be either read as raw strings or converted back to strings for pattern testing? JSON objects quickly get unwieldy for pattern testing, but it can still be done. For example, integer geopoints in the eastern hemisphere stored as a JSON object:

{
  "type": "geopoint",
  "format": "object",
  "constraints": {
    "pattern": "^\\{\"lon\":\\s*[[0-9]{1,3},\\s*\"lat\":\\s*\\-?[0-9]{1,2}\\}$"
  }
}
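
One practical wrinkle with the "converted back to strings" route, illustrated in Python: re-serializing a parsed object is not canonical, since key order and whitespace depend on the writer, so a pattern like the one above really only makes sense against the raw text:

import json

# The same logical value can serialize to different strings:
print(json.dumps({"lon": 90, "lat": 45}))  # {"lon": 90, "lat": 45}
print(json.dumps({"lat": 45, "lon": 90}))  # {"lat": 45, "lon": 90}
print(json.dumps({"lon": 90, "lat": 45}, separators=(",", ":")))  # {"lon":90,"lat":45}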

ezwelty avatar Sep 10 '19 17:09 ezwelty

As discussed in #879, I think this is important, but using regex on numbers seems very wrong.

dafeder avatar Apr 16 '24 15:04 dafeder