Allow any type for `true/falseValues`?
For some reason this has been limited to strings only, and there's no actual need for that. There are quite a few cases where boolean values are represented with 0/1 (integers) in the source data.
I propose removing the type restriction here, so that you could specify (as an example):
{
  "name": "my_boolean",
  "type": "boolean",
  "trueValues": [1],
  "falseValues": [0]
}
/cc @roll @zelima
In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below.
Also, this section seems bound to the string representation.
Yes, I know, except a physical representation is not necessarily a string...
The physical representation of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation may have some type information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).
I think this is an error in the spec - there's an issue I opened there.
It took me a while to think about it, but based on the current Table Schema's concept of `physical` (I'm not sure `physical` is a good word here; it's more like `textual`) and `logical` separation, it seems to me that this issue needs to be closed as `wontfix`.
Of course, there are use cases where boolean fields are represented with integers, but strictly speaking, if a logical value is an integer, it must be marked invalid against a boolean field. By my understanding, value substitution is part of the data casting process, and we don't do data casting in general for already typed data.
If the above is wrong, I think the change should affect `missingValues` as well.
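To make the casting argument concrete, here is a minimal Python sketch of the strict "textual" reading. The function name `cast_boolean` and the default lists are assumptions for illustration only, not the behavior of any particular implementation: strings go through value substitution, already-typed booleans pass through, and everything else (including the integers 0 and 1) is rejected.

```python
from typing import Any, Sequence

# Illustrative defaults (an assumption); a real schema would supply its own lists.
DEFAULT_TRUE_VALUES = ("true", "True", "TRUE", "1")
DEFAULT_FALSE_VALUES = ("false", "False", "FALSE", "0")


def cast_boolean(
    value: Any,
    true_values: Sequence[str] = DEFAULT_TRUE_VALUES,
    false_values: Sequence[str] = DEFAULT_FALSE_VALUES,
) -> bool:
    """Cast a cell to a boolean under the strict string-only reading."""
    # Already-typed booleans (e.g. read from JSON) pass through untouched.
    if isinstance(value, bool):
        return value
    # Only strings go through value substitution.
    if isinstance(value, str):
        if value in true_values:
            return True
        if value in false_values:
            return False
    # Anything else, including the integers 0 and 1, is invalid.
    raise ValueError(f"{value!r} is not a valid boolean")
```

Under this reading, `cast_boolean("1")` succeeds while `cast_boolean(1)` raises, which is exactly the behavior the original proposal wants to relax.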
I'll ask the WG to discuss this issue.
The PR https://github.com/frictionlessdata/datapackage/pull/5 allows non-string values. Do I understand correctly that this issue is solved, then?
Hi @peterdesmet,
Sorry for the confusion. https://github.com/frictionlessdata/datapackage/pull/5 has been reverted and the issue has been returned to discussion (I was misled by #864).
If I understand correctly, the suggestion by @roll is to keep requiring the values provided in `missingValues`, `trueValues` and `falseValues` to be strings.
If so, I'm fine with that suggestion.
@roll I think @akariv's suggestion is preferable
This is actually a very good example of why the distinction between logical and representation (physical, lexical...) values is so important, and how making that distinction would have made this issue very simple to resolve.
Generally we use two types of values in the spec, without distinction - on the one hand, we talk about logical values a lot. For example, a `datetime` value points to a specific point in time. A `number` will point to a specific point on the real number line. A `boolean` can have two distinct logical values, a truthy one and a falsey one. And so on and so forth.
The representation of these values in data files that may be described by a data package might also vary. While we're used to thinking about CSV files, where the issues are usually dates with various formats or the decimal character of a number, other data formats use different data types (i.e. not strings) to represent data. For example, Excel might use an integer to represent dates or booleans. JSON files will have native boolean values, but still need help when dates need to be decoded.
In the spec we sometimes refer to the logical value - for example, when defining constraints for the max value of a number, we don't care how it was represented, only its logical value. In other cases, we refer to the representation of the value - for example, when defining the 'missingValues' field, we declare which representations of values should be ignored.
What we need to do is to specify, in every location where a value is to be given, whether it's a logical value or a representation value, and have the same rules apply for all instances of the same kind. Obviously, this specific issue requires a representation value, so if we decide that we allow Excel files to be described by data packages, I think the conclusion should be pretty clear on what the correct solution should be here :)
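A rough Python sketch of that two-level split, under stated assumptions: `read_number` is a hypothetical helper, and accepting non-string entries in `missing_values` is the proposal under discussion rather than current spec behavior. The missing-value check runs on the raw representation, while the maximum constraint runs on the logical value after casting.

```python
from typing import Any, Optional, Sequence


def read_number(raw: Any, missing_values: Sequence[Any], maximum: float) -> Optional[float]:
    """Hypothetical reader separating representation checks from logical checks."""
    # Representation level: compare the raw cell exactly as it appears in the
    # source. The type-strict match keeps the string "-999" and the integer
    # -999 distinct.
    if any(type(raw) is type(m) and raw == m for m in missing_values):
        return None
    # Cast from the representation to the logical value.
    logical = float(raw)
    # Logical level: the constraint does not care how the value was written.
    if logical > maximum:
        raise ValueError(f"{logical} exceeds maximum {maximum}")
    return logical
```

With this split, `"1e1"` and `"10.0"` are different representations of the same logical value and are treated identically by the constraint, while `missing_values` only ever matches what is literally stored in the source.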
@akariv I appreciate your distinctions & definitions here. I think making the distinction between logical and representation values is especially salient re: value labels in categorical / ordinal types (as discussed in #844).
Conceptually, a boolean logical type with true/falseValues is a special case of an ordinal / categorical logical type with value labels (i.e. a binary categorical variable with logical values "true" and "false").
So as we consider the approach & language to adopt here with boolean types in distinguishing logical vs representations (i.e. label vs underlying representation values), I think it would be good to be thinking about how the decisions here generalize to categorical / ordinal types and value labels so we can keep the approach / language there consistent.
Right now the categorical extension defines a mapping between representation and label in `enumLabels`, where the representation values are all strings (which is consistent with how strings are provided for `missingValues`, `trueValues`, and `falseValues`). Allowing any type for the keys of `enumLabels` opens a can of worms, because as a JSON object its keys need to be strings.
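The key restriction is easy to see with Python's standard `json` module. The `labels` mapping below is a made-up example, not taken from the extension itself: integer keys are silently turned into strings on encoding, so the original type cannot be recovered on decoding.

```python
import json

# Hypothetical enumLabels-style mapping keyed by integer representation values.
labels = {0: "no", 1: "yes"}

encoded = json.dumps(labels)
decoded = json.loads(encoded)

print(encoded)        # {"0": "no", "1": "yes"}
print(list(decoded))  # ['0', '1']
```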
Another potential point of logical / representation confusion is labels on missing values. If the logical missing label "PARTICIPANT_SKIPPED_ITEM" is represented in the data as -999, should we define `missingValues = [-999]` or `missingValues = ["PARTICIPANT_SKIPPED_ITEM"]`?
I'm 100% with you in the idea that making the distinction between logical & representation values from the get-go would have made this issue (and the value labels situation by extension) easy to solve. I think the challenge here is to balance conceptual correctness / internal consistency / historical compatibility. So rather than doing the surgery that would be required across the spec to represent Excel files or other binary formats "correctly", I think it might be easier to just let a limitation of the datapackage format be that it is designed for underlying textual data representations... That way, "representation values" (e.g. `missingValues`, `trueValues`, `falseValues`, `enumLabels` keys) are always strings.
Not ideal, I know (especially for representing floating point values!). But given that we're not allowing breaking changes in V2, it kind of limits our ability for deep surgery... I'm open to more thoughts / brainstorm though!