bayeslite
Sort out confusing mappings between categorical variables
This issue relates to cgpm_metamodel (and more generally, any metamodel supporting categorical variables).
Suppose I have a CGPM `x` which outputs integer `0` for `False` and integer `1` for `True`. Moreover, I provide the following table to bayesdb:
_rowid_ | x
--------+------
0 | True
1 | False
2 | True
and specify that `x` is `CATEGORICAL`. In this instance, the first value that appears is `True`, so the cgpm mapping is going to be `{True: 0, False: 1}`, which effectively inverts the results of `simulate` and produces extremely confusing output. Even more confusingly, we could (and did) have a situation like the following:
_rowid_ | x | y
--------+--------+------
0 | True | False
1 | False | False
2 | True | False
Now the mapping for `x` is going to be `{True: 0, False: 1}`, and the mapping for `y` is `{False: 0, True: 1}`.
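To make the inversion concrete, here is a minimal sketch of a first-appearance encoding. It assumes codes are assigned in the order values are first seen in the column; `first_appearance_codes` is a hypothetical helper written for this illustration, not a bayeslite or cgpm function.

```python
def first_appearance_codes(column):
    """Assign integer codes to values in the order they first appear."""
    codes = {}
    for value in column:
        if value not in codes:
            codes[value] = len(codes)
    return codes


# Columns from the second table above.
x = [True, False, True]
y = [False, False, False]

x_codes = first_appearance_codes(x)  # True is seen first, so it gets code 0
y_codes = first_appearance_codes(y)  # False is seen first, so it gets code 0

# Suppose the CGPM simulates the integer 1, intending it to mean True.
# Decoding through the inverse of x's mapping turns it into False.
x_decode = {code: value for value, code in x_codes.items()}
decoded = x_decode[1]
```

Here `x_codes` and `y_codes` disagree on which integer stands for `False`, even though both columns hold the same kind of value, and a simulated `1` for `x` is decoded as `False`.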
The key issue here is that sometimes the CGPM outputs a variable in which the actual labels have a semantic meaning, whereas other times the CGPM outputs variables in which the labels can be switched without a change in the semantic meaning (like a cluster id for example).
A solution for the case above is to define a `BINARY` data type. But for a general `categorical` mapping, we probably need another statistical type which indicates that the labels mean something (but are not necessarily ordered, as with `ordinal`, etc.).
One solution is to initialize the cgpm-lib with not only the statistical type but also a mapping from small integers to categories.
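A rough sketch of what passing an explicit mapping could look like follows. The class name `FixedCategoricalCodec` and its methods are hypothetical and only illustrate the idea; they are not part of bayeslite or cgpm.

```python
class FixedCategoricalCodec:
    """Hypothetical codec: the caller fixes which integer means which label."""

    def __init__(self, categories):
        # The position in `categories` is the integer code for that label.
        if len(set(categories)) != len(categories):
            raise ValueError("categories must be unique")
        self._decode = dict(enumerate(categories))
        self._encode = {label: code for code, label in self._decode.items()}

    def encode(self, label):
        return self._encode[label]

    def decode(self, code):
        return self._decode[code]


# The same codec is used for every boolean column, so 0 always means False
# and 1 always means True, regardless of row order in the table.
codec = FixedCategoricalCodec([False, True])
x_encoded = [codec.encode(v) for v in [True, False, True]]
y_encoded = [codec.encode(v) for v in [False, False, False]]
simulated = codec.decode(1)
```

With the mapping supplied up front, the encoding no longer depends on which value happens to appear first, so a CGPM that emits `1` for `True` is decoded consistently.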