bayeslite

Sort out confusing mappings between categorical variables

Open fsaad opened this issue 8 years ago • 0 comments

This issue relates to cgpm_metamodel (and more generally, any metamodel supporting categorical variables).

Suppose I have a CGPM x which outputs integer 0 for False and integer 1 for True.

Moreover, I provide a table to bayesdb

_rowid_ | x
--------+------
0       | True
1       | False
2       | True

and specify that x is CATEGORICAL. In this instance, the first value that appears is True, so the cgpm mapping is going to be {True: 0, False: 1}. This effectively inverts the results of simulate and produces extremely confusing output. Even more confusingly, we could (and did) have a situation like the following:

_rowid_ | x      | y
--------+--------+------
0       |  True  | False
1       |  False | False
2       |  True  | False

Now the mapping for x is going to be {True: 0, False: 1}, and the mapping for y is {False: 0, True: 1}.
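
To make the inversion concrete, here is a minimal sketch of a first-appearance encoding applied to the two columns above. It is not the actual bayeslite or cgpm code; `first_appearance_codes` is a hypothetical helper that only assumes codes are assigned in the order values are first seen.

```python
def first_appearance_codes(column):
    """Assign codes 0, 1, 2, ... to values in order of first appearance.

    Hypothetical helper for illustration; not part of bayeslite or cgpm.
    """
    codes = {}
    for value in column:
        if value not in codes:
            codes[value] = len(codes)
    return codes


# The two columns from the table above.
x = [True, False, True]
y = [False, False, False]

x_codes = first_appearance_codes(x)  # True is seen first, so True gets code 0
y_codes = first_appearance_codes(y)  # False is seen first, so False gets code 0

# Suppose the CGPM for x emits integer 1, which it intends to mean True.
# Decoding through the inverse of x_codes yields the opposite label.
x_labels = {code: label for label, code in x_codes.items()}
decoded = x_labels[1]
```

Under this assumption the same CGPM output decodes to a different label depending only on row order, which is the source of the confusion.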

The key issue here is that sometimes the CGPM outputs a variable whose labels have a semantic meaning, whereas other times the CGPM outputs a variable whose labels can be switched without changing the semantic meaning (a cluster id, for example).

A solution for the case above is to define a BINARY data type. But for a general categorical mapping, we probably need another statistical type which indicates that the labels mean something (but are not necessarily ordered, as they would be for an ordinal type).

One solution is to initialize the cgpm-lib with not only the statistical type but also a mapping from small integers to categories.
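
A rough sketch of what such an explicit mapping could look like is below. The class name `CategoricalCodec` and its methods are assumptions made for illustration, not an existing bayeslite or cgpm interface; the point is only that the integer-to-label mapping is supplied up front rather than inferred from row order.

```python
class CategoricalCodec(object):
    """Fixed mapping between small integer codes and category labels.

    Hypothetical sketch; the name and methods are not a real cgpm API.
    """

    def __init__(self, code_to_label):
        self.code_to_label = dict(code_to_label)
        self.label_to_code = {
            label: code for code, label in self.code_to_label.items()
        }
        # Two codes sharing one label would make encoding ambiguous.
        if len(self.label_to_code) != len(self.code_to_label):
            raise ValueError('labels must be unique')

    def encode(self, label):
        return self.label_to_code[label]

    def decode(self, code):
        return self.code_to_label[code]


# The mapping the CGPM itself uses: 0 means False, 1 means True.
codec = CategoricalCodec({0: False, 1: True})

# Encoding no longer depends on which value appears first in the table.
encoded_x = [codec.encode(v) for v in [True, False, True]]
encoded_y = [codec.encode(v) for v in [False, False, False]]
```

With the mapping fixed at initialization, both x and y share the same codes, and a simulated 1 always decodes to True.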

fsaad, Jul 11 '16 19:07