soda-core
Order of categories influences chi_square statistic
Steps to reproduce
- Create a simple dataset with a 2:1 ratio of Alice rows to Bob rows
data.sql
For some reason, I was unable to run Soda with fewer rows.
create table Employee (
    id int primary key,
    name varchar(255)
);
insert into Employee (id, name) values (1, 'Alice');
insert into Employee (id, name) values (2, 'Bob');
insert into Employee (id, name) values (3, 'Alice');
insert into Employee (id, name) values (11, 'Alice');
insert into Employee (id, name) values (12, 'Bob');
insert into Employee (id, name) values (13, 'Alice');
insert into Employee (id, name) values (21, 'Alice');
insert into Employee (id, name) values (22, 'Bob');
insert into Employee (id, name) values (23, 'Alice');
insert into Employee (id, name) values (31, 'Alice');
insert into Employee (id, name) values (32, 'Bob');
insert into Employee (id, name) values (33, 'Alice');
insert into Employee (id, name) values (41, 'Alice');
insert into Employee (id, name) values (42, 'Bob');
insert into Employee (id, name) values (43, 'Alice');
insert into Employee (id, name) values (51, 'Alice');
insert into Employee (id, name) values (52, 'Bob');
insert into Employee (id, name) values (53, 'Alice');
- Run the following check
checks for Employee:
  - row_count = 18
  - distribution_difference(name) < 0.05:
      method: chi_square
      distribution reference file: ./distribution.yaml
with distribution.yaml:
dataset: employee
column: name
distribution_type: categorical
distribution_reference:
  weights:
    - 0.7
    - 0.3
  bins:
    - Alice
    - Bob
Expected behavior
The chi_square statistic should be close to zero, since there are 12 Alice rows and 6 Bob rows, which is close to the 0.7/0.3 reference weights.
Actual behavior
The statistic value is high (~0.6).
Misc
When I swap the order of the weights but leave the bins unchanged, the statistic is as expected.
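To illustrate why the pairing of weights and bins matters, here is a minimal sketch that computes Pearson's chi-square by hand for the counts in this report. It is not Soda's implementation: the `chi_square` helper and the dictionary-based alignment are assumptions made for illustration, and the p-value shortcut only holds for two categories (one degree of freedom).

```python
import math

def chi_square(observed, expected_weights):
    """Pearson chi-square statistic and p-value for two categories (df = 1)."""
    total = sum(observed)
    expected = [w * total for w in expected_weights]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Observed counts from data.sql: 12 Alice rows and 6 Bob rows.
observed = {"Alice": 12, "Bob": 6}

# Reference weights from distribution.yaml, keyed by bin name.
reference = {"Alice": 0.7, "Bob": 0.3}

bins = ["Alice", "Bob"]
counts = [observed[b] for b in bins]

# Weights paired with the bins they belong to.
aligned = chi_square(counts, [reference[b] for b in bins])

# Hypothetical misalignment: the same weights paired with the bins in reverse.
swapped = chi_square(counts, [reference[b] for b in reversed(bins)])

print("aligned:", aligned)
print("swapped:", swapped)
```

With correctly paired weights the statistic is about 0.095; with the weights reversed it is about 11.5. Neither matches the ~0.6 reported above, so this sketch only shows how sensitive the result is to category order and does not reproduce Soda's output.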
CLOUD-8980