bayeslite
bayeslite copied to clipboard
Population variables lose case distinctions
This isn't itself quite a bug, but downstream things sometimes expect the case to match from queries, where sqlite3 preserves the original case of the variable in the underlying table, and from the results of bayesdb_variable_names
.
We should fix everything downstream to do case-insensitive comparisons (e.g.: https://github.com/probcomp/iventure/issues/38), but until then, we can work around those mistakes by using the original case in the bayesdb_variable
table. Won't help with existing bdb files, but will suppress downstream mistakes for any new bdb files.
Just hit this issue as a bug proper.
bdb.execute('''
CREATE POPULATION satellites_p for satellites_t WITH SCHEMA (
GUESS STATTYPES FOR (*)
)
''')
bdb.execute('''
CREATE METAMODEL satellites_m FOR satellites_p
WITH BASELINE crosscat (
OVERRIDE GENERATIVE MODEL FOR
anticipated_lifetime
GIVEN
class_of_orbit,
perigee_km,
apogee_km,
period_minutes,
date_of_launch,
launch_mass_kg,
dry_mass_kg
USING ordinary_least_squares
)
''')
bayeslite.exception.BQLError: Unmodellable variables:
set(['date_of_launch', 'apogee_km', 'class_of_orbit',
'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])
And the failure happens here:
https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1378-L1382
where printing needed
and modeled
gives:
modeled = set([u'Power_watts', u'longitude_radians_of_geo', u'Contractor',
u'Period_minutes', u'Country_of_Operator', u'Launch_Vehicle', u'Users',
u'Apogee_km', u'Operator_Owner', 'anticipated_lifetime',
u'Anticipated_Lifetime', u'Purpose', u'Dry_Mass_kg', u'Launch_Site',
u'Perigee_km', u'Date_of_Launch', u'Class_of_Orbit', u'Eccentricity',
u'Inclination_radians', u'Type_of_Orbit', u'Source_Used_for_Orbital_Data',
u'Launch_Mass_kg', u'Country_of_Contractor'
])
needed = set(['date_of_launch', 'apogee_km', 'class_of_orbit',
'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])
Also notice duplicate of anticipated_lifetime
and Anticipated_lifetime
(the regression variable). Not sure why this bug has not manifested before.
Actually, what broke this behavior was @riastradh-probcomp's fix https://github.com/probcomp/bayeslite/commit/d7c5b3fdb419a5ea2611057b881ea02331d27d2b to this issue, which appears to have caused the regression.
Update: git revert d7c5b3fdb419a5ea2611057b881ea02331d27d2b indeed fixed the issue with Latent variables reported above. Surprised why nothing in the test suite failed after the commit in question (probably because all cases were lower uniformly).
@riastradh-probcomp I'm going to revert this commit --- it has completely broken the create schema logic in https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1043, and refactoring that complex state machine to is likely to be more effort and introduce more bugs than updating downstream clients to either use the variable name from the base table and/or do case insensitive comparisons.
This state machine already appears to be problematic from a cursory glance. But it doesn't look like it needs refactoring -- it looks like it just needs a few casefolds inserted judiciously: replace clause.var
(s
) by casefold(clause.var)
(s
), and casefold the output of bayesdb_variable_names
.
Here's an example I expect to fail with the state machine as is:
create table t(x, y);
create population p for t;
create metamodel m for p with baseline crosscat(
latent Z categorical;
override model for y, z given x using piecewise
)
I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?
Date: Tue, 29 Aug 2017 19:29:28 +0000 (UTC) From: Jonathan Rees probably [email protected]
I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?
I don't think adding more configuration knobs is a good idea.
It seems to me:
(a) The table should retain the case of the CREATE TABLE columns. (b) The population should retain the case of the CREATE POPULATION variables. (c) Everything in bayeslite should be identified case-insensitively.
Since the implementation of JOIN is deferred to sqlite3, I am surprised that you are hitting anything about mismatched cases, because all names in sqlite3 are identified case-insensitively.
What reproduces the symptom you're referring to?
I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause.
Is = case insensitive as well?
Date: Tue, 29 Aug 2017 12:50:17 -0700 From: Jonathan Rees probably [email protected]
I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause.
Is = case insensitive as well?
On columns declared COLLATE NOCASE, yes. The semantics is described at https://www.sqlite.org/datatype3.html#collation. The default is COLLATE BINARY, which is case-sensitive. In the bayeslite metadata tables, e.g. bayesdb_population, names (population names, variable names, &c.) are stored with COLLATE NOCASE -- see src/schema.py for details.