bayeslite icon indicating copy to clipboard operation
bayeslite copied to clipboard

Population variables lose case distinctions

Open riastradh-probcomp opened this issue 7 years ago • 9 comments

This isn't itself quite a bug, but downstream things sometimes expect the case to match from queries, where sqlite3 preserves the original case of the variable in the underlying table, and from the results of bayesdb_variable_names.

We should fix everything downstream to do case-insensitive comparisons (e.g.: https://github.com/probcomp/iventure/issues/38), but until then, we can work around those mistakes by using the original case in the bayesdb_variable table. Won't help with existing bdb files, but will suppress downstream mistakes for any new bdb files.

riastradh-probcomp avatar May 02 '17 21:05 riastradh-probcomp

Just hit this issue as a bug proper.

        bdb.execute('''
            CREATE POPULATION satellites_p for satellites_t WITH SCHEMA (
                GUESS STATTYPES FOR (*)
            )
        ''')
        bdb.execute('''
            CREATE METAMODEL satellites_m FOR satellites_p
                    WITH BASELINE crosscat (
                OVERRIDE GENERATIVE MODEL FOR
                    anticipated_lifetime
                GIVEN
                    class_of_orbit,
                    perigee_km,
                    apogee_km,
                    period_minutes,
                    date_of_launch,
                    launch_mass_kg,
                    dry_mass_kg
                USING ordinary_least_squares
            )
        ''')
bayeslite.exception.BQLError: Unmodellable variables:
set(['date_of_launch', 'apogee_km', 'class_of_orbit',
    'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])

And the failure happens here: https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1378-L1382 where printing needed and modeled gives:

modeled = set([u'Power_watts', u'longitude_radians_of_geo', u'Contractor',
    u'Period_minutes', u'Country_of_Operator', u'Launch_Vehicle', u'Users',
    u'Apogee_km', u'Operator_Owner', 'anticipated_lifetime',
    u'Anticipated_Lifetime', u'Purpose', u'Dry_Mass_kg', u'Launch_Site',
    u'Perigee_km', u'Date_of_Launch', u'Class_of_Orbit', u'Eccentricity',
    u'Inclination_radians', u'Type_of_Orbit', u'Source_Used_for_Orbital_Data',
    u'Launch_Mass_kg', u'Country_of_Contractor'
])

needed = set(['date_of_launch', 'apogee_km', 'class_of_orbit',
    'launch_mass_kg', 'period_minutes', 'dry_mass_kg', 'perigee_km'
])

Also notice duplicate of anticipated_lifetime and Anticipated_lifetime (the regression variable). Not sure why this bug has not manifested before.

fsaad avatar Jun 02 '17 15:06 fsaad

Actually, what broke this behavior was @riastradh-probcomp's fix https://github.com/probcomp/bayeslite/commit/d7c5b3fdb419a5ea2611057b881ea02331d27d2b to this issue, which appears to have caused the regression.

Update: git revert d7c5b3fdb419a5ea2611057b881ea02331d27d2b indeed fixed the issue with Latent variables reported above. Surprised why nothing in the test suite failed after the commit in question (probably because all cases were lower uniformly).

fsaad avatar Jun 02 '17 15:06 fsaad

@riastradh-probcomp I'm going to revert this commit --- it has completely broken the create schema logic in https://github.com/probcomp/bayeslite/blob/master/src/metamodels/cgpm_metamodel.py#L1043, and refactoring that complex state machine to is likely to be more effort and introduce more bugs than updating downstream clients to either use the variable name from the base table and/or do case insensitive comparisons.

fsaad avatar Jun 05 '17 17:06 fsaad

This state machine already appears to be problematic from a cursory glance. But it doesn't look like it needs refactoring -- it looks like it just needs a few casefolds inserted judiciously: replace clause.var(s) by casefold(clause.var)(s), and casefold the output of bayesdb_variable_names.

riastradh-probcomp avatar Jun 06 '17 19:06 riastradh-probcomp

Here's an example I expect to fail with the state machine as is:

create table t(x, y);
create population p for t;
create metamodel m for p with baseline crosscat(
    latent Z categorical;
    override model for y, z given x using piecewise
)

riastradh-probcomp avatar Jun 06 '17 19:06 riastradh-probcomp

I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?

jar600 avatar Aug 29 '17 19:08 jar600

Date: Tue, 29 Aug 2017 19:29:28 +0000 (UTC) From: Jonathan Rees probably [email protected]

I'm running against this problem too. I want to do a JOIN of a dependence probability table and a table read from a CSV file, and the cases aren't matching. If there's some reason to keep this (mis)feature, maybe it could be a CREATE POPULATION option, so you could decide at that point whether you want folding or not?

I don't think adding more configuration knobs is a good idea.

It seems to me:

(a) The table should retain the case of the CREATE TABLE columns. (b) The population should retain the case of the CREATE POPULATION variables. (c) Everything in bayeslite should be identified case-insensitively.

Since the implementation of JOIN is deferred to sqlite3, I am surprised that you are hitting anything about mismatched cases, because all names in sqlite3 are identified case-insensitively.

What reproduces the symptom you're referring to?

riastradh-probcomp avatar Aug 29 '17 19:08 riastradh-probcomp

I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause.

Is = case insensitive as well?

jar600 avatar Aug 29 '17 19:08 jar600

Date: Tue, 29 Aug 2017 12:50:17 -0700 From: Jonathan Rees probably [email protected]

I think I made an assumption, because it just didn't occur to me that sqlite3 would do case insensitive comparison. I was seeing a very slow JOIN and thought this was a likely cause. But if what you say it true, then it has a different cause.

Is = case insensitive as well?

On columns declared COLLATE NOCASE, yes. The semantics is described at https://www.sqlite.org/datatype3.html#collation. The default is COLLATE BINARY, which is case-sensitive. In the bayeslite metadata tables, e.g. bayesdb_population, names (population names, variable names, &c.) are stored with COLLATE NOCASE -- see src/schema.py for details.

riastradh-probcomp avatar Aug 29 '17 20:08 riastradh-probcomp