
Database field types need review/revision but suggest some peer discussion/review.

Open corkyk opened this issue 10 years ago • 12 comments

Today I ported the MySQL db tables and data to a PostgreSQL db. While doing so I was manually revising the field data types to fit Postgres and noticed they are wildly inconsistent even in the existing MySQL db definitions. For example, table ids tend to use a variety of int types and sizes, and child relation keys in other tables often do not match the type of the parent. I think it's time for a review of the db fields and some cleanup. I also think such a decision should be agreed to by core developers before a willy-nilly update is applied. Concerns include choosing the optimal field types for various kinds of fields, and agreeing on a consistent application of types for future additions. Also, this might provide an opportunity to make the entire db model more platform agnostic for MySQL/Postgres/Oracle, or whatever comes next in db land that CI supports.

corkyk avatar May 28 '14 23:05 corkyk

I'm fully open to suggestions in this area, and I'm sure @lonnieezell would be as well. One of the issues I see right away is that the common ID field definition generated by the module builder is:

'id' => array(
    'type'           => 'int',
    'constraint'     => 11,
    'auto_increment' => true,
),

and you would presumably use the same type/constraint on any foreign key.

However, many of the core tables in bonfire use bigint (e.g. users and activities). I probably wouldn't mind changing them in the long-term, but many databases will complain loudly if you attempt to migrate a populated field to a smaller data type, and I cringe about what may happen if someone had a use-case for a bigint in those fields.

We also need to consider which data types are available on as many platforms as possible (and are compatible across those platforms).

I'd definitely like to hear what changes needed to be made to get things working in postgreSQL. It would certainly help in identifying areas that need work, as well as identifying potential solutions. I've personally always intended to get the code to a point where I could, if I had to, migrate my site's database to MS SQL Server as smoothly as possible. While the situation is better than when I started, I don't think it would be a smooth transition.

mwhitneysdsu avatar May 29 '14 14:05 mwhitneysdsu

@corkyk @mwhitneysdsu @lonnieezell

Corky and I have made some decent progress on this task. Currently the bonfire/admin area is running without any errors. I have not tried inserting new records yet, but that is next.

I have a file with the ALTER statements to the bf_ tables we had to run for using postgres, and we have some fixes for bonfire/core files.

How would you like me to proceed with committing these? Is simply pushing to my fork the best way?

https://github.com/cpcbell/Bonfire

Thanks all!

Clay

cpcbell avatar May 29 '14 19:05 cpcbell

If you push it to your fork, we can at least look over the changes and make some decisions on what can be pulled in as-is vs. what needs to be modified before it gets merged.

The alter statements can be added as a migration or you can just put them up as a gist.

mwhitneysdsu avatar May 29 '14 19:05 mwhitneysdsu

Some additional thoughts: I went through and reviewed the schema of Bonfire's tables. I agree that there are quite a few things that could at least be made more consistent across tables.

  • The fields for email addresses in the users and email_queue tables are different sizes, and neither is large enough to hold the maximum valid size of an email address (254 characters).
  • The sessions table is (more or less) out of our hands, but we should be able to pull an IP Address from the sessions table and store it in the users or login_attempts table, if necessary (and it may be worthwhile to look into the maximum size of an IPv6 address to ensure there is enough space in our tables to store them, even if the sessions table may not be able to do so).
  • The activities and settings table store module names at 255 and 50 character lengths, respectively, and there are a couple of cases in which they use some variances in convention ('core.ui' and 'any' appear in my data).
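
For reference, the arithmetic behind those two sizes can be checked directly (a Python sketch, not Bonfire code; the 254-character cap comes from RFC 5321, and the 45-character string is the longest textual IP form, an IPv4-mapped IPv6 address written out in full):

```python
# RFC 5321 caps a complete email address at 254 characters, so
# VARCHAR(254) is the minimum safe width for an email column.
MAX_EMAIL_LENGTH = 254

# The longest textual IP form: 6 groups of 4 hex digits, 6 colons,
# and a dotted-quad IPv4 tail (24 + 6 + 15 = 45 characters).
longest_ip = "0000:0000:0000:0000:0000:ffff:255.255.255.255"
print(len(longest_ip))  # 45 -> VARCHAR(45) covers any IPv4 or IPv6 string
```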

There is some variance in the deleted fields in various tables, which I believe is, in part, due to some question over whether the field should permit a UNIX timestamp vs. a 0/1 value (and in some cases code checks for a deleted value > 0 for this reason).

Most status fields are simply flags (TINYINT(1) treated as a boolean 0/1), but permissions uses an enum status field, and one of the three statuses is 'deleted'.

MySQL's integer constraints define the display width of the field (and are hopefully ignored on most platforms), but it doesn't make a lot of sense that the activities.deleted field is specified as TINYINT(12), when MySQL's maximum value for TINYINT is 127 or 255 (so the maximum valid display width would be 4, if a negative value is stored in the field, or 3 for unsigned), unless there is some code in the system that is using the display width to pad the value out to a width of 12 characters... In contrast, the roles.deleted field is defined as INT(1).
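
To illustrate why that display width is meaningless (a Python sketch; MySQL's TINYINT is one byte regardless of the constraint):

```python
# MySQL's "(n)" on integer types is only a display width, not a range.
# TINYINT stores exactly 1 byte whether declared TINYINT(1) or TINYINT(12).
signed_tinyint_max   = 2**7 - 1   # 127
unsigned_tinyint_max = 2**8 - 1   # 255
print(signed_tinyint_max, unsigned_tinyint_max)  # 127 255

# The widest value a TINYINT can render is "-128" (4 characters signed,
# 3 unsigned), so a display width of 12 buys nothing.
```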

An activity, user, or cookie uses a DATETIME created_on field, and most other fields to track changes use DATETIME fields as well (though the naming is dependent on what they are tracking), but:

  • login attempts use a TIMESTAMP time field (and the login_attempts.time field defaults to CURRENT_TIMESTAMP, which probably won't work in most databases)
  • the user's display_name_changed field is a DATE
  • again, though sessions is mostly out of our hands, it uses an UNSIGNED INT(10) to track the last_activity (UNIX timestamp).

The login_attempts table stores the login used as a VARCHAR(50), but there is no length constraint on the form and the actual login could be a username (VARCHAR(30)) or email address (VARCHAR(120) for users.email, but email addresses can certainly be longer).

The users.reset_by field isn't clear in intent, even when you know it's an INT(10), but this stores a UNIX timestamp (which means it should probably be UNSIGNED INT(10)).

The users.timezone field uses CI's timezone references, and is defined as CHAR(4). A couple of the not-so-well documented timezone references CI produces use 5 or 6 characters, and, of course, CI 3.x does away with these timezones in favor of the PHP timezones.
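
A quick illustration of the size problem (a Python sketch; the sample names below are real IANA zone identifiers, chosen only for illustration, of the style PHP and CI 3.x use):

```python
# CHAR(4) cannot hold PHP/IANA timezone identifiers, which are
# variable-length strings and can run well past 30 characters.
zones = [
    "UTC",
    "America/New_York",
    "America/Argentina/ComodRivadavia",  # one of the longest IANA names
]
print(max(len(z) for z in zones))  # 32
```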

There are a handful of other oddities, too, like the description field in the permissions table being shorter than the name field.

One of the ugliest issues, though, is the user ID I mentioned earlier:

  • The users table defines it as UNSIGNED BIGINT(20) (users.id)
  • The foreign keys in the activities and user_cookies table are defined as BIGINT(20) (activities.user_id and user_cookies.user_id)
  • The foreign key in the user meta table defines it as INT(20) (user_meta.user_id)

*I'm using the term "foreign key" loosely here, since the relationship isn't really defined.

If there's going to be an issue with the user ID, it will likely show up in the user meta first, given that the maximum value of an INT is significantly smaller than a BIGINT (because, remember, that 20 constraint is a display width and doesn't change the actual values that can be stored). Generally, though, an UNSIGNED variable still gives you a significant increase in the maximum value you can store, so even the user cookies and activities tables would eventually be outpaced (though that's a lot of users).

Another consideration is that the activity ID is a BIGINT(20), so just creating enough users to outpace the user ID reference on the activities table would also create more activities than the activities table can store (assuming you didn't reset the activity_id or disable activity logging during creation of the users). You also wouldn't be able to store even a single piece of meta data for each user, given that the user meta ID is an INT(20).
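
The actual ceilings behind those definitions are easy to compute (a Python sketch; remember that the (20) constraint is only a display width):

```python
# Real storage limits behind the user_id columns discussed above.
int_signed_max      = 2**31 - 1   # user_meta.user_id (INT)
int_unsigned_max    = 2**32 - 1   # if that column were UNSIGNED INT
bigint_signed_max   = 2**63 - 1   # activities.user_id (BIGINT)
bigint_unsigned_max = 2**64 - 1   # users.id (UNSIGNED BIGINT)

print(int_signed_max)  # 2147483647 -- user_meta overflows first
# UNSIGNED roughly doubles the ceiling at each size:
print(bigint_unsigned_max == 2 * bigint_signed_max + 1)  # True
```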

mwhitneysdsu avatar May 29 '14 19:05 mwhitneysdsu

Yep, all of that ^. Quite a good analysis. As my co-worker @cpcbell (Clay) pointed out, we already have some changes to suggest for improved postgres compatibility, in his fork. Looking forward to your additional feedback on those. I think they would all be safe to merge, with the one caveat you identified about potentially reducing int field sizes on previously deployed systems. I enjoyed your coverage above; it's all spot on.

corkyk avatar May 29 '14 22:05 corkyk

Yes, I am definitely open to changes in this area. And I think that @mwhitneysdsu has done an excellent run down as usual. These changes are definitely something that should be done and having it work out of the box with PostgreSQL would be icing on the cake.

lonnieezell avatar May 30 '14 02:05 lonnieezell

OK, I'm looking through the fork, but so far the only thing I've found is this: https://github.com/cpcbell/Bonfire/commit/234b5dc774171363ba8f5dba9e36ec59ab2868f9

The error mentioned actually sounds familiar to me (the last two projects I worked on before my current job used MS SQL and Oracle), so, once I test the change against MySQL it should be no problem to include it.

Do you get a similar error on /admin/settings/roles? The user model does something similar in the count_by_roles() method, and the roles controller calls it in the index.

Edit: I've tested the patch to the activity_model, it seems to work fine. If you can submit a pull request we can merge it.

mwhitneysdsu avatar May 30 '14 15:05 mwhitneysdsu

Hey @mwhitneysdsu I haven't had a chance to commit the bulk of our changes to my fork. I will try my best to do it today, and if not, tomorrow. Thx!

cpcbell avatar May 31 '14 13:05 cpcbell

I started putting together a migration file to update just the database changes that should not require code changes. When looking for some information on which types would be available in postgre, I found that the postgre_forge._create_table() method had been updated to (more or less) convert common MySQL types to postgre-compatible types. This is great, except that postgre_forge._alter_table() doesn't do the same thing.

The type conversions performed are:

| MySQL             | Postgre          |
|-------------------|------------------|
| TINYINT           | SMALLINT         |
| UNSIGNED SMALLINT | INTEGER          |
| MEDIUMINT         | INTEGER          |
| UNSIGNED INT      | BIGINT           |
| INT               | INTEGER          |
| UNSIGNED BIGINT   | NUMERIC          |
| DOUBLE            | DOUBLE PRECISION |
| DATETIME          | TIMESTAMP        |
| LONGTEXT          | TEXT             |
| BLOB              | BYTEA            |

Additionally, all primary keys are converted to SERIAL if they are AUTO_INCREMENT fields, and constraints are removed from INT fields.

I'm still trying to determine how best to handle this, in addition to looking through the other drivers to determine whether this is done elsewhere.
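
For illustration, the mapping above could be sketched like this (Python pseudocode of the conversion logic only, not the actual postgre_forge code; a real fix would be ported into postgre_forge._alter_table() in PHP):

```python
# MySQL -> Postgre type mapping as described in the table above.
MYSQL_TO_POSTGRE = {
    "TINYINT": "SMALLINT",
    "MEDIUMINT": "INTEGER",
    "INT": "INTEGER",
    "DOUBLE": "DOUBLE PRECISION",
    "DATETIME": "TIMESTAMP",
    "LONGTEXT": "TEXT",
    "BLOB": "BYTEA",
}

# Unsigned types widen to the next larger signed Postgre type.
UNSIGNED_TO_POSTGRE = {
    "SMALLINT": "INTEGER",
    "INT": "BIGINT",
    "BIGINT": "NUMERIC",
}

def convert_type(mysql_type: str, unsigned: bool = False) -> str:
    """Return the Postgre-compatible type for a MySQL column type."""
    t = mysql_type.upper()
    if unsigned and t in UNSIGNED_TO_POSTGRE:
        return UNSIGNED_TO_POSTGRE[t]
    return MYSQL_TO_POSTGRE.get(t, t)  # pass through unknown types

print(convert_type("datetime"))   # TIMESTAMP
print(convert_type("int", True))  # BIGINT
```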

mwhitneysdsu avatar Jun 02 '14 15:06 mwhitneysdsu

I did a little research yesterday, and I'm not feeling any better about where we need to go with this. I took the SQL standard data types (adapted from Wikipedia after wading through other sources) and compared 4 different SQL packages (MySQL, postgreSQL, MS SQL Server, and Oracle Database 11).

| Standard | MySQL | postgreSQL | MS SQL | Oracle | Description |
|---|---|---|---|---|---|
| char(n)/character(n) | char/character | char/character | char/character | char/character | fixed-width n-character string [1] |
| character varying(n) or varchar(n) | varchar/character varying | varchar/character varying | varchar/char varying/character varying | varchar2/character varying/char varying | variable-width string with a maximum size of n characters [2] |
| national character(n) or nchar(n) | nchar/national char | char/character | nchar/national char/national character | nchar/national char/national character | fixed-width n-character string supporting an international character set [3] |
| national character varying(n) or nvarchar(n) | nvarchar/national varchar | varchar/character varying | nvarchar/national char varying/national character varying | nvarchar2/national character varying/national char varying/nchar varying | variable-width string with a maximum size of n characters supporting an international character set [4] |
| bit(n) | bit | bit | N/A* | N/A? | an array of n bits [5] |
| bit varying(n) | N/A? | bit varying | N/A? | N/A? | an array of up to n bits |
| integer | int/integer | int/integer | int/integer | int/integer | the basic, usually 4-byte, integer [6] |
| smallint | smallint | smallint | smallint | smallint | usually a 2-byte integer [7] |
| bigint | bigint | bigint | bigint | N/A? | usually an 8-byte integer [8] |
| float, real, and double precision | float, real, double/double precision | float, real, double/double precision | float, real, double precision | float, real, double precision | approximate value types for storing decimal data [9] |
| numeric(precision, scale) or decimal(precision, scale) | numeric, dec/decimal, fixed | numeric, decimal | numeric, dec/decimal | numeric, decimal | exact (fixed-precision) decimal types [10] |
| date | date | date | date | date | stores a date of the format YYYY-MM-DD (many other formats are often supported) [11] |
| time | time | time | time | N/A? | stores a time of the format hh:mm:ss[.nnn] (many other formats are often supported) [12] |
| time with time zone or timetz | N/A? | time with time zone | N/A? | N/A? | stores a time with time zone information |
| timestamp | datetime* | timestamp | datetime2** | timestamp | stores a timestamp of the format YYYY-MM-DD hh:mm:ss (many other formats are often supported) [13] |
| timestamp with time zone or timestamptz | timestamp | timestamp with time zone | N/A? | timestamp with time zone | stores a timestamp with time zone information |
  1. MS and Oracle define an n-byte string, Oracle has an optional parameter to define an n-character string
  2. MS and Oracle again define an n-byte string, with Oracle supplying an optional argument to specify an n-character string - Oracle's documentation specifically states that varchar should not be used
  3. Note that in MySQL nchar is a synonym/alias for char, I did not find a similar alias in the documentation for postgreSQL
  4. The notes for nchar in MySQL and postgreSQL (3, above) apply to nvarchar; as with varchar, Oracle uses nvarchar2 instead of nvarchar
  5. *MS SQL uses bit to define a single bit, multiple bit columns are packed into bytes for storage
  6. In Oracle, integer is an alias of number(38) (a number with 38 significant digits, none after the decimal)
  7. In Oracle, smallint is an alias of number(38)
  8. I could not locate bigint in the standard itself, and Oracle did not list it as an available alias
  9. In most cases real, float, and double are aliases for one another, though it is common to use real and double as aliases for 4-byte and 8-byte floats, respectively, with float allowing definition of a given scale (which still defines either a 4-byte or 8-byte float, in most cases).
  10. In most cases numeric and decimal are aliases for one another
  11. Oracle dates can also store time alongside the date in a date field
  12. Oracle does not appear to have a specific time data type
  13. *MySQL's datetime appears to behave more like the standard timestamp than MySQL's timestamp does. **MS SQL has an internal timestamp data type which is completely unrelated and should definitely not be used for this purpose; datetime2 is probably the closest data type in MS SQL to the standard timestamp

I knew datetimes/timestamps were likely to be painful. What I didn't realize is how painful nearly everything other than the most basic numeric types would be. Additionally, the integer support looks fairly clean, but we do have issues there simply because unsigned is not supported consistently. On MS SQL Server, a tinyint is always unsigned (though that quickly becomes a non-issue because tinyint is not supported on postgre and Oracle).

The real disaster, in my opinion, is the string support. MS and Oracle will pretty easily kill UTF-8 support when using char/varchar (though Oracle gives you the option to configure the database to support UTF-8 in char/varchar2, they still have the differing syntax where the constraint is interpreted as bytes unless a char parameter is included), but postgre, for some reason, doesn't appear to supply nchar/nvarchar aliases (as MySQL does). Oracle's varchar2/nvarchar2 also makes this less than pleasant, and Microsoft's default character set for nchar/nvarchar is UCS-2 or UTF-16, depending on the version and configuration of the server.

Plus, we use tinyint and bigint relatively heavily, and, even discounting the unsigned issues, they're just not available everywhere. When it comes to cross-platform support, you can pretty much forget that set and enum even exist, run into the same issues with text/ntext as char/nchar, and probably forget about binary strings (and binary large objects) as well.

That's not to say it can't be done, and I think postgre is likely a path of least resistance to multi-platform support when coming from MySQL, but it's still painful and will require a little more thought.

Edit:

  • My first thought for dealing with the issues with timestamps was just to switch to UNIX timestamps and integers, but I quickly came across the obvious (1970 is not the greatest lower-bound for historical data) and not-so-obvious (2038 is the upper bound for a timestamp when using a signed 32-bit integer) limitations of that format. http://en.wikipedia.org/wiki/Year_2038_problem
  • One potential solution for the bigint comes from Oracle's treatment of integer aliases: define them as numeric. The upper bound of an unsigned bigint would require a numeric(20,0). To still allow moving data from an int to a bigint in Oracle, they could be defined as numeric(28,0) (or some higher precision). According to the documentation, storage requirements for numeric are based on multiples of 9 digits (on each side of the decimal) in MySQL (resulting in 9 bytes for 20 digits or 13 bytes for 28), groups of 4 digits (+ 3-8 bytes overhead) in postgreSQL (13-18 bytes or 17-22 bytes), precision (in roughly 9-digit increments) in MS SQL (13 bytes for a precision of 20-28), and round((precision + 1)/2) + 1 bytes for Oracle (12 bytes or 16 bytes).
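
Both bounds are easy to verify (a Python sketch of the arithmetic above):

```python
import datetime

# (1) The 2038 problem: a signed 32-bit UNIX timestamp tops out at
# 2**31 - 1 seconds after the epoch, early in 2038.
last_second = datetime.datetime.fromtimestamp(2**31 - 1, datetime.timezone.utc)
print(last_second.year)  # 2038

# (2) An unsigned BIGINT's maximum value is 20 digits long, which is
# why NUMERIC(20,0) is the minimum precision that can hold it.
print(len(str(2**64 - 1)))  # 20
```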

mwhitneysdsu avatar Jun 03 '14 14:06 mwhitneysdsu

@mwhitneysdsu @corkyk

Wanted to give you an update. Sorry I haven't had time to push to my fork; the changes are coming along nicely and frequently, and it hasn't been too painful once I get on a roll.

At this point, most of the Admin console is working properly against postgres, and today I got it creating new users properly which is major progress. Once I get everything with registration happy I will push to my fork.

I will add the file with my alter table/column statements, but keep in mind some of that is not necessary, since our database was initially created using boolean columns instead of integers (we were trying to be clever) and I had to switch that back to integers. You can ignore all those alters.

All in all this hasn't been too bad, and I think seeing these fixes will expose some places where code can be streamlined. In some places I have done quick hacks to achieve functionality, so I'm sure you will have a good laugh and code actual long-term solutions.

Thanks,

Clay

cpcbell avatar Jun 04 '14 20:06 cpcbell

@cpcbell @mwhitneysdsu

Good work Clay, I wish I could have been more involved in helping and taking part in this discussion, but as you've seen I have been trapped in what I will graciously call "business requirements gathering" for the whole week.

And thanks for the input/insights, Mat.

corkyk avatar Jun 05 '14 05:06 corkyk