Support making UTF8/Unicode configurable

Open dimitri-yatsenko opened this issue 6 years ago • 6 comments

For many char and varchar variables, we want to stay with the compact latin1 character set. latin1 has been MySQL's default and DataJoint relied on the default character set. If the default was changed, it caused unforeseen problems such as #453.

However, we might need to use the utf8mb4 character set for fields that could potentially be used for entering arbitrary text.
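To make the trade-off concrete, here is a minimal sketch comparing how many bytes the same strings take under each encoding. It uses Python's `latin-1` and `utf-8` codecs as stand-ins for MySQL's `latin1` and `utf8mb4`; that mapping is an approximation (MySQL's `latin1` is close to, but not identical with, ISO-8859-1), and `encoded_size` is a hypothetical helper written for this illustration.

```python
# Sample strings: plain ASCII, Western European, and Japanese text.
samples = [
    "experimenter",
    "Jos\u00e9",              # "Jose" with an acute accent on the e
    "\u65e5\u672c\u8a9e",     # three CJK characters
]


def encoded_size(text, codec):
    """Return the encoded byte length, or None if the codec cannot represent text."""
    try:
        return len(text.encode(codec))
    except UnicodeEncodeError:
        return None


# Map each sample to its (latin-1 size, utf-8 size) pair.
sizes = {s: (encoded_size(s, "latin-1"), encoded_size(s, "utf-8")) for s in samples}
```

latin1 always uses one byte per character but cannot store text outside its repertoire, while utf8mb4 can store any Unicode text at up to four bytes per character.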

This issue will specify how character sets will be specified in datajoint.

dimitri-yatsenko avatar Mar 04 '18 00:03 dimitri-yatsenko

Thinking this could be handled via dj.config defaults, although that allows the potential for client configuration mismatches.

I suppose the question is if we want this to be a global setting, or database/table specific, and if so, then where to store the config (server side or client side).

ixcat avatar Mar 05 '18 20:03 ixcat

If anything, this ought to be a way to specify the character set for each field, with latin1 being the default. As far as MySQL is concerned, the character set can be specified for each column separately from the table-wide "default" character set (which we have now set to be latin1).

So the real question is what would be a good syntax for specifying a special character set for particular attributes, should someone decide that's necessary. We could imagine using something like:

experimenter:  varchar(128)[utf8]     # name of the experimenter

eywalker avatar Mar 06 '18 07:03 eywalker

Since the data type is passed to MySQL as-is, the following should already work:

experimenter: varchar(128)   character set utf8mb4   # name 

We just need to test and document.
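A rough illustration of why the pass-through works, assuming a greatly simplified attribute parser. This is not DataJoint's actual parser; `ATTRIBUTE` and `to_column_sql` are hypothetical names, and the NOT NULL clause is an assumption of this sketch. The point is only that whatever follows the colon ends up verbatim in the column declaration.

```python
import re

# Hypothetical, simplified pattern for "name : type  # comment" lines.
# A real parser would also handle defaults, nullability, and more.
ATTRIBUTE = re.compile(
    r"^\s*(?P<name>\w+)\s*:\s*(?P<type>[^#]+?)\s*(?:#\s*(?P<comment>.*))?$"
)


def to_column_sql(line):
    """Turn one attribute line into a MySQL-style column declaration."""
    m = ATTRIBUTE.match(line)
    if m is None:
        raise ValueError("not an attribute line: %r" % line)
    name = m.group("name")
    # Collapse runs of whitespace; the type text is otherwise passed through untouched.
    type_sql = " ".join(m.group("type").split())
    # Escape single quotes for use inside a SQL string literal.
    comment = (m.group("comment") or "").strip().replace("'", "''")
    return "`%s` %s NOT NULL COMMENT '%s'" % (name, type_sql, comment)


column = to_column_sql("experimenter: varchar(128)   character set utf8mb4   # name ")
```

Here the `character set utf8mb4` clause rides along as part of the type, so MySQL sees an ordinary per-column character set declaration.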

dimitri-yatsenko avatar Mar 06 '18 13:03 dimitri-yatsenko

Re: per-field settings: this seems to 'take'. Documentation TBD.

Re: database level -

Thinking 2 options might be good:

a) take from dj.config (good for overrides but implies dj.config needs to be consistent across users)

dj.config['database.charset'] = 'utf8mb4'

b) database/schema level: have this be a dj.schema instantiation option or method:

# proposed: as an instantiation option
schema = dj.schema('foo_database', charset='utf8mb4')
# proposed: or as a method call after instantiation
schema = dj.schema('foo_database')
schema.set_default_charset('utf8mb4')

c) table level: perhaps also an option on the decorator or another 'special' class attribute. Although this does require synchronization between users, it would typically be done via code, which is shared anyway. I think decorator options might make the most sense here.

i)

@schema(charset='utf8mb4')
class Foo(dj.Manual):
    ...

ii)

@schema
class Foo(dj.Manual):
    charset = 'utf8mb4'
    definition = '''...'''
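If several of these levels were supported at once, some precedence rule would be needed. A minimal sketch, assuming the most specific setting wins (column, then table, then schema, then dj.config) and latin1 as the fallback. The function `resolve_charset` and this ordering are assumptions for illustration; nothing like this exists in DataJoint as described here.

```python
# Fallback when no level specifies a character set (the current table-wide default).
DEFAULT_CHARSET = "latin1"


def resolve_charset(column=None, table=None, schema=None, config=None):
    """Return the most specific character set given, else the default."""
    # Ordered from most specific to least specific.
    for candidate in (column, table, schema, config):
        if candidate:
            return candidate
    return DEFAULT_CHARSET


# A schema-level setting overrides a client-side config value.
chosen = resolve_charset(schema="utf8mb4", config="latin1")
```

Resolving in this order would keep a shared, code-level declaration from being silently overridden by a mismatched client configuration.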

ixcat avatar Mar 28 '18 19:03 ixcat

Also noted: a mechanism for setting the character set collation at schema and table level should also be included with the solution.

ixcat avatar Jul 24 '18 18:07 ixcat

Changed the issue title from "support UTF8" to "Support making UTF8/Unicode configurable"; the scope is beyond UTF8. The level of first-tier support (e.g. tests, prioritizing issues, etc.) for alternative configurations remains TBD.

ixcat avatar Jul 24 '18 18:07 ixcat