datajoint-python
Support making UTF8/Unicode configurable
For many char and varchar variables, we want to stay with the compact latin1 character set. latin1 has been MySQL's default and DataJoint relied on the default character set. If the default was changed, it caused unforeseen problems such as #453.
However, we might need to use the utf8mb4 character set for fields that could potentially be used for entering arbitrary text.
This issue will define how character sets are specified in DataJoint.
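To make the trade-off concrete, here is a small sketch comparing how many bytes the same strings take in a single-byte encoding versus UTF-8. It uses Python's "latin-1" codec as a stand-in for MySQL's latin1, which is only an approximation, and the sample strings are made up for illustration.

```python
# Compare the encoded size of the same text in a single-byte encoding and UTF-8.
# Assumption: Python's "latin-1" codec approximates MySQL's latin1 well enough here.
samples = ["Alice", "José", "日本語", "🙂"]


def encoded_size(text, encoding):
    """Return the byte length of text in encoding, or None if it cannot be represented."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


# Map each sample to (latin-1 size, utf-8 size).
sizes = {s: (encoded_size(s, "latin-1"), encoded_size(s, "utf-8")) for s in samples}

for text, (latin1_bytes, utf8_bytes) in sizes.items():
    print(f"{text!r}: latin-1={latin1_bytes}, utf-8={utf8_bytes}")
```

Text outside the single-byte repertoire cannot be stored at all in the compact encoding, which is the case that motivates allowing utf8mb4 for free-text fields.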
Thinking this could be handled via dj.config defaults, although that allows the potential for client configuration mismatches.
I suppose the question is if we want this to be a global setting, or database/table specific, and if so, then where to store the config (server side or client side).
If anything, this ought to be a way to specify the character set for each field, with latin1 being the default. As far as MySQL is concerned, the character set can be specified for each column separately from the table-wide "default" character set (which we have now set to be latin1).
So the real question is what would be a good syntax for specifying a special character set for individual attributes, should someone decide that's necessary. We could imagine using something like:
experimenter: varchar(128)[utf8] # name of the experimenter
Since the data type is passed to MySQL, the following should already work:
experimenter: varchar(128) character set utf8mb4 # name
We just need to test and document.
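If the bracket shorthand were adopted, one way to support it would be to expand it into the MySQL clause before the definition is processed. The following is only a sketch of that idea: `expand_charset_shorthand` and the regular expression are hypothetical helpers written for this comment, not part of DataJoint, and they only handle char and varchar types.

```python
import re

# Hypothetical shorthand: "<type>[<charset>]", e.g. "varchar(128)[utf8]".
_SHORTHAND = re.compile(
    r"^(?P<type>(?:var)?char\(\d+\))\s*\[(?P<charset>\w+)\]$", re.IGNORECASE
)


def expand_charset_shorthand(line):
    """Rewrite 'name: type[charset] # comment' to use MySQL's 'character set' clause.

    Lines that do not use the shorthand are returned unchanged.
    """
    declaration, sep, comment = line.partition("#")
    name, colon, dtype = declaration.partition(":")
    match = _SHORTHAND.match(dtype.strip())
    if not colon or match is None:
        return line
    expanded = f"{name.strip()}: {match['type']} character set {match['charset']}"
    return f"{expanded} # {comment.strip()}" if sep else expanded


print(expand_charset_shorthand(
    "experimenter: varchar(128)[utf8] # name of the experimenter"
))
```

A line that already spells out `character set` passes through untouched, so both forms could coexist.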
Re: per-field settings: this seems to 'take'. Documentation TBD.
Re: database level -
Thinking a few options might be good:
a) take from dj.config (good for overrides but implies dj.config needs to be consistent across users)
dj.config['database.charset'] = 'utf8mb4'
b) database/schema level: have this be a dj.schema instantiation option or method:
schema = dj.schema('foo_database', charset='utf8mb4')
schema = dj.schema('foo_database')
schema.set_default_charset('utf8mb4')
c) table level: perhaps also an option on the decorator or another 'special' class attribute. Although this does require synchronization between users, it would typically be done via code, which is shared anyway. I think decorator options might make the most sense here.
i)
@schema(charset='utf8mb4')
class Foo(dj.Manual):
    ...
ii)
@schema
class Foo(dj.Manual):
    charset = 'utf8mb4'
    definition = '''...'''
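If more than one of these levels ends up being supported, they need a precedence order. A minimal sketch, assuming the most specific setting wins (table, then schema, then dj.config, then the latin1 default); `resolve_charset` is a hypothetical helper, a plain dict stands in for dj.config, and the ordering itself has not been decided in this thread.

```python
DEFAULT_CHARSET = "latin1"  # the table-wide default mentioned earlier in this thread


def resolve_charset(table_charset=None, schema_charset=None, config=None):
    """Return the most specific charset that was set.

    Assumed precedence (not decided in this issue): table > schema > config > default.
    """
    config = config or {}
    candidates = (table_charset, schema_charset, config.get("database.charset"))
    for candidate in candidates:
        if candidate:
            return candidate
    return DEFAULT_CHARSET


# A plain dict stands in for dj.config in this sketch.
client_config = {"database.charset": "utf8mb4"}
print(resolve_charset(config=client_config))
print(resolve_charset(table_charset="latin1", config=client_config))
```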
Also noted: a mechanism for setting the character set collation at the schema and table level should also be included with the solution.
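On the MySQL side, charset and collation travel together in the same DDL clause, so whatever option is chosen could carry both. A rough sketch of generating that clause for schema creation follows; `charset_clause` and `create_schema_sql` are hypothetical helpers, not existing DataJoint functions, and the collation name is just an example.

```python
import re

# Only allow simple identifiers, since these values are interpolated into SQL.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def charset_clause(charset, collation=None):
    """Build a 'DEFAULT CHARACTER SET ... [COLLATE ...]' clause for MySQL DDL."""
    for value in (charset, collation):
        if value is not None and not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"invalid charset/collation name: {value!r}")
    clause = f"DEFAULT CHARACTER SET {charset}"
    if collation is not None:
        clause += f" COLLATE {collation}"
    return clause


def create_schema_sql(name, charset, collation=None):
    """Return a CREATE DATABASE statement carrying the charset (and collation)."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid schema name: {name!r}")
    return f"CREATE DATABASE IF NOT EXISTS `{name}` {charset_clause(charset, collation)}"


print(create_schema_sql("foo_database", "utf8mb4", "utf8mb4_unicode_ci"))
```

The same clause is accepted as a table option in CREATE TABLE, so a table-level setting could reuse `charset_clause` unchanged.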
changed bug from: "support UTF8" to "Support making UTF8/Unicode configurable"; scope is beyond UTF8, level/understanding of 1st-tier support (e.g. tests, prioritizing issues, etc) for alternative configurations remains TBD.