fix(serialization): allow unicode in json serialization
SUMMARY
This PR changes the output of `json.dumps` to show unicode characters as-is, instead of using escape sequences.
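A minimal sketch of the behavior change, using only the stdlib `json` module (the `payload` value is illustrative):

```python
import json

payload = {"city": "大盘"}

# Default: non-ASCII characters are escaped in the output
escaped = json.dumps(payload)
# With ensure_ascii=False, characters pass through unescaped
raw = json.dumps(payload, ensure_ascii=False)

print(escaped)  # {"city": "\u5927\u76d8"}
print(raw)      # {"city": "大盘"}
```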
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
Before:
After:
TESTING INSTRUCTIONS
ADDITIONAL INFORMATION
- [x] Has associated issue: Fixes #19388 Fixes #22904
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 72.86%. Comparing base (8a8248b) to head (86a72a2).
:warning: Report is 233 commits behind head on master.
Additional details and impacted files
```diff
@@ Coverage Diff @@
## master #33720 +/- ##
===========================================
+ Coverage 0 72.86% +72.86%
===========================================
Files 0 561 +561
Lines 0 40629 +40629
Branches 0 4288 +4288
===========================================
+ Hits 0 29605 +29605
- Misses 0 9908 +9908
- Partials 0 1116 +1116
```
| Flag | Coverage Δ | |
|---|---|---|
| hive | 47.10% <ø> (?) | |
| mysql | 71.86% <ø> (?) | |
| postgres | 71.91% <ø> (?) | |
| presto | 50.85% <ø> (?) | |
| python | 72.82% <ø> (?) | |
| sqlite | 71.45% <ø> (?) | |
| unit | 100.00% <ø> (?) | |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
This seems like a win to me, though seeing test-mysql fail in CI (assuming it's related to this change) shows that this change might have some upgrade implications.
If there are indeed change management issues, it may require:
- adding a line in `UPDATING.md` about required checks/changes for people upgrading: something like "if you use MySQL as your metadata database, you may need to configure MySQL with this specific flag {...}". Also making this clear in the docs around "configuring MySQL"
- double-checking that things that have been serialized with `ensure_ascii=True` can be deserialized with `ensure_ascii=False`, to make sure we don't need a database migration to re-encode what's in the metadata db. (GPT thinks it is NOT a problem ...)
related GPT thread https://chatgpt.com/share/684b5999-8d50-8010-ab53-d62e10188b50
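A quick stdlib check supports that round-trip claim: `ensure_ascii` only affects the serialized form, and `json.loads` accepts both forms, so values already stored with escapes decode to the same object (the `value` here is illustrative; MySQL charset behavior is a separate question):

```python
import json

value = {"title": "大盘 dashboard"}

stored_escaped = json.dumps(value, ensure_ascii=True)   # what is in the metadata db today
stored_raw = json.dumps(value, ensure_ascii=False)      # what new writes would produce

# Both serialized forms decode to the identical object, so no JSON-level
# re-encoding migration should be needed at deserialization time.
assert json.loads(stored_escaped) == json.loads(stored_raw) == value
```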
re-running that test, just in case.
Related: https://chatgpt.com/share/68507758-62e8-8010-8f63-a73f6c1eed55
Seems the connection string AND [maybe] the database settings need to be altered to support this for MySQL specifically (?)
If we can make CI pass, add a note in UPDATING.md, and document that it's a requirement for MySQL somewhere I'm fine with the change.
Other option would be to set ensure_ascii=False only for analytics database use cases, and keep it True for metadata database interaction, though that's probably confusing and maybe harder to maintain as we'd have different logic based on context...
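For illustration, that alternative could look something like the hypothetical helper below (the function name and keyword argument are assumptions, not existing Superset code):

```python
import json

# Hypothetical helper sketching the "two contexts" alternative discussed above:
# keep ASCII-escaping for metadata database writes, allow raw unicode elsewhere.
def dumps_for_context(obj, *, metadata_db: bool = False) -> str:
    return json.dumps(obj, ensure_ascii=metadata_db)

# Analytics/result payloads keep readable unicode...
assert dumps_for_context({"c": "大"}) == '{"c": "大"}'
# ...while metadata writes stay ASCII-safe.
assert dumps_for_context({"c": "大"}, metadata_db=True) == '{"c": "\\u5927"}'
```

As noted above, the cost is two code paths with context-dependent behavior, which is harder to reason about than a single setting.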
Copilot's suggestions, FWIW:
- Migration Failure Fix: update the migration script to handle the foreign key constraint before dropping the column. Modify the script to drop the foreign key first and then the column:

```python
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Drop the foreign key constraint first
    op.drop_constraint('slices_ibfk_1', 'slices', type_='foreignkey')
    # Then drop the column
    op.drop_column('slices', 'druid_datasource_id')

def downgrade():
    # Recreate the column (if needed during downgrade)
    op.add_column('slices', sa.Column('druid_datasource_id', sa.Integer(), nullable=True))
    # Recreate the foreign key constraint
    op.create_foreign_key('slices_ibfk_1', 'slices', 'druid_datasource', ['druid_datasource_id'], ['id'])
```
- Test Failure Fix The OperationalError indicates that the position_json column does not support certain Unicode characters. Ensure the column uses a character set that supports all Unicode characters (e.g., utf8mb4 for MySQL).
Steps:
- Update the database schema:
```sql
ALTER TABLE dashboards MODIFY COLUMN position_json TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```
- Ensure the database connection is configured to use utf8mb4 by modifying the connection string in your configuration:
```python
SQLALCHEMY_DATABASE_URI = "mysql+pymysql://<user>:<password>@<host>/<db_name>?charset=utf8mb4"
```
Done here
I don't think we need a db migration, except maybe for MySQL (unclear). In theory the json/simplejson implementation is compatible across the ensure_ascii setting. Now the issue seems to be that MySQL, depending on its setup and CHARACTER_SET default setting, chokes when getting unicode characters.
@Quatters, thanks for doing all the work here. This seems viable. I added a comment to check in with folks using MySQL as their metadata db to get their input/blessing on this approach.
Note that this PR fixes user-visible issues around unicode handling with analytics databases.
Hi @mistercrunch, @michael-s-molina, any updates on this?
We were OOO due to the holidays but I added this item to our team's backlog. Hoping to test this in the next couple weeks.
Resolved the conflict and currently re-running CI. Hopefully we can get this merged, pronto!
@Quatters @mistercrunch @rusackas I executed the following migration to convert our existing MySQL database to be adherent to the changes in this PR.
Step 1: Backup Your Database First
```shell
mysqldump -u your_username -p your_database_name > backup_$(date +%Y%m%d_%H%M%S).sql
```
Step 2: Check Current State
```sql
-- Check current database character set
SHOW CREATE DATABASE your_database_name;

-- Check all tables and their character sets
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name';

-- Check all columns with character sets
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database_name'
AND CHARACTER_SET_NAME IS NOT NULL
ORDER BY TABLE_NAME, COLUMN_NAME;
```
Step 3: Change Database Default
```sql
-- Change the database default character set
ALTER DATABASE your_database_name
DEFAULT CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
```
Step 4: Generate Conversion Commands for All Tables
```sql
-- Generate ALTER TABLE commands for all tables
SELECT CONCAT(
    'ALTER TABLE `', TABLE_NAME,
    '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;'
) AS conversion_command
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME;
```
Step 5: Optional - Check Table Sizes Before Conversion
```sql
SELECT
    TABLE_NAME,
    TABLE_ROWS,
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024), 2) AS 'Size_MB',
    ROUND(AVG_ROW_LENGTH) AS 'Avg_Row_Length'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY (DATA_LENGTH + INDEX_LENGTH);
```
Step 6: Prepare for Conversion
```sql
-- Disable foreign key checks temporarily (if you have foreign keys)
SET FOREIGN_KEY_CHECKS = 0;

-- Optional: Increase timeouts for large tables
SET SESSION innodb_lock_wait_timeout = 300;
SET SESSION lock_wait_timeout = 300;
```
Step 7: Convert All Tables Replace your_database_name with your actual database name, then run the output from Step 4. Here's the template:
```sql
-- Example conversion commands (replace with your actual table names)
ALTER TABLE `table1` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE `table2` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE `table3` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- ... (continue for all tables)
```
Step 8: Re-enable Constraints
```sql
-- Re-enable foreign key checks
SET FOREIGN_KEY_CHECKS = 1;
```
Step 9: Verify the Conversion
```sql
-- Verify database default
SHOW CREATE DATABASE your_database_name;

-- Verify all tables are converted
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
AND TABLE_COLLATION != 'utf8mb4_0900_ai_ci';
-- Should return no rows if all tables are converted

-- Verify all character columns are converted
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database_name'
AND CHARACTER_SET_NAME IS NOT NULL
AND CHARACTER_SET_NAME != 'utf8mb4';
-- Should return no rows if all columns are converted
```
Some observations:
- You can't simply change the database default character set. If you do that, the setting will only be applicable to new tables.
- This means that you need to convert all tables and their data to the new setting.
- The process does not take very long (100k rows in 17 seconds) but does require downtime. @Quatters this needs to be explicitly called out in `UPDATING.md`.
- We also need to mention the required change for the connection string in `UPDATING.md`, as admins will not read the Configuring Superset docs.
- After the migration, characters like 大盘 are rendered correctly, but emojis like 😀🌟 are rendered as ????????
I guess the question here is: if I'm using a MySQL database and I don't run the migrations, will this break Superset for me? If it does, it's probably better to just create a migration file that detects whether the db is MySQL; if it is, run the conversion, else just pass.
I think an automatic migration as @sadpandajoe suggests is better even if it's not strictly required, to be sure that all users starting from version X have applied this migration. This seems more consistent and causes fewer problems in the future.
Therefore I plan to
- Create a migration file that detects whether MySQL is being used and applies the actions suggested by @michael-s-molina.
- Write some tests to ensure the migration works correctly.
- Add a note in `UPDATING.md` stating that applying the migration requires downtime.
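A minimal, stdlib-only sketch of the two pieces such a migration needs: a MySQL gate and a per-table conversion statement (in a real Alembic migration the gate would more likely be `op.get_bind().dialect.name == "mysql"`; the helper names here are illustrative assumptions, not actual Superset code):

```python
from urllib.parse import urlparse

COLLATION = "utf8mb4_0900_ai_ci"

def is_mysql_uri(sqlalchemy_uri: str) -> bool:
    # "mysql+pymysql://user:pass@host/db" -> scheme "mysql+pymysql"
    return urlparse(sqlalchemy_uri).scheme.split("+")[0] == "mysql"

def conversion_statements(table_names):
    # One CONVERT per table, since changing the database default character
    # set only affects tables created afterwards (see observations above).
    return [
        f"ALTER TABLE `{name}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {COLLATION};"
        for name in table_names
    ]

assert is_mysql_uri("mysql+pymysql://u:p@h/superset?charset=utf8mb4")
assert not is_mysql_uri("postgresql://u:p@h/superset")
```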
Looks like this needs conflict resolution and perhaps some migration work. I'll set it to draft for now.