fix(serialization): allow unicode in json serialization
SUMMARY
This PR changes the output of `json.dumps` to show unicode characters as-is, instead of using escape sequences.
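A minimal sketch of the behavior change, using only the stdlib `json` module (the `payload` value is illustrative):

```python
import json

payload = {"city": "大盘"}

# Default: non-ASCII characters are escaped in the output
escaped = json.dumps(payload)
# With ensure_ascii=False, characters pass through unescaped
raw = json.dumps(payload, ensure_ascii=False)

print(escaped)  # {"city": "\u5927\u76d8"}
print(raw)      # {"city": "大盘"}
```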
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
Before:
After:
TESTING INSTRUCTIONS
ADDITIONAL INFORMATION
- [x] Has associated issue: Fixes #19388 Fixes #22904
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 72.86%. Comparing base (8a8248b) to head (86a72a2).
:warning: Report is 233 commits behind head on master.
Additional details and impacted files
```diff
@@ Coverage Diff @@
## master #33720 +/- ##
===========================================
+ Coverage 0 72.86% +72.86%
===========================================
Files 0 561 +561
Lines 0 40629 +40629
Branches 0 4288 +4288
===========================================
+ Hits 0 29605 +29605
- Misses 0 9908 +9908
- Partials 0 1116 +1116
```
| Flag | Coverage Δ | |
|---|---|---|
| hive | 47.10% <ø> (?) | |
| mysql | 71.86% <ø> (?) | |
| postgres | 71.91% <ø> (?) | |
| presto | 50.85% <ø> (?) | |
| python | 72.82% <ø> (?) | |
| sqlite | 71.45% <ø> (?) | |
| unit | 100.00% <ø> (?) | |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
This seems like a win to me, though seeing test-mysql fail in CI (assuming it's related to this change) shows that this change might have some upgrade implications.
If there are indeed change management issues, it may require:
- adding a line in `UPDATING.md` about required checks/changes for people upgrading: something like "if you use MySQL as your metadata database, you may need to configure MySQL with this specific flag {...}". Also making this clear in the docs around "configuring MySQL"
- double-checking that things that have been serialized with `ensure_ascii=True` can be deserialized with `ensure_ascii=False`, to make sure we don't need a database migration to re-encode what's in the metadata db. (GPT thinks it is NOT a problem ...)
related GPT thread https://chatgpt.com/share/684b5999-8d50-8010-ab53-d62e10188b50
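A quick stdlib check supports that round-trip claim: `ensure_ascii` only affects the serialized form, and `json.loads` accepts both forms, so values already stored with escapes decode to the same object (the `value` here is illustrative; MySQL charset behavior is a separate question):

```python
import json

value = {"title": "大盘 dashboard"}

stored_escaped = json.dumps(value, ensure_ascii=True)   # what is in the metadata db today
stored_raw = json.dumps(value, ensure_ascii=False)      # what new writes would produce

# Both serialized forms decode to the identical object, so no JSON-level
# re-encoding migration should be needed at deserialization time.
assert json.loads(stored_escaped) == json.loads(stored_raw) == value
```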
re-running that test, just in case.
Related: https://chatgpt.com/share/68507758-62e8-8010-8f63-a73f6c1eed55
Seems the connection string AND [maybe] the database settings need to be altered to support this for MySQL specifically (?)
If we can make CI pass, add a note in UPDATING.md, and document that it's a requirement for MySQL somewhere I'm fine with the change.
Other option would be to set ensure_ascii=False only for analytics database use cases, and keep it True for metadata database interaction, though that's probably confusing and maybe harder to maintain as we'd have different logic based on context...
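For illustration, that alternative could look something like the hypothetical helper below (the function name and keyword argument are assumptions, not existing Superset code):

```python
import json

# Hypothetical helper sketching the "two contexts" alternative discussed above:
# keep ASCII-escaping for metadata database writes, allow raw unicode elsewhere.
def dumps_for_context(obj, *, metadata_db: bool = False) -> str:
    return json.dumps(obj, ensure_ascii=metadata_db)

# Analytics/result payloads keep readable unicode...
assert dumps_for_context({"c": "大"}) == '{"c": "大"}'
# ...while metadata writes stay ASCII-safe.
assert dumps_for_context({"c": "大"}, metadata_db=True) == '{"c": "\\u5927"}'
```

As noted above, the cost is two code paths with context-dependent behavior, which is harder to reason about than a single setting.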
Copilot's suggestions, FWIW:
- Migration Failure Fix: update the migration script to handle the foreign key constraint before dropping the column. Modify the script to drop the foreign key first and then the column:

```python
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Drop the foreign key constraint first
    op.drop_constraint('slices_ibfk_1', 'slices', type_='foreignkey')
    # Then drop the column
    op.drop_column('slices', 'druid_datasource_id')

def downgrade():
    # Recreate the column (if needed during downgrade)
    op.add_column('slices', sa.Column('druid_datasource_id', sa.Integer(), nullable=True))
    # Recreate the foreign key constraint
    op.create_foreign_key('slices_ibfk_1', 'slices', 'druid_datasource', ['druid_datasource_id'], ['id'])
```
- Test Failure Fix The OperationalError indicates that the position_json column does not support certain Unicode characters. Ensure the column uses a character set that supports all Unicode characters (e.g., utf8mb4 for MySQL).
Steps:
- Update the database schema:
```sql
ALTER TABLE dashboards MODIFY COLUMN position_json TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```
- Ensure the database connection is configured to use utf8mb4 by modifying the connection string in your configuration:
```python
SQLALCHEMY_DATABASE_URI = "mysql+pymysql://<user>:<password>@<host>/<db_name>?charset=utf8mb4"
```
Done here
I don't think we need a db migration, except maybe for MySQL (unclear). In theory the json/simplejson implementation is compatible across the ensure_ascii setting. Now the issue seems to be that MySQL, depending on its setup and CHARACTER_SET default setting, chokes when getting unicode characters.
@Quatters, thanks for doing all the work here. This seems viable. I added a comment to check in with folks using MySQL as their metadata db to get their input/blessing on this approach.
Note that this PR fixes user-visible issues around unicode handling with analytics databases.
Hi @mistercrunch, @michael-s-molina, any updates on this?
We were OOO due to the holidays but I added this item to our team's backlog. Hoping to test this in the next couple weeks.
Resolved the conflict and currently re-running CI. Hopefully we can get this merged, pronto!
@Quatters @mistercrunch @rusackas I executed the following migration to convert our existing MySQL database to be adherent to the changes in this PR.
Step 1: Backup Your Database First
```shell
mysqldump -u your_username -p your_database_name > backup_$(date +%Y%m%d_%H%M%S).sql
```
Step 2: Check Current State
```sql
-- Check current database character set
SHOW CREATE DATABASE your_database_name;

-- Check all tables and their character sets
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name';

-- Check all columns with character sets
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database_name'
AND CHARACTER_SET_NAME IS NOT NULL
ORDER BY TABLE_NAME, COLUMN_NAME;
```
Step 3: Change Database Default
```sql
-- Change the database default character set
ALTER DATABASE your_database_name
DEFAULT CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
```
Step 4: Generate Conversion Commands for All Tables
```sql
-- Generate ALTER TABLE commands for all tables
SELECT CONCAT(
    'ALTER TABLE `', TABLE_NAME,
    '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;'
) AS conversion_command
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME;
```
Step 5: Optional - Check Table Sizes Before Conversion
```sql
SELECT
    TABLE_NAME,
    TABLE_ROWS,
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024), 2) AS 'Size_MB',
    ROUND(AVG_ROW_LENGTH) AS 'Avg_Row_Length'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY (DATA_LENGTH + INDEX_LENGTH);
```
Step 6: Prepare for Conversion
```sql
-- Disable foreign key checks temporarily (if you have foreign keys)
SET FOREIGN_KEY_CHECKS = 0;

-- Optional: Increase timeouts for large tables
SET SESSION innodb_lock_wait_timeout = 300;
SET SESSION lock_wait_timeout = 300;
```
Step 7: Convert All Tables Replace your_database_name with your actual database name, then run the output from Step 4. Here's the template:
```sql
-- Example conversion commands (replace with your actual table names)
ALTER TABLE `table1` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE `table2` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE `table3` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- ... (continue for all tables)
```
Step 8: Re-enable Constraints
```sql
-- Re-enable foreign key checks
SET FOREIGN_KEY_CHECKS = 1;
```
Step 9: Verify the Conversion
```sql
-- Verify database default
SHOW CREATE DATABASE your_database_name;

-- Verify all tables are converted
SELECT TABLE_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
AND TABLE_COLLATION != 'utf8mb4_0900_ai_ci';
-- Should return no rows if all tables are converted

-- Verify all character columns are converted
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database_name'
AND CHARACTER_SET_NAME IS NOT NULL
AND CHARACTER_SET_NAME != 'utf8mb4';
-- Should return no rows if all columns are converted
```
Some observations:
- You can't simply change the database default character set. If you do that, the setting will only be applicable to new tables.
- This means that you need to convert all tables and their data to the new setting.
- The process does not take very long (100k rows in 17 seconds) but does require downtime. @Quatters this needs to be explicitly called out in `UPDATING.md`.
- We also need to mention the required change for the connection string in `UPDATING.md`, as admins will not read the Configuring Superset docs.
- After the migration, characters like 大盘 are rendered correctly, but emojis like 😀🌟 are rendered as ????????
I guess the question here is: if I'm using a MySQL database and I don't run the migrations, will this break Superset for me? If it does, it's probably better to just create a migration file that detects whether the db is MySQL; if it is, run the conversion, else just pass.
I think an automatic migration as @sadpandajoe suggests is better even if it's not strictly required, to be sure that all users starting from version X have applied this migration. This seems more consistent and causes fewer problems in the future.
Therefore I plan to
- Create a migration file that detects whether MySQL is being used and applies the actions suggested by @michael-s-molina.
- Write some tests to ensure the migration works correctly.
- Add a note in `UPDATING.md` stating that applying the migration requires downtime.
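A minimal, stdlib-only sketch of the two pieces such a migration needs: a MySQL gate and a per-table conversion statement (in a real Alembic migration the gate would more likely be `op.get_bind().dialect.name == "mysql"`; the helper names here are illustrative assumptions, not actual Superset code):

```python
from urllib.parse import urlparse

COLLATION = "utf8mb4_0900_ai_ci"

def is_mysql_uri(sqlalchemy_uri: str) -> bool:
    # "mysql+pymysql://user:pass@host/db" -> scheme "mysql+pymysql"
    return urlparse(sqlalchemy_uri).scheme.split("+")[0] == "mysql"

def conversion_statements(table_names):
    # One CONVERT per table, since changing the database default character
    # set only affects tables created afterwards (see observations above).
    return [
        f"ALTER TABLE `{name}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {COLLATION};"
        for name in table_names
    ]

assert is_mysql_uri("mysql+pymysql://u:p@h/superset?charset=utf8mb4")
assert not is_mysql_uri("postgresql://u:p@h/superset")
```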
Looks like this needs conflict resolution and perhaps some migration work. I'll set it to draft for now.