great-tables
Unicode characters in column headers break when exporting to png
Prework
- [X] Read and agree to the code of conduct and contributing guidelines.
- [X] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue. - Haven't found one.
Description
When using GT.save() to save a table to a PNG file, Unicode characters in the column headers are not rendered correctly.
Reproducible example
I cloned the repo and installed it with pip install -e .[all].
Then I ran the following code:
import pandas as pd
from great_tables import GT
df = pd.DataFrame({"Żaba": ["1", "2"], "Koń": ["3", "4"]})
GT(df).save("output.png")
The output file output.png renders the column headers as "Ĺ»aba" and "KoĹ„" instead of the expected "Żaba" and "Koń".
Expected result
The Unicode characters in the column headers should be rendered correctly.
Development environment
- Operating System: macOS Sequoia 15.6.1
- great_tables Version: Tested on the current main branch: a59301b. Also present in v0.20.0.
Additional context
I did some extra digging to try to understand what the root cause of the problem is:
- The as_raw_html() call in the save() function is not passed an explicit make_page parameter, which means the function runs with make_page=False.
- This, in turn, means the table is rendered as a div (inline) element only.
- This HTML is later saved to a temporary file and opened using a webdriver.
- The webdriver takes a screenshot and thus generates the PNG.
The default webdriver is Chrome.
What's happening is that Chrome (or, specifically, the compact_enc_det library) doesn't detect the right charset for the file and therefore garbles the characters: document.characterSet ends up as windows-1250 instead of utf-8.
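The garbled headers are consistent with that explanation. As a minimal sketch (the helper name misdecode is made up for illustration and is not part of great_tables), encoding the headers as UTF-8 and decoding the bytes as windows-1250 reproduces exactly the strings seen in the PNG:

```python
def misdecode(text: str, wrong_encoding: str = "cp1250") -> str:
    """Encode as UTF-8, then decode the bytes with the wrong charset.

    "cp1250" is Python's codec name for windows-1250. This only mimics
    what a browser does after guessing the wrong charset; it does not
    involve Chrome or great_tables.
    """
    return text.encode("utf-8").decode(wrong_encoding)


for header in ("Żaba", "Koń"):
    print(header, "->", misdecode(header))
# Żaba -> Ĺ»aba
# Koń -> KoĹ„
```

"Ż" is the two bytes C5 BB in UTF-8, which windows-1250 reads as "Ĺ" and "»"; "ń" is C5 84, read as "Ĺ" and "„".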
Workarounds / Fixes
There are a few possible solutions/workarounds. All of them boil down to choosing a different web driver, telling Chrome explicitly which charset to use, or changing the HTML structure so that Chrome infers the right charset.
The ones I've identified:
- Prepend a charset meta tag (<meta charset='utf-8'>) to the rendered HTML div (make_page is still False here). This is enough for Chrome to infer the character set correctly.
html_content = "<meta charset='utf-8'>" + html_content
- Pass make_page=True. This works because the rendered page contains the charset definition. The PNGs I tried rendering using this flow seemed to be the same as the ones on the current main branch, but it might be a more invasive fix than the first option.
html_content = as_raw_html(self, make_page=True)
- Use a different webdriver. Firefox, for one, recognizes the charset correctly. Users can always pass a different web driver, or the default web driver can be changed. (The former sounds like bad UX; the latter seems unwarranted given the obscurity of this issue.)
(
GT(df)
.save(file="output.png", web_driver="firefox")
)
- Use ASCII-based column names and relabel them. In the breaking scenario, Chrome needs to interpret this:
<th id="Żaba">Żaba</th>
However, when doing this:
df_safe = df.rename(columns={"Żaba": "Zaba", "Koń": "Kon"})
GT(df_safe).cols_label(Zaba="Żaba", Kon="Koń").save("output.png")
the html ends up being:
<th id="Zaba">Żaba</th>
which is interpreted by Chrome correctly.
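To make the first option (prepending the charset meta tag) a bit more concrete, here is a rough sketch of what the change could look like. The helper names with_charset_meta and write_for_screenshot, the file name table.html, and the sample fragment are all made up for illustration; this is not the actual code in save(), only the idea of declaring the charset before the fragment is written to the temporary file.

```python
import tempfile
from pathlib import Path

CHARSET_META = "<meta charset='utf-8'>"


def with_charset_meta(html_fragment: str) -> str:
    """Prepend a charset declaration unless the fragment already has one."""
    if "<meta charset" in html_fragment.lower():
        return html_fragment
    return CHARSET_META + html_fragment


def write_for_screenshot(html_fragment: str, directory: str) -> Path:
    """Write the fragment as UTF-8 with the charset declared up front.

    Hypothetical stand-in for the "save to a temporary file" step; the
    real save() function may structure this differently.
    """
    path = Path(directory) / "table.html"
    path.write_text(with_charset_meta(html_fragment), encoding="utf-8")
    return path


# Made-up fragment resembling the div-only output (make_page=False).
fragment = '<div><table><tr><th id="Żaba">Żaba</th></tr></table></div>'

with tempfile.TemporaryDirectory() as tmp:
    written = write_for_screenshot(fragment, tmp).read_bytes()
```

With the declaration at the very start of the file, the browser no longer has to guess the encoding from the bytes. The check for an existing declaration keeps the helper safe to call twice.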
I'd be more than happy to contribute a PR, even if it's a one-liner fix. Thanks a lot for the great great_tables!