Add auto_coerce option to handle mixed-type columns during Arrow conversion
Summary
When converting DataFrames to Arrow format for plotting, PyArrow fails on columns with mixed types (e.g., bytes and floats in the same column). This commonly occurs when columns have NaN values (float) mixed with other types like bytes/strings, resulting in cryptic ArrowTypeError messages.
Problem
Users encounter errors like:
pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'float' object", 'Conversion failed for column amount with type object')
The error occurs in PlotterBase._table_to_arrow() when calling pa.Table.from_pandas(). The error message doesn't clearly indicate which column is problematic or how to fix it.
Current Workaround
Users must manually debug by testing each column:
for c in df.columns:
    print('trying col', c)
    pa.Table.from_pandas(df[[c]])
Then fix problematic columns with:
df[c_bad] = df[c_bad].astype(str)
Proposed Solution
Add a strict=False / auto_coerce=True parameter that:
- Catches mixed-type conversion errors: wrap the Arrow conversion in try/except
- Auto-coerces problematic columns: convert mixed-type columns to string as a fallback
- Emits warnings: log which columns were coerced so users are aware
- Provides better error messages: when strict mode fails, indicate which column(s) failed and suggest fixes
API Options
Option A - Parameter on plot():
g.plot(df, auto_coerce=True) # default could be True for convenience
Option B - Global setting:
graphistry.settings(auto_coerce_types=True)
Option C - Both (parameter overrides global)
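For Option C, the override rule could be resolved with a `None` sentinel on the per-call parameter. The sketch below is purely illustrative: `_SETTINGS`, `settings()`, and `resolve_auto_coerce()` are hypothetical stand-ins and do not reflect how `graphistry.settings` would actually be implemented.

```python
from typing import Optional

# Hypothetical module-level settings store; not Graphistry's actual API.
_SETTINGS = {"auto_coerce_types": True}


def settings(**overrides) -> None:
    """Update global defaults (stand-in for the proposed graphistry.settings)."""
    unknown = set(overrides) - set(_SETTINGS)
    if unknown:
        raise KeyError(f"Unknown setting(s): {sorted(unknown)}")
    _SETTINGS.update(overrides)


def resolve_auto_coerce(auto_coerce: Optional[bool] = None) -> bool:
    """Per-call argument wins; None means 'use the global default'."""
    if auto_coerce is None:
        return _SETTINGS["auto_coerce_types"]
    return auto_coerce
```

With this shape, `plot()` would default `auto_coerce` to `None` rather than `True`, so that an explicit `auto_coerce=False` can be told apart from "not specified".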
Impact
This is a common pain point for users working with "dirty" real-world data. Graphistry already handles many dirty data cases (detecting/synthesizing missing nodes, guessing time columns), so auto-coercing mixed-type columns would be consistent with that philosophy.
Environment
- Affects: PlotterBase._table_to_arrow() (line ~2139)
- Related to pandas DataFrame → PyArrow Table conversion
- Common trigger: columns with NaN (float) mixed with bytes/strings
Related Discussion
From Slack discussion with @rjurney - varying/mixed data types are a frequent source of Graphistry errors for users.
Other calls that trigger an upload, such as upload(), are likely affected too.
It is unclear how deep, and where, this option should be threaded through.
Graphs can also be dirty in other ways. Should this be scoped to just the pandas → Arrow conversion, leaving the rest to the server? A strict=True/False option could be added to the server later.
It probably makes sense to expose this helper publicly, e.g. as g.to_arrow().