Add schema option to "read_xlsx"
Is your feature request related to a problem?
Using the "spatial" extension workaround to read Excel files, I was able to run

```python
con.read_geo("test.xlsx", table_name="newtable", layer="sheetname", open_options=["HEADERS=AUTO", "FIELD_TYPES=STRING"])
table = con.table("newtable")
table.try_cast(schema)
```

and then use `try_cast` on "newtable" to apply a schema. However, `con.read_xlsx` with `all_varchar=True` creates a table whose data is less malleable when using

```python
table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, all_varchar=True)
table.try_cast(schema)
```
Many of the "date" and "time" values do not cast ~correctly~ (auto-magically, as desired) when using `read_xlsx`.

Add a schema option to `read_xlsx` that takes a schema and runs the following raw SQL under the hood:
```sql
CREATE TABLE test (a DOUBLE, b DOUBLE);
COPY test FROM 'test.xlsx' WITH (FORMAT xlsx, HEADER);
SELECT * FROM test;
```
where the Python `ibis.Schema` generates the raw SQL for the `CREATE TABLE` field types. Alternatively, create an ergonomic way to do this if the desire is to limit the API complexity of the `read_xlsx` method.
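As a sketch of what that lowering could look like (the names `TYPE_MAP` and `schema_to_ddl` here are hypothetical illustrations, not ibis internals), the schema-to-DDL step only needs a mapping from ibis dtype names to DuckDB column types:

```python
# Hypothetical sketch: lower an ibis-style schema (column name -> dtype name)
# into the CREATE TABLE statement DuckDB needs before COPY ... (FORMAT xlsx).
TYPE_MAP = {
    "float64": "DOUBLE",
    "int64": "BIGINT",
    "string": "VARCHAR",
    "date": "DATE",
    "time": "TIME",
    "boolean": "BOOLEAN",
}

def schema_to_ddl(table_name: str, schema: dict) -> str:
    # Build one "name TYPE" clause per column, in schema order.
    cols = ", ".join(f"{name} {TYPE_MAP[dtype]}" for name, dtype in schema.items())
    return f"CREATE TABLE {table_name} ({cols});"

print(schema_to_ddl("test", {"a": "float64", "b": "float64"}))
# CREATE TABLE test (a DOUBLE, b DOUBLE);
```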
What is the motivation behind your request?
The workflow example can be solved with `con.load_extension("excel")` and a `con.raw_sql`, but the ergonomics would be cleaner if there were a way to get the same result within ibis in some form.
Describe the solution you'd like
Add a `schema` parameter to the `read_xlsx` method:

```python
table = con.read_xlsx("test.xlsx", header=True, sheet="sheetname", ignore_errors=True, schema=schema)
```
As an alternative, this may be a support issue for the DuckDB native excel extension, but I assume their implementation choices were for robustness, even if `all_varchar=True` renders data that is less malleable than what is returned from `read_geo` with `FIELD_TYPES=STRING`.

Below is not an implementation recommendation, just a hack to scratch my particular itch:
```python
def schema2fieldtypes(schema: ibis.Schema) -> str:
    fields = []
    for k, v in schema.as_struct().items():
        fields.append(f"{k} {v.name.upper()}")
    return ", ".join(fields).replace("FLOAT64", "FLOAT8")


def read_excel(excel_file: pathlib.Path, connection: ibis.BaseBackend, sheet_name: str = "Sheet1", schema: ibis.Schema = None, sheet_range: str = None):
    from uuid import uuid4

    table_name = f"read_excel_{uuid4()}".replace("-", "")
    connection.load_extension("excel")
    if not schema:
        return connection.read_xlsx(excel_file, sheet=sheet_name, ignore_errors=True, range=sheet_range)
    fields = f"({schema2fieldtypes(schema)})"
    sheet_range = f", RANGE '{sheet_range}'" if sheet_range else ""
    raw_sql = ""
    raw_sql += f"CREATE TABLE {table_name} {fields}; "
    # The file path and sheet name must be quoted in the generated SQL.
    raw_sql += f"COPY {table_name} FROM '{excel_file}' WITH (FORMAT xlsx, SHEET '{sheet_name}', IGNORE_ERRORS TRUE{sheet_range}); "
    raw_sql += f"SELECT * FROM {table_name};"
    connection.raw_sql(raw_sql)
    return connection.table(table_name)
```
Here is a more robust version of my `read_excel` hack, if someone stumbles on this post later and finds it useful:
```python
import ibis
import ibis.expr.schema as sch
import pathlib
from uuid import uuid4


def read_excel(excel_file: pathlib.Path, connection: ibis.BaseBackend, sheet_name: str = "Sheet1", schema: sch.Schema = None, sheet_range: str = None) -> ibis.expr.types.Table:
    """
    Reads data from an Excel sheet into an Ibis table expression.

    This function provides two modes:
    1. If no schema is provided, it uses the backend's native `read_xlsx`.
    2. If a schema is provided, it constructs and executes raw SQL to:
       a. Create a temporary table with the specified schema.
       b. Use the backend's COPY command (requires the 'excel' extension)
          to load data from the Excel file into the temporary table.
       c. Return an Ibis table expression referencing the temporary table.
       This method allows for explicit schema enforcement during loading.

    :param excel_file: Path to the Excel file (.xlsx).
    :type excel_file: pathlib.Path
    :param connection: Active Ibis backend connection.
    :type connection: ibis.BaseBackend
    :param sheet_name: Name of the sheet to read within the Excel file.
    :type sheet_name: str
    :param schema: Optional Ibis schema to define table structure and types.
                   If provided, triggers the raw SQL CREATE/COPY method.
    :type schema: ibis.Schema, optional
    :param sheet_range: Optional specific cell range to read (e.g., "A1:D10").
                        Only used when a schema is provided (for the COPY command).
    :type sheet_range: str, optional
    :return: An Ibis table expression representing the data from the Excel sheet.
    :rtype: ibis.expr.types.Table
    """
    table_name = f"read_excel_{uuid4()}".replace("-", "")
    connection.load_extension("excel")
    if not schema:
        return connection.read_xlsx(excel_file, sheet=sheet_name, ignore_errors=True, range=sheet_range)
    # Get the dialect string from the connection (e.g., 'duckdb', 'postgres').
    # Using connection.name often works as the dialect identifier for sqlglot.
    dialect_name = connection.name
    try:
        # Convert the ibis.Schema to a list of sqlglot ColumnDef objects.
        column_defs = schema.to_sqlglot(dialect=dialect_name)
        # Format each ColumnDef object as a SQL string and join.
        fields_str = ", ".join(col_def.sql(dialect=dialect_name) for col_def in column_defs)
        fields = f"({fields_str})"
    except ImportError:
        # sqlglot is usually installed alongside ibis; raise if it's missing.
        raise ImportError("sqlglot is required for schema.to_sqlglot functionality.")
    sheet_range_clause = f", RANGE '{sheet_range}'" if sheet_range else ""
    raw_sql = ""
    raw_sql += f"CREATE OR REPLACE TABLE {table_name} {fields}; "
    raw_sql += f"COPY {table_name} FROM '{str(excel_file)}' WITH (FORMAT xlsx, SHEET '{sheet_name}', IGNORE_ERRORS TRUE{sheet_range_clause}); "
    connection.raw_sql(raw_sql)
    return connection.table(table_name)
```
What version of ibis are you running?
10.3.1
What backend(s) are you using, if any?
DuckDB
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Thanks for the issue!
Do you have a file or some example data that doesn't work as you'd like it to that you could upload here?
I will have to make a separate example xlsx file that surfaces the same issues I encountered with my data, and I should be able to provide that in the next few days.
Below is some IPython REPL output using the test.xlsx file for these tests:

- `read_geo` (w/ `FIELD_TYPES=STRING`) and a cast of the `read_geo` result
- `read_xlsx` (w/ `all_varchar=True`) and a cast of the `read_xlsx` result
- `read_xlsx` (w/ `all_varchar=False`) and a cast of the `read_xlsx` result
- `read_xlsx` (w/ my custom function and a schema)
Issues encountered:

- The `read_geo` implementation is tolerant of a blank first line after the header and will read past the end of the file; `read_xlsx` stops reading at a blank first line after the header and needs the `range` parameter set.
- `read_geo` (w/ `FIELD_TYPES=STRING`) interprets values into a string form that is more flexible for `try_cast`.
- `read_geo` and `read_xlsx` may have parity at guessing the types with a more strictly formatted Excel file, but this should be tested (I have not done any exhaustive testing of this).
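For context on why the `all_varchar=True` strings fail to cast: Excel stores dates as serial day counts (effectively epoch 1899-12-30, due to Excel's 1900 leap-year quirk) and times as fractions of a day, so `read_xlsx` hands `try_cast` strings like "40986" and "0.58680555555555558" that DATE/TIME parsing cannot interpret. A minimal sketch of the conversions that would be needed (illustrative helpers, not part of any library):

```python
from datetime import date, time, timedelta

# Excel's serial-date epoch for serials > 59 (it wrongly treats 1900 as a leap year).
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial: int) -> date:
    # A serial like 40986 is a day count from the epoch.
    return EXCEL_EPOCH + timedelta(days=serial)

def excel_fraction_to_time(fraction: float) -> time:
    # A value like 0.58680555... is a fraction of a 86400-second day.
    secs = round(fraction * 86400)
    return time(secs // 3600, (secs % 3600) // 60, secs % 60)

print(excel_serial_to_date(40986))                   # 2012-03-18
print(excel_fraction_to_time(0.58680555555555558))   # 14:05:00
```

These match the values in the tables above: serial 40986 is 2012-03-18, and 0.58680555... is 14:05:00.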
```python
castings = dict(date_type="date",
                int_type="int",
                time0="time",
                time1="time",
                bool0="bool",
                bool1="bool")
```
In [87]: geo = con.read_geo("test.xlsx", open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING'])
In [88]: geo
Out[88]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string │ string │ string │ string │ string │ string │
├────────────┼──────────┼──────────┼──────────┼────────┼────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012/03/18 │ 125 │ NA │ NA │ PASS │ NA │
│ 2012/03/18 │ 203 │ NA │ NA │ PASS │ NA │
│ 2012/03/18 │ 404 │ 14:05:00 │ 14:08:47 │ Y │ Y │
│ 2012/03/18 │ 235 │ 14:05:00 │ 14:07:25 │ Y │ Y │
│ 2012/03/18 │ 913 │ 14:05:00 │ 14:07:58 │ Y │ Y │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012/03/19 │ 203 │ 15:07:00 │ 15:09:30 │ Y │ N │
│ 2012/03/19 │ 537 │ 15:07:00 │ 15:11:03 │ Y │ Y │
│ 2012/03/19 │ 613 │ 15:07:00 │ ND │ Y │ Y │
│ … │ … │ … │ … │ … │ … │
└────────────┴──────────┴──────────┴──────────┴────────┴────────┘
In [89]: geo.try_cast(castings)
Out[89]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404 │ 14:05:00 │ 14:08:47 │ True │ True │
│ 2012-03-18 │ 235 │ 14:05:00 │ 14:07:25 │ True │ True │
│ 2012-03-18 │ 913 │ 14:05:00 │ 14:07:58 │ True │ True │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203 │ 15:07:00 │ 15:09:30 │ True │ False │
│ 2012-03-19 │ 537 │ 15:07:00 │ 15:11:03 │ True │ True │
│ 2012-03-19 │ 613 │ 15:07:00 │ NULL │ True │ True │
│ … │ … │ … │ … │ … │ … │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
In [90]: xlsx = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, all_varchar=True, range = "A1:F11")
In [91]: xlsx
Out[91]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ string │ string │ string │ string │ string │ string │
├───────────┼──────────┼─────────────────────┼─────────────────────┼────────┼────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 40986 │ 125 │ NA │ NA │ PASS │ NA │
│ 40986 │ 203 │ NA │ NA │ PASS │ NA │
│ 40986 │ 404 │ 0.58680555555555558 │ 0.5894328703703704 │ Y │ Y │
│ 40986 │ 235 │ 0.58680555555555558 │ 0.58848379629629632 │ Y │ Y │
│ 40986 │ 913 │ 0.58680555555555558 │ 0.58886574074074072 │ Y │ Y │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 40987 │ 203 │ 0.62986111111111109 │ 0.63159722222222225 │ Y │ N │
│ 40987 │ 537 │ 0.62986111111111109 │ 0.63267361111111109 │ Y │ Y │
│ 40987 │ 613 │ 0.62986111111111109 │ ND │ Y │ Y │
└───────────┴──────────┴─────────────────────┴─────────────────────┴────────┴────────┘
In [92]: xlsx.try_cast(castings)
Out[92]:
┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├───────────┼──────────┼───────┼───────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 125 │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 203 │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 404 │ NULL │ NULL │ True │ True │
│ NULL │ 235 │ NULL │ NULL │ True │ True │
│ NULL │ 913 │ NULL │ NULL │ True │ True │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ NULL │ 203 │ NULL │ NULL │ True │ False │
│ NULL │ 537 │ NULL │ NULL │ True │ True │
│ NULL │ 613 │ NULL │ NULL │ True │ True │
└───────────┴──────────┴───────┴───────┴─────────┴─────────┘
In [93]: xlsx_type_guess = con.read_xlsx("test.xlsx", header=True, ignore_errors=True, range = "A1:F11")
In [94]: xlsx_type_guess
Out[94]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ float64 │ float64 │ time │ float64 │ float64 │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125.0 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203.0 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404.0 │ 0.586806 │ 14:08:47 │ NULL │ NULL │
│ 2012-03-18 │ 235.0 │ 0.586806 │ 14:07:25 │ NULL │ NULL │
│ 2012-03-18 │ 913.0 │ 0.586806 │ 14:07:58 │ NULL │ NULL │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203.0 │ 0.629861 │ 15:09:30 │ NULL │ NULL │
│ 2012-03-19 │ 537.0 │ 0.629861 │ 15:11:03 │ NULL │ NULL │
│ 2012-03-19 │ 613.0 │ 0.629861 │ NULL │ NULL │ NULL │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
In [95]: xlsx_type_guess.try_cast(castings)
Out[95]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├────────────┼──────────┼───────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404 │ NULL │ 14:08:47 │ NULL │ NULL │
│ 2012-03-18 │ 235 │ NULL │ 14:07:25 │ NULL │ NULL │
│ 2012-03-18 │ 913 │ NULL │ 14:07:58 │ NULL │ NULL │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203 │ NULL │ 15:09:30 │ NULL │ NULL │
│ 2012-03-19 │ 537 │ NULL │ 15:11:03 │ NULL │ NULL │
│ 2012-03-19 │ 613 │ NULL │ NULL │ NULL │ NULL │
└────────────┴──────────┴───────┴──────────┴─────────┴─────────┘
And using my custom reader defined in my earlier comment
In [96]: excel_hack = read_excel("test.xlsx", connection=con, schema = geo.try_cast(castings).schema(), sheet_range = "A1:F11")
In [97]: excel_hack
Out[97]:
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ date_type ┃ int_type ┃ time0 ┃ time1 ┃ bool0 ┃ bool1 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ date │ int64 │ time │ time │ boolean │ boolean │
├────────────┼──────────┼──────────┼──────────┼─────────┼─────────┤
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 125 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 203 │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-18 │ 404 │ 14:05:00 │ 14:08:47 │ True │ True │
│ 2012-03-18 │ 235 │ 14:05:00 │ 14:07:25 │ True │ True │
│ 2012-03-18 │ 913 │ 14:05:00 │ 14:07:58 │ True │ True │
│ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │
│ 2012-03-19 │ 203 │ 15:07:00 │ 15:09:30 │ True │ False │
│ 2012-03-19 │ 537 │ 15:07:00 │ 15:11:03 │ True │ True │
│ 2012-03-19 │ 613 │ 15:07:00 │ NULL │ True │ True │
└────────────┴──────────┴──────────┴──────────┴─────────┴─────────┘
My first thought is that we should try to fix this upstream in duckdb. I would hope that in duckdb, FROM read_xlsx(all_varchar=true) and FROM st_read(open_options = ['HEADERS=AUTO','FIELD_TYPES=STRING']) should result in the same data.
> As an alternative, this may be a support issue for the DuckDB native excel extension, but I assume their implementation choices were for robustness, even if `all_varchar=True` renders data that is less malleable than what is returned from `read_geo` with `FIELD_TYPES=STRING`.
Are you saying you don't think this is a good idea? Can you explain this further?
I would think that `FROM read_xlsx(all_varchar=true)` and `FROM st_read(open_options=['HEADERS=AUTO','FIELD_TYPES=STRING'])` should produce the same output; however, I do not know whether output parity was a goal of the DuckDB excel extension, as the `st_read` implementation may be handling many edge cases, and perhaps `read_xlsx` expects the Excel file to be formatted a specific way (just conjecture, as I haven't communicated with the DuckDB folks). The DuckDB API does provide a more structured option where the user creates the table with field types, followed by `COPY ...`, and this seems to wrangle the poorly formatted Excel file into a usable form consistent with `st_read` (using STRINGS) followed by `try_cast`.
@omdaniel can you file an issue with duckdb and link to it here? Let's try to solve it there before we start to be hacky on our end.
I opened an issue with DuckDB excel extension #40