jupysql
jupysql copied to clipboard
Boxplot error in non-integer column
Dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
%sqlplot boxplot --table train.csv --column LotFrontage
This results in error:
ProgrammingError: (duckdb.BinderException) Binder Error: No function matches the given name and argument types 'avg(VARCHAR)'. You might need to add explicit type casts.
Candidate functions:
avg(DECIMAL) -> DECIMAL
avg(SMALLINT) -> DOUBLE
avg(INTEGER) -> DOUBLE
avg(BIGINT) -> DOUBLE
avg(HUGEINT) -> DOUBLE
avg(DOUBLE) -> DOUBLE
LINE 1: ...RDER BY "LotFrontage") AS percentiles, AVG("LotFrontage") AS mean, COUNT(*) AS...
^
[SQL: SELECT PERCENTILE_DISC(LIST_VALUE(0.25, 0.50, 0.75)) WITHIN GROUP (ORDER BY "LotFrontage") AS percentiles, AVG("LotFrontage") AS mean, COUNT(*) AS N FROM "train.csv"]
(Background on this error at: https://sqlalche.me/e/20/f405)
Probably because the column is non-integer.
We can
- Display more meaningful error message like
Ensure column is a number type.. Sete.modify_exception=True - Convert these type of columns to float and handle missing values/ null values.
@mehtamohit013 any updates?
I am not entirely sure about how to implement this:
Ensure column is a number type
Because every database may have different datatypes, so is there a way to infer data type from column or from schema, that is applicable to all the databases?
The same goes for conversion, as databases may have different syntax, so I doubt one piece of code will be able to run for every database @edublancas
as long as we cover duckdb and postgres, it's fine.
let's see what error does duckdb and postgres show. then implement logic that when it detects this issue, it adds some extra context like: "the column {column} isn't a valid number" and then display the original error.
here's an example: https://github.com/ploomber/jupysql/blob/02ba21cf47b5df6c2191bf5d60823eae0ce0b7e4/src/sql/connection/connection.py#L768
(note that in the example, where showing warnings, for this case, we need to raise an exception)
@edublancas I am trying to think of a method to get the column type of a particular column that is database agnostic and works for both files as well as tables, but I am unable to come up with one.
We can't use exceptions to filter out these queries, because it gives general ProgrammingError and the only way I think it is possible is to determine column type.
I initially thought of using %sqlcmd columns --table train.csv but it doesn't work with files such as .csv. Also raised an issue: #824
Any thoughts on this?
I don't understand why we can't use the exceptions
You can catch the exception and see if the error message matches a specific pattern. If so, you can handle each case and show a more meaningful error message.
that's what we're doing here: https://github.com/ploomber/jupysql/blob/02ba21cf47b5df6c2191bf5d60823eae0ce0b7e4/src/sql/connection/connection.py#L768