jupysql icon indicating copy to clipboard operation
jupysql copied to clipboard

Boxplot error in non-integer column

Open neelasha23 opened this issue 2 years ago • 5 comments

Dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

%sqlplot boxplot --table train.csv --column LotFrontage

This results in error:

ProgrammingError: (duckdb.BinderException) Binder Error: No function matches the given name and argument types 'avg(VARCHAR)'. You might need to add explicit type casts.
	Candidate functions:
	avg(DECIMAL) -> DECIMAL
	avg(SMALLINT) -> DOUBLE
	avg(INTEGER) -> DOUBLE
	avg(BIGINT) -> DOUBLE
	avg(HUGEINT) -> DOUBLE
	avg(DOUBLE) -> DOUBLE

LINE 1: ...RDER BY "LotFrontage") AS percentiles, AVG("LotFrontage") AS mean, COUNT(*) AS...
                                                  ^
[SQL: SELECT PERCENTILE_DISC(LIST_VALUE(0.25, 0.50, 0.75)) WITHIN GROUP (ORDER BY "LotFrontage") AS percentiles, AVG("LotFrontage") AS mean, COUNT(*) AS N FROM "train.csv"]
(Background on this error at: https://sqlalche.me/e/20/f405)

Probably because the column is non-integer.

We can

  1. Display more meaningful error message like Ensure column is a number type.. Set e.modify_exception=True
  2. Convert these type of columns to float and handle missing values/ null values.

neelasha23 avatar May 03 '23 12:05 neelasha23

@mehtamohit013 any updates?

edublancas avatar Aug 17 '23 14:08 edublancas

I am not entirely sure about how to implement this:

Ensure column is a number type

Because every database may have different datatypes, so is there a way to infer data type from column or from schema, that is applicable to all the databases?

The same goes for conversion, as databases may have different syntax, so I doubt one piece of code will be able to run for every database @edublancas

mehtamohit013 avatar Aug 17 '23 15:08 mehtamohit013

as long as we cover duckdb and postgres, it's fine.

let's see what error does duckdb and postgres show. then implement logic that when it detects this issue, it adds some extra context like: "the column {column} isn't a valid number" and then display the original error.

here's an example: https://github.com/ploomber/jupysql/blob/02ba21cf47b5df6c2191bf5d60823eae0ce0b7e4/src/sql/connection/connection.py#L768

(note that in the example, where showing warnings, for this case, we need to raise an exception)

edublancas avatar Aug 18 '23 00:08 edublancas

@edublancas I am trying to think of a method to get the column type of a particular column that is database agnostic and works for both files as well as tables, but I am unable to come up with one.

We can't use exceptions to filter out these queries, because it gives general ProgrammingError and the only way I think it is possible is to determine column type.

I initially thought of using %sqlcmd columns --table train.csv but it doesn't work with files such as .csv. Also raised an issue: #824

Any thoughts on this?

mehtamohit013 avatar Aug 22 '23 18:08 mehtamohit013

I don't understand why we can't use the exceptions

You can catch the exception and see if the error message matches a specific pattern. If so, you can handle each case and show a more meaningful error message.

that's what we're doing here: https://github.com/ploomber/jupysql/blob/02ba21cf47b5df6c2191bf5d60823eae0ce0b7e4/src/sql/connection/connection.py#L768

edublancas avatar Aug 22 '23 21:08 edublancas