DataCleaner icon indicating copy to clipboard operation
DataCleaner copied to clipboard

NullCheckFilter: no rows processed in large oracle database

Open jhorcicka opened this issue 9 years ago • 6 comments

In a simple job consisting of "oracle datastore", "null check filter" and "csv writer (or value distribution)" where the table contains "many" rows (cca 200000), the results are only produced when both branches of NullCheckFilter are used.

When I use only "NOT_NULL" then there is no result.

I tried that on the same remote oracle database but on a different table with only a few records (cca 50) and it worked fine. I also tried that on my/local postgres datastore with 300000 rows and it worked fine as well. So, it seems the problem is related to oracle database table with many records.

EDIT: Additional observation, if "Consider empty string as null" is not checked (being false) then it all works fine. But with "true" it produces nothing (without using both branches).

null-filter-issue-01

null-filter-issue-02

null-filter-issue-04

jhorcicka avatar Jul 06 '16 14:07 jhorcicka

My guess would be that this is either some wild race condition, or a MetaModel query optimization issue, but we probably need more research first.

@kaspersorensen have you ever seen something as fantastically weird as this?

LosD avatar Jul 06 '16 14:07 LosD

With the blackouts, I guess there's confidential information in there, but if you can somehow scrub the execution log for that information, it would help at getting more eyes on it (preferably with column names changed to i.e. col_1, col_2, col_3 instead of all being named xxxxx,xxxxx,xxxxx so we can cross check column names in different rows)

LosD avatar Jul 06 '16 14:07 LosD

I haven't seen it either no. And I agree that most likely this is related to the query optimizable nature of null check filter.

If you can reproduce then try to turn on Debug logging for org.datacleaner.job.runner to see what query gets fired in the end.

kaspersorensen avatar Jul 06 '16 14:07 kaspersorensen

execution_log filter_config value_distribution_config

jhorcicka avatar Jul 06 '16 14:07 jhorcicka

Yeah okay, not much gained there. Probably better use @kaspersorensen's suggestion.

LosD avatar Jul 06 '16 14:07 LosD

I am not going to post the whole debug output but the problem is indeed in the final SQL query which ends with:

... FROM <myTableName> WHERE <dateColumn> IS NOT NULL AND (<dateColumn> <> '' OR <dateColumn> IS NULL)

where the last part

AND (<dateColumn> <> '' OR <dateColumn> IS NULL)

is added by setting the checkbox condition "Consider empty string as null".

jhorcicka avatar Jul 07 '16 07:07 jhorcicka