`soda analyze` fails on first run against postgresql
Describe the bug
I tried to setup a new postgres warehouse with soda-sql cli and then run soda analyze against it.
To Reproduce Steps to reproduce the behavior:
python3 -m venv .venv-sodasource .venv-soda/bin/activatepip install soda-sql-postgresql soda-sql-snowflakesoda create postgres -d thedb -u theuser- Update
~/.soda/env_vars.ymlwith credentials soda analyze
Context
PostgreSQL version: 10
warehouse.yml:
name: postgres
connection:
type: postgres
host: the_host
port: '5432'
username: env_var(POSTGRES_USERNAME)
password: env_var(POSTGRES_PASSWORD)
database: thedb
schema: public
Log:
% soda soda analyze
| 2.1.0b18
| Analyzing warehouse.yml ...
| Querying warehouse for tables
| Directory tables already exists
| Executing SQL query:
SELECT table_name
FROM information_schema.tables
WHERE lower(table_schema)='public'
| SQL took 0:00:00.668961
| Executing SQL query:
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE lower(table_name) = 'dual'
AND table_catalog = 'pmx'
AND table_schema = 'public'
| SQL took 0:00:00.377569
| Executing SQL query:
SELECT
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+$' THEN 1 END) AS number_whole,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+\.[0-9]+$' THEN 1 END) AS number_decimal_point,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+,[0-9]+$' THEN 1 END) AS number_decimal_comma,
COUNT(CASE WHEN "dummy" ~* '^\-?\d+([\.,]\d+)? ?%$' THEN 1 END) AS number_percentage,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([\.][0-9]+)? ?%$' THEN 1 END) AS number_percentage_point,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([,][0-9]+)? ?%$' THEN 1 END) AS number_percentage_comma,
COUNT(CASE WHEN "dummy" ~* '^(\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?$' THEN 1 END) AS number_money_usd,
COUNT(CASE WHEN "dummy" ~* '^(€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?$' THEN 1 END) AS number_money_eur,
COUNT(CASE WHEN "dummy" ~* '^(£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?$' THEN 1 END) AS number_money_gbp,
COUNT(CASE WHEN "dummy" ~* '^(¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?$' THEN 1 END) AS number_money_rmb,
COUNT(CASE WHEN "dummy" ~* '^(CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?$' THEN 1 END) AS number_money_chf,
COUNT(CASE WHEN "dummy" ~* '^(((\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?)|((€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?)|((£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?)|((¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?)|((CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?))$' THEN 1 END) AS number_money,
COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|[12][0-9]|3[01])[-\./]([1-9]|0[1-9]|1[012])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_eu,
COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|1[012])[-\./]([1-9]|0[1-9]|[12][0-9]|3[01])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_us,
COUNT(CASE WHEN "dummy" ~* '^(19|20)[0-9][0-9][-\./]?([1-9]|0[1-9]|1[012])[-\./]?([1-9]|0[1-9]|[12][0-9]|3[01])' THEN 1 END) AS date_inverse,
COUNT(CASE WHEN "dummy" ~* '^([01][0-9]|2[0-3]):([0-5][0-9])$' THEN 1 END) AS time_24h,
COUNT(CASE WHEN "dummy" ~* '^(1[0-2]|0?[1-9]):[0-5][0-9]$' THEN 1 END) AS time_12h,
COUNT(CASE WHEN "dummy" ~* '([0-9]|1[0-9]|2[0-4])[:-]([0-9]|[0-5][0-9])([:-]([0-9]|[0-5][0-9])(,[0-9]+)?)?$' THEN 1 END) AS time,
COUNT(CASE WHEN "dummy" ~* '^([1-9][0-9]{3}-((0[1-9]|1[0-2])-(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])-(29|30)|(0[13578]|1[02])-31)|([1-9][0-9](0[48]|[2468][048]|[13579][26])|([2468][048]|[13579][26])00)-02-29)T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?(Z|[+-][01][0-9]:[0-5][0-9])?$' THEN 1 END) AS date_iso_8601,
COUNT(CASE WHEN "dummy" ~* '^[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}$' THEN 1 END) AS uuid,
COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ip_address,
COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ipv4_address,
COUNT(CASE WHEN "dummy" ~* '^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$' THEN 1 END) AS ipv6_address,
COUNT(CASE WHEN "dummy" ~* '^[A-Za-z0-9.-_%]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$' THEN 1 END) AS email,
COUNT(CASE WHEN "dummy" ~* '^((\+[0-9]{1,2}\s)?\(?[0-9]{3}\)?[\s.-])?[0-9]{3}[\s.-][0-9]{4}$' THEN 1 END) AS phone_number,
COUNT(CASE WHEN "dummy" ~* '^[0-9]{14}|[0-9]{15}|[0-9]{16}|[0-9]{17}|[0-9]{18}|[0-9]{19}|([0-9]{4}-){3}[0-9]{4}|([0-9]{4} ){3}[0-9]{4}$' THEN 1 END) AS credit_card_number,
COUNT("dummy")
FROM (SELECT * FROM "public"."dual" LIMIT 1000) T
| Exception: invalid regular expression: quantifier operand invalid
Traceback (most recent call last):
File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/cli/cli.py", line 250, in analyze
dataset_analyze_results = dataset_analyzer.analyze(warehouse, table_name)
File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/dataset_analyzer.py", line 72, in analyze
row = warehouse.sql_fetchone(
File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/scan/warehouse.py", line 29, in sql_fetchone
return sql_fetchone(self.connection, sql)
File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/scan/db.py", line 22, in sql_fetchone
return sql_fetchone_description(connection, sql)[0]
File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/scan/db.py", line 35, in sql_fetchone_description
cursor.execute(sql)
psycopg2.errors.InvalidRegularExpression: invalid regular expression: quantifier operand invalid
OS: MacOS 11.5.2
Python Version: Python 3.9.7
Soda SQL Version:
soda-sql-core==2.1.0b18
soda-sql-postgresql==2.1.0b18
soda-sql-snowflake==2.1.0b18
Warehouse Type: postgres
create table dual
(
dummy varchar
);
SELECT
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+$' THEN 1 END) AS number_whole,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+\.[0-9]+$' THEN 1 END) AS number_decimal_point,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+,[0-9]+$' THEN 1 END) AS number_decimal_comma,
COUNT(CASE WHEN "dummy" ~* '^\-?\d+([\.,]\d+)? ?%$' THEN 1 END) AS number_percentage,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([\.][0-9]+)? ?%$' THEN 1 END) AS number_percentage_point,
COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([,][0-9]+)? ?%$' THEN 1 END) AS number_percentage_comma,
COUNT(CASE WHEN "dummy" ~* '^(\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?$' THEN 1 END) AS number_money_usd,
COUNT(CASE WHEN "dummy" ~* '^(€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?$' THEN 1 END) AS number_money_eur,
COUNT(CASE WHEN "dummy" ~* '^(£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?$' THEN 1 END) AS number_money_gbp,
COUNT(CASE WHEN "dummy" ~* '^(¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?$' THEN 1 END) AS number_money_rmb,
COUNT(CASE WHEN "dummy" ~* '^(CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?$' THEN 1 END) AS number_money_chf,
COUNT(CASE WHEN "dummy" ~* '^(((\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?)|((€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?)|((£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?)|((¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?)|((CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?))$' THEN 1 END) AS number_money,
COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|[12][0-9]|3[01])[-\./]([1-9]|0[1-9]|1[012])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_eu,
COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|1[012])[-\./]([1-9]|0[1-9]|[12][0-9]|3[01])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_us,
COUNT(CASE WHEN "dummy" ~* '^(19|20)[0-9][0-9][-\./]?([1-9]|0[1-9]|1[012])[-\./]?([1-9]|0[1-9]|[12][0-9]|3[01])' THEN 1 END) AS date_inverse,
COUNT(CASE WHEN "dummy" ~* '^([01][0-9]|2[0-3]):([0-5][0-9])$' THEN 1 END) AS time_24h,
COUNT(CASE WHEN "dummy" ~* '^(1[0-2]|0?[1-9]):[0-5][0-9]$' THEN 1 END) AS time_12h,
COUNT(CASE WHEN "dummy" ~* '([0-9]|1[0-9]|2[0-4])[:-]([0-9]|[0-5][0-9])([:-]([0-9]|[0-5][0-9])(,[0-9]+)?)?$' THEN 1 END) AS time,
COUNT(CASE WHEN "dummy" ~* '^([1-9][0-9]{3}-((0[1-9]|1[0-2])-(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])-(29|30)|(0[13578]|1[02])-31)|([1-9][0-9](0[48]|[2468][048]|[13579][26])|([2468][048]|[13579][26])00)-02-29)T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?(Z|[+-][01][0-9]:[0-5][0-9])?$' THEN 1 END) AS date_iso_8601,
COUNT(CASE WHEN "dummy" ~* '^[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}$' THEN 1 END) AS uuid,
COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ip_address,
COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ipv4_address,
COUNT(CASE WHEN "dummy" ~* '^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$' THEN 1 END) AS ipv6_address,
COUNT(CASE WHEN "dummy" ~* '^[A-Za-z0-9.-_%]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$' THEN 1 END) AS email,
COUNT(CASE WHEN "dummy" ~* '^((\+[0-9]{1,2}\s)?\(?[0-9]{3}\)?[\s.-])?[0-9]{3}[\s.-][0-9]{4}$' THEN 1 END) AS phone_number,
COUNT(CASE WHEN "dummy" ~* '^[0-9]{14}|[0-9]{15}|[0-9]{16}|[0-9]{17}|[0-9]{18}|[0-9]{19}|([0-9]{4}-){3}[0-9]{4}|([0-9]{4} ){3}[0-9]{4}$' THEN 1 END) AS credit_card_number,
COUNT("dummy")
FROM (SELECT * FROM "public"."dual" LIMIT 1000) T;
Ran this on PostgreSQL 9.6.17 - but no errors. I'll try with 10.x to see if I can reproduce it.
I just went through the initial setup as well but used docker pull postgres with v14.0. Everything went pretty smoothly - no errors like above so probably something specific to v10.
soda-sql-core 2.1.0b18
soda-sql-postgresql 2.1.0b18
Have you tried any other version above 10?
I haven't yet, but I have asked our DBA for a copy of any PG configs so I can share. Might prove informative.
quick update: our PG has standard_conforming_strings = off, which — according to the docs:
https://www.postgresql.org/docs/10/runtime-config-compatible.html
standard_conforming_strings (boolean)This controls whether ordinary string literals ('...') treat backslashes literally, as specified in the SQL standard. Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off). Applications can check this parameter to determine how string literals will be processed. The presence of this parameter can also be taken as an indication that the escape string syntax (E'...') is supported. Escape string syntax (Section 4.1.2.2) should be used if an application desires backslashes to be treated as escape characters.
HTH
I tried with 10.18 with standard_conforming_strings = off and running the analyse didn't resulting error for the table:
create table dual
(
dummy varchar
);
I'll try to change the regex strings to E`` syntax may be that will help. I'll keep this issue updated.
FYI - AFAICS, it is not recommended to set it to off.