soda-sql icon indicating copy to clipboard operation
soda-sql copied to clipboard

`soda analyze` fails on first run against postgresql

Open sivy opened this issue 4 years ago • 5 comments

Describe the bug I tried to setup a new postgres warehouse with soda-sql cli and then run soda analyze against it.

To Reproduce Steps to reproduce the behavior:

  1. python3 -m venv .venv-soda
  2. source .venv-soda/bin/activate
  3. pip install soda-sql-postgresql soda-sql-snowflake
  4. soda create postgres -d thedb -u theuser
  5. Update ~/.soda/env_vars.yml with credentials
  6. soda analyze

Context

PostgreSQL version: 10

warehouse.yml:

name: postgres
connection:
  type: postgres
  host: the_host
  port: '5432'
  username: env_var(POSTGRES_USERNAME)
  password: env_var(POSTGRES_PASSWORD)
  database: thedb
  schema: public

Log:

% soda soda analyze                                                        
  | 2.1.0b18
  | Analyzing warehouse.yml ...
  | Querying warehouse for tables
  | Directory tables already exists
  | Executing SQL query: 
SELECT table_name 
FROM information_schema.tables 
WHERE lower(table_schema)='public'
  | SQL took 0:00:00.668961
  | Executing SQL query: 
SELECT column_name, data_type, is_nullable 
FROM information_schema.columns 
WHERE lower(table_name) = 'dual' 
  AND table_catalog = 'pmx' 
  AND table_schema = 'public'
  | SQL took 0:00:00.377569
  | Executing SQL query: 
SELECT 
  COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+$' THEN 1 END) AS number_whole,
  COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+\.[0-9]+$' THEN 1 END) AS number_decimal_point,
  COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+,[0-9]+$' THEN 1 END) AS number_decimal_comma,
  COUNT(CASE WHEN "dummy" ~* '^\-?\d+([\.,]\d+)? ?%$' THEN 1 END) AS number_percentage,
  COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([\.][0-9]+)? ?%$' THEN 1 END) AS number_percentage_point,
  COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([,][0-9]+)? ?%$' THEN 1 END) AS number_percentage_comma,
  COUNT(CASE WHEN "dummy" ~* '^(\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?$' THEN 1 END) AS number_money_usd,
  COUNT(CASE WHEN "dummy" ~* '^(€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?$' THEN 1 END) AS number_money_eur,
  COUNT(CASE WHEN "dummy" ~* '^(£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?$' THEN 1 END) AS number_money_gbp,
  COUNT(CASE WHEN "dummy" ~* '^(¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?$' THEN 1 END) AS number_money_rmb,
  COUNT(CASE WHEN "dummy" ~* '^(CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?$' THEN 1 END) AS number_money_chf,
  COUNT(CASE WHEN "dummy" ~* '^(((\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?)|((€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?)|((£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?)|((¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?)|((CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?))$' THEN 1 END) AS number_money,
  COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|[12][0-9]|3[01])[-\./]([1-9]|0[1-9]|1[012])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_eu,
  COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|1[012])[-\./]([1-9]|0[1-9]|[12][0-9]|3[01])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_us,
  COUNT(CASE WHEN "dummy" ~* '^(19|20)[0-9][0-9][-\./]?([1-9]|0[1-9]|1[012])[-\./]?([1-9]|0[1-9]|[12][0-9]|3[01])' THEN 1 END) AS date_inverse,
  COUNT(CASE WHEN "dummy" ~* '^([01][0-9]|2[0-3]):([0-5][0-9])$' THEN 1 END) AS time_24h,
  COUNT(CASE WHEN "dummy" ~* '^(1[0-2]|0?[1-9]):[0-5][0-9]$' THEN 1 END) AS time_12h,
  COUNT(CASE WHEN "dummy" ~* '([0-9]|1[0-9]|2[0-4])[:-]([0-9]|[0-5][0-9])([:-]([0-9]|[0-5][0-9])(,[0-9]+)?)?$' THEN 1 END) AS time,
  COUNT(CASE WHEN "dummy" ~* '^([1-9][0-9]{3}-((0[1-9]|1[0-2])-(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])-(29|30)|(0[13578]|1[02])-31)|([1-9][0-9](0[48]|[2468][048]|[13579][26])|([2468][048]|[13579][26])00)-02-29)T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?(Z|[+-][01][0-9]:[0-5][0-9])?$' THEN 1 END) AS date_iso_8601,
  COUNT(CASE WHEN "dummy" ~* '^[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}$' THEN 1 END) AS uuid,
  COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ip_address,
  COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ipv4_address,
  COUNT(CASE WHEN "dummy" ~* '^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$' THEN 1 END) AS ipv6_address,
  COUNT(CASE WHEN "dummy" ~* '^[A-Za-z0-9.-_%]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$' THEN 1 END) AS email,
  COUNT(CASE WHEN "dummy" ~* '^((\+[0-9]{1,2}\s)?\(?[0-9]{3}\)?[\s.-])?[0-9]{3}[\s.-][0-9]{4}$' THEN 1 END) AS phone_number,
  COUNT(CASE WHEN "dummy" ~* '^[0-9]{14}|[0-9]{15}|[0-9]{16}|[0-9]{17}|[0-9]{18}|[0-9]{19}|([0-9]{4}-){3}[0-9]{4}|([0-9]{4} ){3}[0-9]{4}$' THEN 1 END) AS credit_card_number,
  COUNT("dummy") 
FROM (SELECT * FROM "public"."dual" LIMIT 1000) T
  | Exception: invalid regular expression: quantifier operand invalid
Traceback (most recent call last):
  File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/cli/cli.py", line 250, in analyze
    dataset_analyze_results = dataset_analyzer.analyze(warehouse, table_name)
  File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/dataset_analyzer.py", line 72, in analyze
    row = warehouse.sql_fetchone(
  File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/scan/warehouse.py", line 29, in sql_fetchone
    return sql_fetchone(self.connection, sql)
  File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/scan/db.py", line 22, in sql_fetchone
    return sql_fetchone_description(connection, sql)[0]
  File "/Volumes/Projects/DANG/clarity/.venv-soda/lib/python3.9/site-packages/sodasql/scan/db.py", line 35, in sql_fetchone_description
    cursor.execute(sql)
psycopg2.errors.InvalidRegularExpression: invalid regular expression: quantifier operand invalid

OS: MacOS 11.5.2

Python Version: Python 3.9.7

Soda SQL Version:

soda-sql-core==2.1.0b18
soda-sql-postgresql==2.1.0b18
soda-sql-snowflake==2.1.0b18

Warehouse Type: postgres

sivy avatar Oct 08 '21 17:10 sivy


create table dual
(
   dummy  varchar

);


SELECT
    COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+$' THEN 1 END) AS number_whole,
    COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+\.[0-9]+$' THEN 1 END) AS number_decimal_point,
    COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+,[0-9]+$' THEN 1 END) AS number_decimal_comma,
    COUNT(CASE WHEN "dummy" ~* '^\-?\d+([\.,]\d+)? ?%$' THEN 1 END) AS number_percentage,
    COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([\.][0-9]+)? ?%$' THEN 1 END) AS number_percentage_point,
    COUNT(CASE WHEN "dummy" ~* '^\-?[0-9]+([,][0-9]+)? ?%$' THEN 1 END) AS number_percentage_comma,
    COUNT(CASE WHEN "dummy" ~* '^(\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?$' THEN 1 END) AS number_money_usd,
    COUNT(CASE WHEN "dummy" ~* '^(€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?$' THEN 1 END) AS number_money_eur,
    COUNT(CASE WHEN "dummy" ~* '^(£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?$' THEN 1 END) AS number_money_gbp,
    COUNT(CASE WHEN "dummy" ~* '^(¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?$' THEN 1 END) AS number_money_rmb,
    COUNT(CASE WHEN "dummy" ~* '^(CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?$' THEN 1 END) AS number_money_chf,
    COUNT(CASE WHEN "dummy" ~* '^(((\$)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(\$|usd|USD)?)|((€)? ?(\-)?([0-9]+[ ])*([0-9]+)(,[0-9]+)? ?(€|eur|EUR)?)|((£)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(£|gbp|GBP)?)|((¥)? ?(\-)?([0-9]+[,])*([0-9]+)(\.[0-9]+)? ?(¥|rmb|RMB)?)|((CHf)? ?(\-)?([0-9]+[''])*([0-9]+)(\.[0-9]+)? ?(CHf|chf|CHF)?))$' THEN 1 END) AS number_money,
    COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|[12][0-9]|3[01])[-\./]([1-9]|0[1-9]|1[012])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_eu,
    COUNT(CASE WHEN "dummy" ~* '^([1-9]|0[1-9]|1[012])[-\./]([1-9]|0[1-9]|[12][0-9]|3[01])[-\./](19|20)?[0-9][0-9]' THEN 1 END) AS date_us,
    COUNT(CASE WHEN "dummy" ~* '^(19|20)[0-9][0-9][-\./]?([1-9]|0[1-9]|1[012])[-\./]?([1-9]|0[1-9]|[12][0-9]|3[01])' THEN 1 END) AS date_inverse,
    COUNT(CASE WHEN "dummy" ~* '^([01][0-9]|2[0-3]):([0-5][0-9])$' THEN 1 END) AS time_24h,
    COUNT(CASE WHEN "dummy" ~* '^(1[0-2]|0?[1-9]):[0-5][0-9]$' THEN 1 END) AS time_12h,
    COUNT(CASE WHEN "dummy" ~* '([0-9]|1[0-9]|2[0-4])[:-]([0-9]|[0-5][0-9])([:-]([0-9]|[0-5][0-9])(,[0-9]+)?)?$' THEN 1 END) AS time,
    COUNT(CASE WHEN "dummy" ~* '^([1-9][0-9]{3}-((0[1-9]|1[0-2])-(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])-(29|30)|(0[13578]|1[02])-31)|([1-9][0-9](0[48]|[2468][048]|[13579][26])|([2468][048]|[13579][26])00)-02-29)T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?(Z|[+-][01][0-9]:[0-5][0-9])?$' THEN 1 END) AS date_iso_8601,
    COUNT(CASE WHEN "dummy" ~* '^[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}$' THEN 1 END) AS uuid,
    COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ip_address,
    COUNT(CASE WHEN "dummy" ~* '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' THEN 1 END) AS ipv4_address,
    COUNT(CASE WHEN "dummy" ~* '^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$' THEN 1 END) AS ipv6_address,
    COUNT(CASE WHEN "dummy" ~* '^[A-Za-z0-9.-_%]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$' THEN 1 END) AS email,
    COUNT(CASE WHEN "dummy" ~* '^((\+[0-9]{1,2}\s)?\(?[0-9]{3}\)?[\s.-])?[0-9]{3}[\s.-][0-9]{4}$' THEN 1 END) AS phone_number,
    COUNT(CASE WHEN "dummy" ~* '^[0-9]{14}|[0-9]{15}|[0-9]{16}|[0-9]{17}|[0-9]{18}|[0-9]{19}|([0-9]{4}-){3}[0-9]{4}|([0-9]{4} ){3}[0-9]{4}$' THEN 1 END) AS credit_card_number,
    COUNT("dummy")
FROM (SELECT * FROM "public"."dual" LIMIT 1000) T;

Ran this on PostgreSQL 9.6.17 - but no errors. I'll try with 10.x to see if I can reproduce it.

vijaykiran avatar Oct 11 '21 12:10 vijaykiran

I just went through the initial setup as well but used docker pull postgres with v14.0. Everything went pretty smoothly - no errors like above so probably something specific to v10.

soda-sql-core 2.1.0b18 soda-sql-postgresql 2.1.0b18

Have you tried any other version above 10?

jasonicarter avatar Oct 11 '21 13:10 jasonicarter

I haven't yet, but I have asked our DBA for a copy of any PG configs so I can share. Might prove informative.

sivy avatar Oct 11 '21 15:10 sivy

quick update: our PG has standard_conforming_strings = off, which — according to the docs:

https://www.postgresql.org/docs/10/runtime-config-compatible.html

standard_conforming_strings (boolean) This controls whether ordinary string literals ('...') treat backslashes literally, as specified in the SQL standard. Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off). Applications can check this parameter to determine how string literals will be processed. The presence of this parameter can also be taken as an indication that the escape string syntax (E'...') is supported. Escape string syntax (Section 4.1.2.2) should be used if an application desires backslashes to be treated as escape characters.

HTH

sivy avatar Oct 11 '21 18:10 sivy

I tried with 10.18 with standard_conforming_strings = off and running the analyse didn't resulting error for the table:

create table dual
(
   dummy  varchar

);

I'll try to change the regex strings to E`` syntax may be that will help. I'll keep this issue updated.

FYI - AFAICS, it is not recommended to set it to off.

vijaykiran avatar Oct 13 '21 10:10 vijaykiran