Profiling Failed in Hive Ingestion with "scheme: sparksql"
Describe the bug We use an AWS EMR cluster with AWS Glue as the Hive/Spark metastore. When I execute this ingestion, I get the errors below; it looks like a problem with the double quotes around the temporary table name:
```
[2022-05-11 08:30:32,774] ERROR {datahub.ingestion.source.ge_data_profiler:844} - Encountered exception while profiling db1.table1
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
  File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/pyhive/hive.py", line 479, in execute
    _check_status(response)
  File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/pyhive/hive.py", line 609, in _check_status
    raise OperationalError(response)
pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '"ge_temp_7318248a"' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)

== SQL ==
CREATE TEMPORARY TABLE "ge_temp_7318248a" AS SELECT *
-----------------------^^^
FROM `db1`.`table1`
 LIMIT 10
:36:35',
```
To Reproduce Hive recipe:

```yaml
source:
  type: hive
  config:
    host_port: 'sparksql_uri'
    scheme: 'sparksql'
    profiling:
      enabled: true
      allow_deny_patterns:
        allow:
          - db1.table1
      limit: 10
    profile_pattern:
      allow:
        - db1.table2
    schema_pattern:
      allow:
        - db1
sink:
  type: datahub-rest
  config:
    server: 'datahub_uri'
```
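For reference, the ParseException above can also be reproduced outside of DataHub with pyhive alone. This is a hypothetical minimal sketch; the host and port are placeholders, not values from this report:

```python
# Hypothetical reproduction of the failing statement through pyhive, the
# same client the traceback above goes through. Host/port are placeholders.
from pyhive import hive

conn = hive.connect(host="spark-thrift-server.example.com", port=10000)
cursor = conn.cursor()

# The double-quoted temp-table name is what Spark SQL's parser rejects; this
# raises pyhive.exc.OperationalError wrapping the ParseException shown above.
cursor.execute(
    'CREATE TEMPORARY TABLE "ge_temp_7318248a" AS '
    "SELECT * FROM `db1`.`table1` LIMIT 10"
)
```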
Expected behavior Profiling succeeds.
@duwanqiebi
Can you please share the complete error stack trace? It will be super helpful.
@duwanqiebi
DataHub uses Great Expectations for profiling, and it seems that Great Expectations does not handle the sparksql dialect the way it handles the hive dialect, so an incorrect SQL statement is formed: the temporary table name is wrapped in double quotes, whereas neither Hive nor Spark SQL wants quotes there, just as you anticipated.
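To make the quoting difference concrete, here is a minimal sketch (assuming the sqlalchemy and pyhive packages from the traceback above are installed) comparing how SQLAlchemy's generic ANSI preparer and pyhive's Hive preparer quote the same temporary table name:

```python
# Minimal sketch of the identifier-quoting difference between dialects.
from sqlalchemy.engine.default import DefaultDialect
from pyhive.sqlalchemy_hive import HiveDialect

name = "ge_temp_7318248a"

# Generic ANSI dialect: double quotes -- exactly what Spark SQL's parser rejects.
print(DefaultDialect().identifier_preparer.quote_identifier(name))
# "ge_temp_7318248a"

# pyhive's Hive dialect: backticks -- accepted by both Hive and Spark SQL.
print(HiveDialect().identifier_preparer.quote_identifier(name))
# `ge_temp_7318248a`
```

If the sparksql scheme ends up on a dialect without Hive-style quoting rules, the CREATE TEMPORARY TABLE statement gets the double-quoted name seen in the error.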
Also, it is mentioned in Considerations when using AWS Glue Data Catalog that temporary tables are not supported, so profiling may not work even if the CREATE TEMPORARY TABLE SQL is fixed somehow.
Not an ideal solution, but have you tried profiling without specifying the limit? Does it work?
Thank you very much. I'll do it another way.
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
Hi @duwanqiebi, did you end up finding a workaround for this?
Sorry for my late reply. We're using HiveServer2 (the default scheme) instead of the Spark Thrift Server (scheme = sparksql).