
Profiling Failed in Hive Ingestion with "scheme: sparksql"

duwanqiebi opened this issue on May 11, 2022 · 7 comments

Describe the bug

We use an AWS EMR cluster with AWS Glue as the Hive/Spark metastore. When I execute this ingestion, I get the errors below; it looks like there is a problem with the double quotes around the temporary table name:

    [2022-05-11 08:30:32,774] ERROR {datahub.ingestion.source.ge_data_profiler:844} - Encountered exception while profiling db1.table1
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
        self.dialect.do_execute(
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/pyhive/hive.py", line 479, in execute
        _check_status(response)
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/pyhive/hive.py", line 609, in _check_status
        raise OperationalError(response)
    pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(statusCode=3,
    infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
    extraneous input '"ge_temp_7318248a"' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE',
    'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN',
    'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK',
    'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT',
    'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE',
    'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA',
    'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS',
    'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE',
    'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT',
    'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN',
    'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP',
    'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH',
    'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST',
    'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL',
    'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE',
    'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS',
    'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE',
    'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING',
    'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER',
    'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT',
    'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA',
    'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS',
    'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY',
    'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY,
    'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM',
    'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE',
    'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS',
    'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)

    == SQL ==
    CREATE TEMPORARY TABLE "ge_temp_7318248a" AS SELECT *
    -----------------------^^^
    FROM `db1`.`table1`
     LIMIT 10

To Reproduce

Hive recipe:

source:
    type: hive
    config:
        host_port: 'sparksql_uri'
        scheme: 'sparksql'
        profiling:
            enabled: true
            allow_deny_patterns:
                allow:
                    - db1.table1
            limit: 10
        profile_pattern:
            allow:
                - db1.table2
        schema_pattern:
            allow:
                - db1
sink:
    type: datahub-rest
    config:
        server: 'datahub_uri'
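As an aside on the recipe itself: in the hive source, table-level profiling filters are configured via `profile_pattern`, and `limit` sits under `profiling`; the `allow_deny_patterns` block shown above does not appear to be a recognized profiling option. A cleaned-up sketch, assuming the intent is to profile `db1.table1` (field names per the DataHub hive source docs; verify against your DataHub version):

```yaml
source:
    type: hive
    config:
        host_port: 'sparksql_uri'
        scheme: 'sparksql'
        schema_pattern:
            allow:
                - db1
        profile_pattern:
            allow:
                - db1.table1
        profiling:
            enabled: true
            limit: 10
```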

Expected behavior

Profiling succeeds.

duwanqiebi avatar May 11 '22 08:05 duwanqiebi

@duwanqiebi

Can you please share the complete error stack trace? It would be super helpful.

mayurinehate avatar May 19 '22 08:05 mayurinehate


@duwanqiebi DataHub uses Great Expectations for profiling, and it seems Great Expectations does not handle the sparksql dialect the way it handles the hive dialect. This leads to an incorrect SQL statement: the temporary table name is wrapped in double quotes, while both Hive and Spark SQL expect backticks (or no quoting), just as you anticipated.
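To illustrate the mismatch, here is a minimal sketch (not DataHub's or Great Expectations' actual code; `quote_identifier` is a hypothetical helper) of how identifier quoting differs between the ANSI default and the Hive/Spark SQL dialects:

```python
def quote_identifier(name: str, dialect: str) -> str:
    """Quote a SQL identifier for the given dialect (hypothetical helper).

    Hive and Spark SQL quote identifiers with backticks; the ANSI default
    uses double quotes, which Spark SQL's parser rejects as shown in the
    error above.
    """
    if dialect in ("hive", "sparksql"):
        return "`" + name.replace("`", "``") + "`"
    return '"' + name.replace('"', '""') + '"'

# The statement the profiler should have generated for Spark SQL:
stmt = "CREATE TEMPORARY TABLE {} AS SELECT * FROM `db1`.`table1` LIMIT 10".format(
    quote_identifier("ge_temp_7318248a", "sparksql")
)
print(stmt)
# CREATE TEMPORARY TABLE `ge_temp_7318248a` AS SELECT * FROM `db1`.`table1` LIMIT 10
```

With an ANSI dialect the same helper would produce `"ge_temp_7318248a"`, which is exactly the token Spark SQL's parser flags as extraneous input.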

Also, "Considerations when using AWS Glue Data Catalog" mentions that temporary tables are not supported. So profiling may not work even after the CREATE TEMPORARY TABLE SQL is somehow fixed.

Not an ideal solution, but have you tried profiling without specifying the limit? Does it work?

mayurinehate avatar May 23 '22 11:05 mayurinehate

Thank you very much. I'll do it another way.

duwanqiebi avatar May 24 '22 00:05 duwanqiebi

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Sep 15 '22 06:09 github-actions[bot]

Hi @duwanqiebi, did you end up finding a workaround for this?

hsheth2 avatar Sep 16 '22 18:09 hsheth2

> Hi @duwanqiebi, did you end up finding a workaround for this?

Sorry for my late reply. We're now using hive-server2 (the default scheme) instead of the Spark Thrift Server (scheme: sparksql).

duwanqiebi avatar Sep 19 '22 03:09 duwanqiebi
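For anyone landing here, the workaround amounts to pointing the recipe at a HiveServer2 endpoint and omitting the scheme override. A sketch (host and port are placeholders; 10000 is HiveServer2's conventional Thrift port):

```yaml
source:
    type: hive
    config:
        host_port: 'hiveserver2_host:10000'
        # omit `scheme` so the default hive scheme is used
        profiling:
            enabled: true
```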

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Oct 20 '22 02:10 github-actions[bot]