
Profiling Failed in Hive Ingestion with "scheme: sparksql"

duwanqiebi opened this issue on May 11, 2022 · 7 comments

Describe the bug

We use an AWS EMR cluster with AWS Glue as the Hive/Spark metastore. When I execute this ingestion, I get the errors below; it looks like there is a problem with the double quotes around the temporary table name:

    [2022-05-11 08:30:32,774] ERROR {datahub.ingestion.source.ge_data_profiler:844} - Encountered exception while profiling db1.table1
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
        self.dialect.do_execute(
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/pyhive/hive.py", line 479, in execute
        _check_status(response)
      File "/tmp/datahub/ingest/venv-10f76e4f-22d1-4e82-ac86-d39bd2758741/lib/python3.9/site-packages/pyhive/hive.py", line 609, in _check_status
        raise OperationalError(response)
    pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(statusCode=3,
    infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
    extraneous input '"ge_temp_7318248a"' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE',
    'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN',
    'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK',
    'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT',
    'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE',
    'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA',
    'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS',
    'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE',
    'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT',
    'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN',
    'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP',
    'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH',
    'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST',
    'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL',
    'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE',
    'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS',
    'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE',
    'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING',
    'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER',
    'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT',
    'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA',
    'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS',
    'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY',
    'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY,
    'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM',
    'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE',
    'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS',
    'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)

    == SQL ==
    CREATE TEMPORARY TABLE "ge_temp_7318248a" AS SELECT *
    -----------------------^^^
    FROM `db1`.`table1`
     LIMIT 10

To Reproduce

Hive recipe:

source:
    type: hive
    config:
        host_port: 'sparksql_uri'
        scheme: 'sparksql'
        profiling:
            enabled: true
            allow_deny_patterns:
                allow:
                    - db1.table1
            limit: 10
        profile_pattern:
            allow:
                - db1.table2
        schema_pattern:
            allow:
                - db1
sink:
    type: datahub-rest
    config:
        server: 'datahub_uri'
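As an aside on the recipe itself: in the hive source, table-level profiling filters are configured via `profile_pattern`, and `limit` sits under `profiling`; the `allow_deny_patterns` block shown above does not appear to be a recognized profiling option. A cleaned-up sketch, assuming the intent is to profile `db1.table1` (field names per the DataHub hive source docs; verify against your DataHub version):

```yaml
source:
    type: hive
    config:
        host_port: 'sparksql_uri'
        scheme: 'sparksql'
        schema_pattern:
            allow:
                - db1
        profile_pattern:
            allow:
                - db1.table1
        profiling:
            enabled: true
            limit: 10
```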

Expected behavior

Profiling succeeds.

duwanqiebi avatar May 11 '22 08:05 duwanqiebi

@duwanqiebi

Can you please share the complete error stack trace? It would be super helpful.

mayurinehate avatar May 19 '22 08:05 mayurinehate


@duwanqiebi DataHub uses Great Expectations for profiling, and it seems Great Expectations does not handle the sparksql dialect the way it handles the hive dialect. This leads to an incorrect SQL statement: the temporary table name is wrapped in double quotes, while both Hive and Spark SQL expect backticks (or no quoting), just as you anticipated.
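To illustrate the mismatch, here is a minimal sketch (not DataHub's or Great Expectations' actual code; `quote_identifier` is a hypothetical helper) of how identifier quoting differs between the ANSI default and the Hive/Spark SQL dialects:

```python
def quote_identifier(name: str, dialect: str) -> str:
    """Quote a SQL identifier for the given dialect (hypothetical helper).

    Hive and Spark SQL quote identifiers with backticks; the ANSI default
    uses double quotes, which Spark SQL's parser rejects as shown in the
    error above.
    """
    if dialect in ("hive", "sparksql"):
        return "`" + name.replace("`", "``") + "`"
    return '"' + name.replace('"', '""') + '"'

# The statement the profiler should have generated for Spark SQL:
stmt = "CREATE TEMPORARY TABLE {} AS SELECT * FROM `db1`.`table1` LIMIT 10".format(
    quote_identifier("ge_temp_7318248a", "sparksql")
)
print(stmt)
# CREATE TEMPORARY TABLE `ge_temp_7318248a` AS SELECT * FROM `db1`.`table1` LIMIT 10
```

With an ANSI dialect the same helper would produce `"ge_temp_7318248a"`, which is exactly the token Spark SQL's parser flags as extraneous input.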

Also, "Considerations when using AWS Glue Data Catalog" mentions that temporary tables are not supported. So profiling may not work even after the CREATE TEMPORARY TABLE SQL is somehow fixed.

Not an ideal solution, but have you tried profiling without specifying the limit? Does it work?

mayurinehate avatar May 23 '22 11:05 mayurinehate

Thank you very much. I'll do it another way.

duwanqiebi avatar May 24 '22 00:05 duwanqiebi

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Sep 15 '22 06:09 github-actions[bot]

Hi @duwanqiebi, did you end up finding a workaround for this?

hsheth2 avatar Sep 16 '22 18:09 hsheth2

> Hi @duwanqiebi, did you end up finding a workaround for this?

Sorry for my late reply. We're now using hive-server2 (the default scheme) instead of the Spark Thrift Server (scheme: sparksql).

duwanqiebi avatar Sep 19 '22 03:09 duwanqiebi
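For anyone landing here, the workaround amounts to pointing the recipe at a HiveServer2 endpoint and omitting the scheme override. A sketch (host and port are placeholders; 10000 is HiveServer2's conventional Thrift port):

```yaml
source:
    type: hive
    config:
        host_port: 'hiveserver2_host:10000'
        # omit `scheme` so the default hive scheme is used
        profiling:
            enabled: true
```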

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Oct 20 '22 02:10 github-actions[bot]