
[ADAP-873] [Regression] `1.6` does not work with `method: thrift` due to `pyhive`'s lack of `Cursor.fetchmany()` method

dataders opened this issue 2 years ago • 2 comments

Is this a regression in a recent version of dbt-spark?

  • [X] I believe this is a regression in dbt-spark functionality
  • [X] I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

reports & discussion

@sid-deshmukh originally opened https://github.com/dbt-labs/dbt-external-tables/issues/234, but I believe this issue to be with dbt-spark, not dbt-external-tables.

@timvw and @jelstongreen also reported in a #db-databricks-and-spark thread in Community Slack that they were experiencing similar issues

for reference, here's our internal dbt Labs Slack thread

stacktrace

Compiling fails with the following stacktrace. dbt calls .get_result_from_cursor(), which calls cursor.fetchall(), which PyHive ultimately routes to its Cursor._fetch_more() (pyhive/hive.py#L507), where it fails:

columns = [_unwrap_column(col, col_schema[1]) for col, col_schema in
           zip(response.results.columns, schema)]
Full stacktrace:
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 302, in exception_handler
    yield
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 257, in call_macro
    return macro(*args, **kwargs)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 52, in macro
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/base/impl.py", line 290, in execute
    return self.connections.execute(sql=sql, auto_begin=auto_begin, fetch=fetch, limit=limit)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/sql/connections.py", line 149, in execute
    table = self.get_result_from_cursor(cursor, limit)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/sql/connections.py", line 129, in get_result_from_cursor
    rows = cursor.fetchall()
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/spark/connections.py", line 197, in fetchall
    return self._cursor.fetchall()
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/common.py", line 137, in fetchall
    return list(iter(self.fetchone, None))
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/common.py", line 106, in fetchone
    self._fetch_while(lambda: not self._data and self._state !=
                      self._STATE_FINISHED)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/common.py", line 46, in _fetch_while
    self._fetch_more()
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/hive.py", line 481, in _fetch_more
    zip(response.results.columns, schema)]
TypeError: 'NoneType' object is not iterable
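The crash can be sketched in isolation. The following is a hypothetical minimal reproduction, not pyhive's actual code: if the Thrift response for a result-less statement carries no columns (so the value handed to `zip()` in `_fetch_more()` is `None`), the list comprehension raises the `TypeError` seen above.

```python
# Hypothetical reproduction of the failure mode: zip() over a None
# columns value raises TypeError, mirroring the shape of the failing
# comprehension in pyhive/hive.py.
def unwrap_columns(response_columns, schema):
    return [(col, col_schema[1])
            for col, col_schema in zip(response_columns, schema)]

# A normal response unwraps fine:
print(unwrap_columns(["a", "b"],
                     [("col_a", "STRING_TYPE"), ("col_b", "INT_TYPE")]))

# An empty result set (no columns returned) blows up:
try:
    unwrap_columns(None, [("col_a", "STRING_TYPE")])
except TypeError as exc:
    print(f"TypeError: {exc}")
```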

Expected/Previous Behavior

Things work (ostensibly because PyHive's cursor.fetch() does not invoke ._fetch_more() the way .fetchmany() does).

Steps To Reproduce

  1. dbt-spark 1.6.0
  2. using method: thrift
  3. doing any sort of jinja compilation (which is almost anything)
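For context, a minimal profiles.yml target exercising this code path might look like the following (a sketch; host, port, and schema values are placeholders, not taken from the report):

```yaml
# Hypothetical profiles.yml target; host/port/schema are placeholders.
spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift   # the connection method affected by this regression
      host: 127.0.0.1
      port: 10000
      schema: default
```

Running `dbt compile` against such a target should then hit the fetchall() path in the stacktrace above.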

Relevant log output

No response

Environment

- OS:
- Python:
- dbt-core (working version):
- dbt-spark (working version):
- dbt-core (regression version):
- dbt-spark (regression version):

Additional Context

Preventing this problem from ever happening again could be addressed by https://github.com/dbt-labs/dbt-core/issues/8471

dataders avatar Sep 06 '23 20:09 dataders

I have seen this happen with SparkSession as well when using the `show` command...

timvw avatar Sep 19 '23 10:09 timvw

Not sure if there's still interest in this, but looking into the PyHive code, it doesn't seem to handle queries with empty result sets correctly. I've forked and issued a PR here, but it seems the library has been pretty much unsupported for a few years now.
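For reference, the kind of guard such a fix might add (a hedged sketch under my own naming, not the actual PR; pyhive's real _fetch_more() differs):

```python
# Hypothetical defensive variant: return no rows when the server reports
# no result set, instead of zipping over None and raising TypeError.
def unwrap_columns_safe(response_columns, schema):
    if response_columns is None or schema is None:
        # Statements like REFRESH TABLE produce no result set.
        return []
    return [(col, col_schema[1])
            for col, col_schema in zip(response_columns, schema)]

print(unwrap_columns_safe(None, None))  # -> []
```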

With the changes, Jinja is able to compile and results are correctly received:

❯ dbt run-operation stage_external_sources --log-level debug --print
01:23:03  Running with dbt=1.6.7
01:23:03  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'profiles_dir': '/home/lmarcondes/.dbt', 'fail_fast': 'True', 'warn_error': 'True', 'log_path': '/home/lmarcondes/Documents/projects/votacao-2022/src/capivara-etl-models/capivara/logs', 'debug': 'False', 'version_check': 'True', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'log_format': 'default', 'static_parser': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'introspect': 'True', 'target_path': 'None', 'invocation_command': 'dbt run-operation stage_external_sources --log-level debug --print', 'send_anonymous_usage_stats': 'False'}
01:23:03  Registered adapter: spark=1.6.0
01:23:03  checksum: a051d2bc88277f3be74306f0393e0e8e6f29724fe11a36c13ebfccd4b87560d8, vars: {}, profile: , target: , version: 1.6.7
01:23:03  Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
01:23:03  Partial parsing enabled, no changes found, skipping parsing
01:23:03  Found 1 model, 5 sources, 0 exposures, 0 metrics, 557 macros, 0 groups, 0 semantic models
01:23:03  Acquiring new spark connection 'macro_stage_external_sources'
01:23:03  Spark adapter: NotImplemented: add_begin_query
01:23:03  Spark adapter: NotImplemented: commit
01:23:03  1 of 5 START external source default.caged_for
01:23:03  On "macro_stage_external_sources": cache miss for schema ".default", this is inefficient
01:23:03  Using spark connection "macro_stage_external_sources"
01:23:03  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "capivara", "target_name": "local", "connection_name": "macro_stage_external_sources"} */
show table extended in default like '*'
  
01:23:03  Opening a new connection, currently in state init
01:23:03  Spark adapter: Poll status: 2, query complete
01:23:03  SQL status: OK in 0.0 seconds
01:23:03  While listing relations in database=, schema=default, found: caged_exc, caged_for, caged_mov, links_2o_turno
01:23:03  1 of 5 (1) refresh table default.caged_for
01:23:03  Using spark connection "macro_stage_external_sources"
01:23:03  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "capivara", "target_name": "local", "connection_name": "macro_stage_external_sources"} */

                 
        refresh table default.caged_for
    
            
01:23:08  Spark adapter: Poll status: 1, sleeping
01:23:13  Spark adapter: Poll status: 1, sleeping
01:23:18  Spark adapter: Poll status: 1, sleeping
01:23:23  Spark adapter: Poll status: 1, sleeping
01:23:28  Spark adapter: Poll status: 1, sleeping
01:23:33  Spark adapter: Poll status: 1, sleeping
01:23:38  Spark adapter: Poll status: 1, sleeping
01:23:43  Spark adapter: Poll status: 1, sleeping
01:23:48  Spark adapter: Poll status: 1, sleeping
01:23:53  Spark adapter: Poll status: 1, sleeping
01:23:58  Spark adapter: Poll status: 1, sleeping
01:24:03  Spark adapter: Poll status: 1, sleeping
01:24:08  Spark adapter: Poll status: 1, sleeping
01:24:12  Spark adapter: Poll status: 2, query complete
01:24:12  SQL status: OK in 69.0 seconds
01:24:12  1 of 5 (1) OK
01:24:12  2 of 5 START external source default.caged_mov
01:24:12  2 of 5 (1) refresh table default.caged_mov
01:24:12  Using spark connection "macro_stage_external_sources"
01:24:12  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "capivara", "target_name": "local", "connection_name": "macro_stage_external_sources"} */

lmarcondes avatar Nov 02 '23 01:11 lmarcondes

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] avatar Apr 30 '24 01:04 github-actions[bot]

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions[bot] avatar May 07 '24 01:05 github-actions[bot]