PyHive
get_table_names() returns incorrect list of table names when querying Spark SQL Thriftserver
Hi,
I found out that the method get_table_names() in the HiveDialect class in sqlalchemy_hive.py does not return the correct table names when querying the Spark SQL Thrift Server. The rows returned by SHOW TABLES are 3-tuples of the form (schema_name, table_name, isTemporary), so you need index 1 for the table name, but the code uses index 0. As a result it returns a list of duplicates of the schema name.
What can we do to fix this?
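For illustration, here is a minimal sketch of the mismatch, assuming the Spark row shape described above; the single-column Hive shape shown here is an assumption based on Hive's usual SHOW TABLES output, not something verified in this thread.

# Hypothetical illustration of the mismatch (not PyHive code).
# Hive's SHOW TABLES yields a single column per row (assumed shape):
hive_rows = [('orders',), ('users',)]
# Spark SQL's SHOW TABLES yields (schema_name, table_name, isTemporary):
spark_rows = [('default', 'orders', False), ('default', 'users', False)]

# Taking index 0 is fine for Hive but returns the schema name on Spark:
print([row[0] for row in spark_rows])   # ['default', 'default']
# Taking index 1 returns the table names on Spark:
print([row[1] for row in spark_rows])   # ['orders', 'users']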
Spark Thrift Server claims to be compatible with Hive, so I think the correct fix is to edit the Spark code. But if they don't want to match Hive, the next best option is probably to figure out a hard-coded list of column names compatible with all Spark and Hive versions. In my opinion, this is inferior since it pushes the problem down to every user of Spark.
For reference, after checking it out, the relevant line is: https://github.com/dropbox/PyHive/blob/44717f7b5f1b09d615a0e8d14c9c0ae15404c576/pyhive/sqlalchemy_hive.py#L303
Does Spark SQL return the same format for both SHOW TABLES IN {schema} and USE {schema}; SHOW TABLES;?
@mistercrunch They both return the same format.
@jingw Is it possible to add a third option [sparksql] to pyhive alongside [hive] and [presto]? I know it would mean heavily redundant code, but it seems necessary for Spark to include the database/schema name in the SHOW TABLES result, because Spark's temp tables don't have a schema, so they are returned for every SHOW TABLES query and need to be distinguished from the others.
Meanwhile, we can ask the Spark project to decide about the SHOW TABLES command and its Hive compatibility, whether to support it or not.
@jingw how about
return [row[-1] for row in connection.execute(query)] ?
Would you approve such a PR?
Nice, that sounds easier than what I was thinking :)
I guess that'll break if someone decides to add another column (maybe show tables includes the table comment some day). Oh well. Both approaches I can think of (column names, column index) have been problematic.
Re spark dialect, I don't currently use sparksql, so I'd rather not add extra modules or branching logic that I can't test.
@mistercrunch SHOW TABLES returns three fields, as I mentioned in the first comment (schema_name, table_name, isTemporary), so row[-1] will not return the table name.
I guess that leaves us with having a list of possible column names. #68 changed from row.tab_name to row[0] due to some incompatibility. We could crawl through the Hive/Spark source code for every possible name of this field. Alternatively we could branch on len(row), but that seems even more fragile. Ultimately everything's a workaround for the root inconsistency.
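For what it's worth, a minimal sketch of the column-name approach could look like the following; 'tab_name' is the Hive column mentioned via #68, while 'tableName' is only an assumption about Spark SQL's column name and would need to be verified against the Spark source.

# Hypothetical helper for the "list of possible column names" idea.
# 'tab_name' comes from Hive (see #68); 'tableName' is an assumed Spark name.
POSSIBLE_TABLE_NAME_KEYS = ('tab_name', 'tableName')

def extract_table_names(rows):
    """Pick the table-name column from SHOW TABLES rows by key, falling
    back to the first column when no known key is present."""
    names = []
    for row in rows:
        keys = row.keys() if hasattr(row, 'keys') else ()
        key = next((k for k in POSSIBLE_TABLE_NAME_KEYS if k in keys), None)
        names.append(row[key] if key is not None else row[0])
    return names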
That feels fragile / magical to me :/
Arguably we should be using the Metastore thrift client for the metadata fetching, but then we'd need to overload the connection string with extra parameters to connect to the Metastore. Not ideal either.
What about creating a new PyPI package for a SQLAlchemy dialect sparksql:// that would essentially derive from the hive:// dialect here and override this one method? It would allow it to diverge more in the future if needed.
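A minimal sketch of what that separate dialect could look like, assuming it derives from pyhive's HiveDialect and overrides only this method; the package and module names (e.g. pyhive_spark) are made up for illustration, and the query/quoting details would need to be checked against the real get_table_names.

# Hypothetical sparksql:// dialect deriving from the Hive dialect.
from sqlalchemy.dialects import registry
from pyhive.sqlalchemy_hive import HiveDialect

class SparkSQLDialect(HiveDialect):
    name = 'sparksql'

    def get_table_names(self, connection, schema=None, **kw):
        query = 'SHOW TABLES'
        if schema:
            query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
        # Spark returns (schema_name, table_name, isTemporary); take index 1.
        return [row[1] for row in connection.execute(query)]

# Registering under a made-up module path would let users write
# sparksql://host:10000/default in their SQLAlchemy URI:
registry.register('sparksql', 'pyhive_spark', 'SparkSQLDialect')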
So, what do we do now? :D
Can we merge the PR and close this bug?
bump
Based on the commits above, I changed a small part of ./superset/models/core.py. This is a workaround, but it works.
def get_all_table_names_in_schema(
    self,
    schema: str,
    cache: bool = False,
    cache_timeout: Optional[int] = None,
    force: bool = False,
) -> List[utils.DatasourceName]:
    """Parameters need to be passed as keyword arguments.

    For unused parameters, they are referenced in the
    cache_util.memoized_func decorator.

    :param schema: schema name
    :param cache: whether cache is enabled for the function
    :param cache_timeout: timeout in seconds for the cache
    :param force: whether to force refresh the cache
    :return: list of tables
    """
    try:
        # Workaround for table names in Spark Thrift Server:
        # SHOW TABLES returns (schema_name, table_name, isTemporary),
        # so take index 1 for the table and index 0 for the schema.
        if self.db_engine_spec == db_engine_specs.hive.HiveEngineSpec:
            engine = self.get_sqla_engine()
            if schema:
                rs = engine.execute('show tables in %s' % schema).fetchall()
            else:
                rs = engine.execute('show tables').fetchall()
            return [utils.DatasourceName(table=x[1], schema=x[0]) for x in rs]
        tables = self.db_engine_spec.get_table_names(
            database=self, inspector=self.inspector, schema=schema
        )
        return [
            utils.DatasourceName(table=table, schema=schema) for table in tables
        ]
    except Exception as ex:  # pylint: disable=broad-except
        logger.warning(ex)