PyHive
get_table_names() returns incorrect list of table names when querying Spark SQL Thriftserver
Hi,
I found out that the method get_table_names() in the HiveDialect class in sqlalchemy_hive.py does not return the correct table names when querying the Spark SQL Thrift Server. The rows returned by SHOW TABLES are 3-tuples of the form (schema_name, table_name, isTemporary), so you need index 1 for the table name, but the code uses index 0. As a result it returns a list of duplicates of the schema name.
What can we do to fix this?
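For illustration, here is a minimal sketch of the mismatch, assuming the Spark row shape described above; the single-column Hive shape shown here is an assumption based on Hive's usual SHOW TABLES output, not something verified in this thread.

# Hypothetical illustration of the mismatch (not PyHive code).
# Hive's SHOW TABLES yields a single column per row (assumed shape):
hive_rows = [('orders',), ('users',)]
# Spark SQL's SHOW TABLES yields (schema_name, table_name, isTemporary):
spark_rows = [('default', 'orders', False), ('default', 'users', False)]

# Taking index 0 is fine for Hive but returns the schema name on Spark:
print([row[0] for row in spark_rows])   # ['default', 'default']
# Taking index 1 returns the table names on Spark:
print([row[1] for row in spark_rows])   # ['orders', 'users']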
Spark Thrift Server claims to be compatible with Hive, so I think the correct fix is to edit the Spark code. But if they don't want to match Hive, the next best option is probably to figure out a hard-coded list of column names compatible with all Spark and Hive versions. In my opinion, this is inferior since it pushes the problem down to every user of Spark.
For reference, after checking it out, the relevant line is: https://github.com/dropbox/PyHive/blob/44717f7b5f1b09d615a0e8d14c9c0ae15404c576/pyhive/sqlalchemy_hive.py#L303
Does Spark SQL return the same format for both SHOW TABLES IN {schema} and USE {schema}; SHOW TABLES;?
@mistercrunch They both return the same format.
@jingw Is it possible to add a third option [sparksql] to pyhive alongside [hive] and [presto]? I know it would mean heavily redundant code, but it seems necessary for Spark to include the database/schema name in the SHOW TABLES result, because Spark's temp tables don't have a schema, so they are returned for every SHOW TABLES query and need to be distinguished from the others.
Meanwhile, we can ask the Spark project to decide about the SHOW TABLES command and its Hive compatibility, whether to support it or not.
@jingw how about
return [row[-1] for row in connection.execute(query)] ?
Would you approve such a PR?
Nice, that sounds easier than what I was thinking :)
I guess that'll break if someone decides to add another column (maybe show tables includes the table comment some day). Oh well. Both approaches I can think of (column names, column index) have been problematic.
Re spark dialect, I don't currently use sparksql, so I'd rather not add extra modules or branching logic that I can't test.
@mistercrunch SHOW TABLES returns three fields, as I mentioned in the first comment (schema_name, table_name, isTemporary), so row[-1] will not return the table name.
I guess that leaves us with having a list of possible column names. #68 changed from row.tab_name to row[0] due to some incompatibility. We could crawl through the Hive/Spark source code for every possible name of this field. Alternatively we could branch on len(row), but that seems even more fragile. Ultimately everything's a workaround for the root inconsistency.
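For what it's worth, a minimal sketch of the column-name approach could look like the following; 'tab_name' is the Hive column mentioned via #68, while 'tableName' is only an assumption about Spark SQL's column name and would need to be verified against the Spark source.

# Hypothetical helper for the "list of possible column names" idea.
# 'tab_name' comes from Hive (see #68); 'tableName' is an assumed Spark name.
POSSIBLE_TABLE_NAME_KEYS = ('tab_name', 'tableName')

def extract_table_names(rows):
    """Pick the table-name column from SHOW TABLES rows by key, falling
    back to the first column when no known key is present."""
    names = []
    for row in rows:
        keys = row.keys() if hasattr(row, 'keys') else ()
        key = next((k for k in POSSIBLE_TABLE_NAME_KEYS if k in keys), None)
        names.append(row[key] if key is not None else row[0])
    return names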
That feels fragile / magical to me :/
Arguably we should be using the Metastore thrift client for the metadata fetching, but then we'd need to overload the connection string with extra parameters to connect to the Metastore. Not ideal either.
What about creating a new PyPI package for a SQLAlchemy dialect sparksql:// that would essentially derive from the hive:// dialect here and override this one method? It would allow it to diverge more in the future if needed.
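A minimal sketch of what that separate dialect could look like, assuming it derives from pyhive's HiveDialect and overrides only this method; the package and module names (e.g. pyhive_spark) are made up for illustration, and the query/quoting details would need to be checked against the real get_table_names.

# Hypothetical sparksql:// dialect deriving from the Hive dialect.
from sqlalchemy.dialects import registry
from pyhive.sqlalchemy_hive import HiveDialect

class SparkSQLDialect(HiveDialect):
    name = 'sparksql'

    def get_table_names(self, connection, schema=None, **kw):
        query = 'SHOW TABLES'
        if schema:
            query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
        # Spark returns (schema_name, table_name, isTemporary); take index 1.
        return [row[1] for row in connection.execute(query)]

# Registering under a made-up module path would let users write
# sparksql://host:10000/default in their SQLAlchemy URI:
registry.register('sparksql', 'pyhive_spark', 'SparkSQLDialect')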
So, what do we do now? :D
Can we merge the PR and close this bug?
bump
Based on the commits above, I changed a small part of ./superset/models/core.py. This is a workaround, but it works.
def get_all_table_names_in_schema(
    self,
    schema: str,
    cache: bool = False,
    cache_timeout: Optional[int] = None,
    force: bool = False,
) -> List[utils.DatasourceName]:
    """Parameters need to be passed as keyword arguments.

    For unused parameters, they are referenced in the
    cache_util.memoized_func decorator.

    :param schema: schema name
    :param cache: whether cache is enabled for the function
    :param cache_timeout: timeout in seconds for the cache
    :param force: whether to force refresh the cache
    :return: list of tables
    """
    try:
        # Workaround for table names in Spark Thrift Server:
        # SHOW TABLES returns (schema_name, table_name, isTemporary),
        # so take index 1 for the table and index 0 for the schema.
        if self.db_engine_spec == db_engine_specs.hive.HiveEngineSpec:
            engine = self.get_sqla_engine()
            if schema:
                rs = engine.execute('show tables in %s' % schema).fetchall()
            else:
                rs = engine.execute('show tables').fetchall()
            return [utils.DatasourceName(table=x[1], schema=x[0]) for x in rs]
        tables = self.db_engine_spec.get_table_names(
            database=self, inspector=self.inspector, schema=schema
        )
        return [
            utils.DatasourceName(table=table, schema=schema) for table in tables
        ]
    except Exception as ex:  # pylint: disable=broad-except
        logger.warning(ex)