
get_table_names() returns incorrect list of table names when querying Spark SQL Thriftserver

Open rezabaktash opened this issue 8 years ago • 15 comments

Hi, I found out that the method get_table_names() in class HiveDialect in sqlalchemy_hive.py does not return the correct table names. When querying the Spark SQL Thriftserver, the rows returned by SHOW TABLES are 3-tuples of the form (schema_name, table_name, isTemporary). You need index 1 for the table names, but the code uses index 0, so it returns a list of duplicates of the schema name. What can we do to fix this?
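
For reference, this is roughly what the current implementation does (a simplified sketch of the get_table_names method, not the exact upstream code):

    # Simplified sketch of HiveDialect.get_table_names (PyHive).
    # On Hive, SHOW TABLES returns a single tab_name column, so row[0] is the table name.
    # On the Spark SQL Thriftserver it returns (database, tableName, isTemporary),
    # so row[0] is the schema name and row[1] is the actual table name.
    def get_table_names(self, connection, schema=None, **kw):
        query = 'SHOW TABLES'
        if schema:
            query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
        return [row[0] for row in connection.execute(query)]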

rezabaktash avatar Aug 28 '17 06:08 rezabaktash

The Spark Thrift server claims to be compatible with Hive, so I think the correct fix is to edit the Spark code. But if they don't want to match Hive, the next best option is probably to figure out a hard-coded list of column names compatible with all Spark and Hive versions. In my opinion this is inferior, since it pushes the problem down to every user of Spark.

jingw avatar Aug 28 '17 17:08 jingw

Referencing the line after checking it out: https://github.com/dropbox/PyHive/blob/44717f7b5f1b09d615a0e8d14c9c0ae15404c576/pyhive/sqlalchemy_hive.py#L303

Does Spark SQL return the same format for both SHOW TABLES IN {schema} and USE {schema}; SHOW TABLES;?

mistercrunch avatar Aug 29 '17 03:08 mistercrunch

@mistercrunch There is no difference, they both return the same format.

rezabaktash avatar Aug 29 '17 06:08 rezabaktash

@jingw Is it possible to add a third option [sparksql] to pyhive alongside [hive] and [presto]? I know it would mean heavily redundant code, but it seems necessary for Spark to include the database/schema name in the SHOW TABLES result, because Spark's temp tables don't have a schema: they are returned for every SHOW TABLES query and need to be distinguished from the others. Meanwhile we can ask the Spark project to decide whether the SHOW TABLES command should be Hive-compatible or not.

rezabaktash avatar Aug 29 '17 09:08 rezabaktash

@jingw how about return [row[-1] for row in connection.execute(query)] ?

Would you approve such a PR?

mistercrunch avatar Aug 29 '17 18:08 mistercrunch

Nice, that sounds easier than what I was thinking :)

jingw avatar Aug 29 '17 18:08 jingw

I guess that'll break if someone decides to add another column (maybe show tables includes the table comment some day). Oh well. Both approaches I can think of (column names, column index) have been problematic.

Re the Spark dialect, I don't currently use Spark SQL, so I'd rather not add extra modules or branching logic that I can't test.

jingw avatar Aug 29 '17 18:08 jingw

@mistercrunch SHOW TABLES returns three fields, as I mentioned in the first comment: (schema_name, table_name, isTemporary). row[-1] will not return the table name.
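
For example (hypothetical rows, just to illustrate the two shapes):

    # Hypothetical result rows of SHOW TABLES, for illustration only:
    hive_row = ('page_views',)                # Hive: a single tab_name column
    spark_row = ('db1', 'page_views', False)  # Spark: (database, tableName, isTemporary)
    spark_row[-1]  # False -> the isTemporary flag, not the table name
    spark_row[1]   # 'page_views' -> the actual table name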

rezabaktash avatar Aug 29 '17 20:08 rezabaktash

I guess that leaves us with having a list of possible column names. #68 changed from row.tab_name to row[0] due to some incompatibility. We could crawl through the Hive/Spark source code for every possible name of this field. Alternatively we could branch on len(row), but that seems even more fragile. Ultimately everything's a workaround for the root inconsistency.
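
Roughly what I have in mind for the column-name approach (an untested sketch; the candidate names are guesses that would have to be checked against the Hive/Spark sources):

    # Untested sketch of the column-name approach. The candidate column names
    # ('tab_name' for Hive, 'tableName' for Spark) are guesses and would need
    # to be verified against each Hive/Spark version.
    _TABLE_NAME_COLUMNS = ('tab_name', 'tableName')

    def get_table_names(self, connection, schema=None, **kw):
        query = 'SHOW TABLES'
        if schema:
            query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
        rows = connection.execute(query).fetchall()
        if not rows:
            return []
        for column in _TABLE_NAME_COLUMNS:
            if column in rows[0].keys():
                return [row[column] for row in rows]
        # Fall back to the current behavior if no known column name matches.
        return [row[0] for row in rows]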

jingw avatar Aug 29 '17 20:08 jingw

That feels fragile / magical to me :/

jingw avatar Aug 29 '17 22:08 jingw

Arguably we should be using the Metastore thrift client for the metadata fetching, but then we'd need to overload the connection string with extra parameters to connect to the Metastore. Not ideal either.

What about creating a new PyPI package for a SQLAlchemy dialect sparksql:// that would essentially derive from the hive:// dialect here and override this one method? It would allow it to diverge further in the future if needed.
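
Something along these lines (an untested sketch; the class name, module name, and entry point are placeholders):

    # Untested sketch of a separate sparksql:// dialect that subclasses PyHive's
    # HiveDialect and only overrides table-name listing. SparkSQLDialect and the
    # entry point below are placeholders.
    from pyhive.sqlalchemy_hive import HiveDialect


    class SparkSQLDialect(HiveDialect):
        name = 'sparksql'

        def get_table_names(self, connection, schema=None, **kw):
            query = 'SHOW TABLES'
            if schema:
                query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
            # Spark returns (database, tableName, isTemporary); take the middle column.
            return [row[1] for row in connection.execute(query)]

    # Registered through the package's setup.py, e.g.
    # entry_points={'sqlalchemy.dialects': ['sparksql = sparksql_dialect:SparkSQLDialect']}

Connection strings using sparksql:// would then pick up the override while everything else keeps the Hive behavior.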

mistercrunch avatar Aug 30 '17 04:08 mistercrunch

So, what do we do now? :D

rezabaktash avatar Sep 05 '17 05:09 rezabaktash

Can we merge the PR and close this bug?

maver1ck avatar May 29 '18 14:05 maver1ck

bump

tooptoop4 avatar Sep 08 '18 02:09 tooptoop4

Based on the commits above, I changed a small part of ./superset/models/core.py. It is a workaround, but it works.

    def get_all_table_names_in_schema(
        self,
        schema: str,
        cache: bool = False,
        cache_timeout: Optional[int] = None,
        force: bool = False,
    ) -> List[utils.DatasourceName]:
        """Parameters need to be passed as keyword arguments.

        For unused parameters, they are referenced in
        cache_util.memoized_func decorator.

        :param schema: schema name
        :param cache: whether cache is enabled for the function
        :param cache_timeout: timeout in seconds for the cache
        :param force: whether to force refresh the cache
        :return: list of tables
        """
        try:
            # Workaround for table names in Spark Thrift server
            if self.db_engine_spec == db_engine_specs.hive.HiveEngineSpec:
                engine = self.get_sqla_engine()
                if schema:
                    rs = engine.execute('show tables in %s' % schema).fetchall()
                else:
                    rs = engine.execute('show tables').fetchall()
                return [utils.DatasourceName(table=x[1], schema=x[0]) for x in rs]

            tables = self.db_engine_spec.get_table_names(
                database=self, inspector=self.inspector, schema=schema
            )

            return [
                utils.DatasourceName(table=table, schema=schema) for table in tables
            ]
        except Exception as ex:  # pylint: disable=broad-except
            logger.warning(ex)


kwontaeheon avatar Mar 23 '21 03:03 kwontaeheon