
feat: Add database to read_

Open markdruffel-8451 opened this issue 1 year ago • 3 comments

Is your feature request related to a problem?

The read_ function family allows the user to name a table, but the resulting table is always registered in the default catalog and database.

What is the motivation behind your request?

If I run the code below I get an error: [TEMP_VIEW_NAME_TOO_MANY_NAME_PARTS](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#temp_view_name_too_many_name_parts): CREATE TEMPORARY VIEW or the corresponding Dataset APIs only accept single-part view names, but got: comms_media_dev.dart_extensions.test_table. SQLSTATE: 428EK.

import ibis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session=spark)

# Fails: table_name has three name parts (catalog.database.table)
idf = ispark.read_parquet(
    source="abfss://my_parquet",
    table_name="comms_media_dev.dart_extensions.test_table",
)

I can easily resolve this by doing the following:

# Set the session-wide current catalog and database, then register
# the table under a single-part name
ispark._session.catalog.setCurrentCatalog("comms_media_dev")
ispark._session.catalog.setCurrentDatabase("dart_extensions")
idf = ispark.read_parquet(source="abfss://my_parquet", table_name="test_table")

This is only a problem because I'm using ibis in a data pipeline: I don't want concurrent nodes setting the current catalog and database outside the write operation itself, because they might conflict with each other.

Describe the solution you'd like

Ideally the read_ functions would have a database parameter, but allowing table_name to accept {catalog}.{database}.{table} would work as well.
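One way to support the qualified spelling would be a small helper that splits an optionally qualified table_name into its parts. This is a sketch only; split_table_name is a hypothetical name, not part of the ibis API:

```python
def split_table_name(table_name):
    """Split an optionally qualified name into (catalog, database, table).

    Accepts "table", "database.table", or "catalog.database.table";
    missing parts are returned as None.
    """
    parts = table_name.split(".")
    if len(parts) == 1:
        return None, None, parts[0]
    if len(parts) == 2:
        return None, parts[0], parts[1]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"too many name parts: {table_name!r}")
```

A backend could then route the catalog and database parts to its own qualified-name handling instead of passing the full string to CREATE TEMPORARY VIEW.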

What version of ibis are you running?

10.0.0.dev49

What backend(s) are you using, if any?

pyspark

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

markdruffel-8451 avatar May 09 '24 20:05 markdruffel-8451

Hey @markdruffel-8451 -- thanks for raising this! I think this makes a bunch of sense for the backends where we have catalog/database support, and a database kwarg will have a nice symmetry with the rest of the API.

As an interim workaround, you can make use of a private context manager to handle setting and unsetting the catalog and database (note that this is a private API and might break without warning, but hopefully won't break before we add the database kwarg):

with ispark._active_catalog_database("comms_media_dev", "dart_extensions"):
    idf = ispark.read_parquet(source="abfss://my_parquet", table_name="test_table")
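For illustration, the same set-and-restore pattern can be written against the public pyspark Catalog API (currentCatalog/setCurrentCatalog require Spark 3.4+). active_catalog_database here is a hypothetical stand-in for the private helper, not an ibis API:

```python
from contextlib import contextmanager


@contextmanager
def active_catalog_database(catalog, catalog_name, database_name):
    """Temporarily switch the current catalog and database, restoring on exit.

    `catalog` is expected to expose the pyspark Catalog API:
    currentCatalog/currentDatabase/setCurrentCatalog/setCurrentDatabase.
    """
    prev_catalog = catalog.currentCatalog()
    prev_database = catalog.currentDatabase()
    catalog.setCurrentCatalog(catalog_name)
    catalog.setCurrentDatabase(database_name)
    try:
        yield
    finally:
        # Restore the catalog first: switching catalogs can reset the
        # current database, so the database is restored afterwards.
        catalog.setCurrentCatalog(prev_catalog)
        catalog.setCurrentDatabase(prev_database)
```

Note this still mutates session-global state while the block is active, so it does not remove the concurrency concern; it only guarantees the previous catalog and database are restored afterwards.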

gforsyth avatar May 09 '24 22:05 gforsyth

Hey @markdruffel-8451 -- I'm going to keep this open so we can track adding the database kwarg!

gforsyth avatar May 10 '24 15:05 gforsyth

This also applies to read_csv and the other read_ methods.

gforsyth avatar May 10 '24 17:05 gforsyth