ibis
ibis copied to clipboard
bug: `create_table(temp=True)` timing out due to slow table existence check
What happened?
create_table
for BigQuery backend may become inappropriately inefficient.
bigquery.Backend.create_table(name, df, temp=True)
calls _register_in_memory_tables
, which calls _register_in_memory_table
, which checks if the table exists using:
if raw_name not in self.list_tables(database=(project, dataset)):
The problem for BigQuery is caused by the fact that BQ uses hidden datasets to store temp data (see _make_session
function). These datasets are reused between sessions and may collect more and more tables within table expiration window. I've got 10s of thousands of temp tables. The listing for such large dataset takes minutes (hours, etc) and even causes request throttling from google side.
I have to use this monkey-patching to bypass the issue:
import google.api_core.exceptions as google_exceptions
import google.cloud.bigquery as bq_api
import ibis.expr.operations as ibis_ops
import sqlglot as sg
from ibis.backends.bigquery import Backend as BigQueryBackend
from ibis.backends.bigquery.datatypes import BigQuerySchema
def table_exists(backend: BigQueryBackend, table_name: str, database: tuple[str, str] | str | None = None) -> bool:
try:
_ = backend.table(table_name, database=database)
return True
except google_exceptions.NotFound as e:
if "Not found: Table" in e.message:
return False
else:
raise
def _register_in_memory_table(self: BigQueryBackend, op: ibis_ops.InMemoryTable) -> None:
raw_name = op.name
project = self._session_dataset.project
dataset = self._session_dataset.dataset_id
if not table_exists(self, raw_name, database=(project, dataset)):
table_id = sg.table(raw_name, db=dataset, catalog=project, quoted=False).sql(dialect=self.name)
bq_schema = BigQuerySchema.from_ibis(op.schema)
load_job = self.client.load_table_from_dataframe(
op.data.to_frame(),
table_id,
job_config=bq_api.LoadJobConfig(
# fail if the table already exists and contains data
write_disposition=bq_api.WriteDisposition.WRITE_EMPTY,
schema=bq_schema,
),
)
load_job.result()
BigQueryBackend._register_in_memory_table = _register_in_memory_table
What version of ibis are you using?
9.0.0.
What backend(s) are you using, if any?
BigQuery
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct