
bug: `create_table(temp=True)` timing out due to slow table existence check

Open · asiunov opened this issue May 21, 2024 · 2 comments

What happened?

`create_table` for the BigQuery backend can become unreasonably inefficient.

`bigquery.Backend.create_table(name, df, temp=True)` calls `_register_in_memory_tables`, which calls `_register_in_memory_table`, which checks whether the table exists using:

```python
if raw_name not in self.list_tables(database=(project, dataset)):
```

The problem for BigQuery is that BQ uses hidden datasets to store temp data (see the `_make_session` function). These datasets are reused between sessions and can accumulate more and more tables within the table expiration window; I've ended up with tens of thousands of temp tables. Listing such a large dataset takes minutes (or even hours) and also triggers request throttling on Google's side.
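For reference, a minimal call path that hits the slow check looks roughly like this (project and dataset IDs are placeholders):

```python
import ibis
import pandas as pd

# Connect to BigQuery; project/dataset names here are placeholders.
con = ibis.bigquery.connect(project_id="my-project", dataset_id="my_dataset")

df = pd.DataFrame({"a": [1, 2, 3]})

# temp=True routes the data through the hidden session dataset and ends up in
# _register_in_memory_table, which runs the list_tables() check shown above.
t = con.create_table("my_temp_table", df, temp=True)
```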

I have to apply the following monkey patch to work around the issue:

```python
import google.api_core.exceptions as google_exceptions
import google.cloud.bigquery as bq_api
import ibis.expr.operations as ibis_ops
import sqlglot as sg
from ibis.backends.bigquery import Backend as BigQueryBackend
from ibis.backends.bigquery.datatypes import BigQuerySchema


def table_exists(backend: BigQueryBackend, table_name: str, database: tuple[str, str] | str | None = None) -> bool:
  # Probe the single table directly instead of listing the whole dataset.
  try:
    _ = backend.table(table_name, database=database)
    return True
  except google_exceptions.NotFound as e:
    if "Not found: Table" in e.message:
      return False
    else:
      raise


def _register_in_memory_table(self: BigQueryBackend, op: ibis_ops.InMemoryTable) -> None:
  raw_name = op.name

  project = self._session_dataset.project
  dataset = self._session_dataset.dataset_id

  if not table_exists(self, raw_name, database=(project, dataset)):
    table_id = sg.table(raw_name, db=dataset, catalog=project, quoted=False).sql(dialect=self.name)

    bq_schema = BigQuerySchema.from_ibis(op.schema)
    load_job = self.client.load_table_from_dataframe(
      op.data.to_frame(),
      table_id,
      job_config=bq_api.LoadJobConfig(
        # fail if the table already exists and contains data
        write_disposition=bq_api.WriteDisposition.WRITE_EMPTY,
        schema=bq_schema,
      ),
    )
    load_job.result()


# Replace the slow list_tables()-based existence check with the direct lookup above.
BigQueryBackend._register_in_memory_table = _register_in_memory_table
```
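With the patch applied, temp-table creation does a per-table lookup instead of listing the whole session dataset. A usage sketch, assuming the patch above lives in an importable module and with placeholder project/dataset IDs:

```python
import ibis
import pandas as pd

# Importing the patch module above replaces
# BigQueryBackend._register_in_memory_table for the whole process.
con = ibis.bigquery.connect(project_id="my-project", dataset_id="my_dataset")

df = pd.DataFrame({"x": [1.0, 2.0]})
t = con.create_table("patched_temp_table", df, temp=True)  # no list_tables() call
```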

What version of ibis are you using?

9.0.0.

What backend(s) are you using, if any?

BigQuery

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

asiunov · May 21 '24 01:05