superduper icon indicating copy to clipboard operation
superduper copied to clipboard

Support for MotherDuck

Open fnikolai opened this issue 1 year ago • 2 comments

Albeit the sql-exapmple.ipynb is working for duckdb, it is failing for motherduck.

To connect to Motherduck:

# ---------------------------------
# MotherDuck (Cloud)
# ---------------------------------
!pip install "ibis-framework[duckdb]"
token = "<add your token>"
connection_uri = f"duckdb://md:{token}@test_db"

To get the token you must first create an account https://motherduck.com/docs/authenticating-to-motherduck/

The add data to datastore cell fails with:

ProgrammingError: (duckdb.duckdb.InvalidInputException) Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: Invalid Error: RPC 'SETUP_PLAN_FRAGMENTS' failed: Deadline Exceeded (DEADLINE_EXCEEDED, request id: '05515ef3-a7fe-4c9b-afc9-a57289617d16')
[SQL: INSERT INTO images (id, image, _fold) VALUES (?, ?, ?),  
....
(Background on this error at: https://sqlalche.me/e/14/f405)

Here is the traceback:

---------------------------------------------------------------------------
InvalidInputException                     Traceback (most recent call last)
File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1910, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1909     if not evt_handled:
-> 1910         self.dialect.do_execute(
   1911             cursor, statement, parameters, context
   1912         )
   1914 if self._has_events or self.engine._has_events:

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736     cursor.execute(statement, parameters)

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/duckdb_engine/__init__.py:160, in ConnectionWrapper.execute(self, statement, parameters, context)
    159     else:
--> 160         self.__c.execute(statement, parameters)
    161 except RuntimeError as e:

InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: Invalid Error: RPC 'SETUP_PLAN_FRAGMENTS' failed: Deadline Exceeded (DEADLINE_EXCEEDED, request id: '05515ef3-a7fe-4c9b-afc9-a57289617d16')

The above exception was the direct cause of the following exception:

ProgrammingError                          Traceback (most recent call last)
Cell In[48], line 2
      1 # Insert data from the 'images_df' DataFrame into the 'images' table
----> 2 _ = db.execute(images.insert(images_df))
      4 # Insert data from the 'captions_df' DataFrame into the 'captions' table
      5 _ = db.execute(captions.insert(captions_df))

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/superduperdb/base/datalayer.py:386, in Datalayer.execute(self, query, *args, **kwargs)
    384     return self.delete(query, *args, **kwargs)
    385 if isinstance(query, Insert):
--> 386     return self.insert(query, *args, **kwargs)
    387 if isinstance(query, Select):
    388     return self.select(query, *args, **kwargs)

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/superduperdb/base/datalayer.py:427, in Datalayer.insert(self, insert, refresh, encoders)
    425     if random.random() < s.CFG.fold_probability:
    426         r['_fold'] = 'valid'  # type: ignore[assignment]
--> 427 inserted_ids = insert.execute(self)
    429 if refresh and self.cdc.running:
    430     raise Exception('cdc cannot be activated and refresh=True')

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/superduperdb/backends/ibis/query.py:846, in IbisInsert.execute(self, db)
    843 encoded_documents = self._encode_documents(table=table)
    844 ids = [r[table.primary_id] for r in encoded_documents]
--> 846 db.databackend.insert(
    847     self.table_or_collection.identifier, raw_documents=encoded_documents
    848 )
    849 return ids

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/superduperdb/backends/ibis/data_backend.py:36, in IbisDataBackend.insert(self, table_name, raw_documents)
     34 def insert(self, table_name, raw_documents):
     35     if not self.in_memory:
---> 36         self.conn.insert(table_name, raw_documents)
     37     else:
     38         self.conn.create_table(table_name, pandas.DataFrame(raw_documents))

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/ibis/backends/base/sql/alchemy/__init__.py:777, in BaseAlchemyBackend.insert(self, table_name, obj, database, overwrite)
    775         if overwrite:
    776             bind.execute(to_table.delete())
--> 777         bind.execute(to_table.insert().values(obj))
    779 else:
    780     raise ValueError(
    781         "No operation is being performed. Either the obj parameter "
    782         "is not a pandas DataFrame or is not a ibis Table."
    783         f"The given obj is of type {type(obj).__name__} ."
    784     )

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1385, in Connection.execute(self, statement, *multiparams, **params)
   1381     util.raise_(
   1382         exc.ObjectNotExecutableError(statement), replace_context=err
   1383     )
   1384 else:
-> 1385     return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/sql/elements.py:334, in ClauseElement._execute_on_connection(self, connection, multiparams, params, execution_options, _force)
    330 def _execute_on_connection(
    331     self, connection, multiparams, params, execution_options, _force=False
    332 ):
    333     if _force or self.supports_execution:
--> 334         return connection._execute_clauseelement(
    335             self, multiparams, params, execution_options
    336         )
    337     else:
    338         raise exc.ObjectNotExecutableError(self)

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1577, in Connection._execute_clauseelement(self, elem, multiparams, params, execution_options)
   1565 compiled_cache = execution_options.get(
   1566     "compiled_cache", self.engine._compiled_cache
   1567 )
   1569 compiled_sql, extracted_params, cache_hit = elem._compile_w_cache(
   1570     dialect=dialect,
   1571     compiled_cache=compiled_cache,
   (...)
   1575     linting=self.dialect.compiler_linting | compiler.WARN_LINTING,
   1576 )
-> 1577 ret = self._execute_context(
   1578     dialect,
   1579     dialect.execution_ctx_cls._init_compiled,
   1580     compiled_sql,
   1581     distilled_params,
   1582     execution_options,
   1583     compiled_sql,
   1584     distilled_params,
   1585     elem,
   1586     extracted_params,
   1587     cache_hit=cache_hit,
   1588 )
   1589 if has_events:
   1590     self.dispatch.after_execute(
   1591         self,
   1592         elem,
   (...)
   1596         ret,
   1597     )

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1953, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1950             branched.close()
   1952 except BaseException as e:
-> 1953     self._handle_dbapi_exception(
   1954         e, statement, parameters, cursor, context
   1955     )
   1957 return result

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/base.py:2134, in Connection._handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   2132     util.raise_(newraise, with_traceback=exc_info[2], from_=e)
   2133 elif should_wrap:
-> 2134     util.raise_(
   2135         sqlalchemy_exception, with_traceback=exc_info[2], from_=e
   2136     )
   2137 else:
   2138     util.raise_(exc_info[1], with_traceback=exc_info[2])

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/util/compat.py:211, in raise_(***failed resolving arguments***)
    208     exception.__cause__ = replace_context
    210 try:
--> 211     raise exception
    212 finally:
    213     # credit to
    214     # https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
    215     # as the __traceback__ object creates a cycle
    216     del exception, replace_context, from_, with_traceback

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1910, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1908                 break
   1909     if not evt_handled:
-> 1910         self.dialect.do_execute(
   1911             cursor, statement, parameters, context
   1912         )
   1914 if self._has_events or self.engine._has_events:
   1915     self.dispatch.after_cursor_execute(
   1916         self,
   1917         cursor,
   (...)
   1921         context.executemany,
   1922     )

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736     cursor.execute(statement, parameters)

File ~/.pyenv/versions/superduper/lib/python3.11/site-packages/duckdb_engine/__init__.py:160, in ConnectionWrapper.execute(self, statement, parameters, context)
    158         self.__c.execute(statement)
    159     else:
--> 160         self.__c.execute(statement, parameters)
    161 except RuntimeError as e:
    162     if e.args[0].startswith("Not implemented Error"):

fnikolai avatar Jan 03 '24 10:01 fnikolai

Following error:

ProgrammingError: (duckdb.duckdb.InvalidInputException) Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: FATAL Error: Failed: database has been invalidated because of a previous fatal error. The database must be restarted prior to being used again.
Original error: "FATAL Error: You've encountered an internal MotherDuck error. You can help us diagnose and fix the issue by contacting support via slack or [email protected] and referencing error ID: 20240103a0d."
(Background on this error at: https://sqlalche.me/e/14/f405)

fnikolai avatar Jan 04 '24 11:01 fnikolai

Here the investigations of the MotherDuck team:

We identified the issue and a possible workaround. Ibis inlines the blob values in the INSERT INTO statement which makes the logical execution plan that has to be shipped between the DuckDB client and motherduck quite large and finally leads to the timeout (4MB is the current maximum size we support, the plan in the example notebook was around 48MB). We will add support for larger plans at some point, but I want to suggest a workaroud. Instead of inlining the BLOB values in the INSERT INTO statement, you can use DuckDB's zero-copy mechnism to insert values from images_df with a statement like this: INSERT INTO images (id, image, _fold) SELECT id, image, 'test' FROM images_df (instead of the INSERT INTO images VALUES(..) that is currently generated within _ = db.execute(images.insert(images_df)) ) Since the insert is happening behind the Ibis abstraction, it might require some special-casing for motherduck, but overall I'd advise to perform inserts in such a way even if we are going to add support for larger plans in the future.

blythed avatar Jan 31 '24 14:01 blythed