
Add mechanism to ingest Arrow data via the JDBC driver.

Open jonathanswenson opened this issue 2 years ago • 23 comments

Arrow's next release should include the ability to export an ArrowReader as a C data interface ArrowArrayStream. PR / JIRA

With this it should be possible to implement the DuckDBArrayStreamWrapper interface to ingest arrow data from java.

This approach does not rely on any external dependencies, which means a client will have to implement the DuckDBArrayStreamWrapper interface themselves to use this.

The implementation is trivial, but it might be nicer to support ingesting an ArrowReader directly instead.

Otherwise, this implementation tries to mirror the WASM / Python implementation.
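For reference, the wrapper interface this PR sketches looks roughly like the following (inferred from the Kotlin example further down this thread; hypothetical, not a merged API):

public interface DuckDBArrayStreamWrapper {
    // Fill the C data interface ArrowArrayStream at the given native address.
    void exportArrayStream(long arrayStreamPointer);

    // Fill the C data interface ArrowSchema at the given native address.
    void exportSchema(long schemaPointer);
}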

A few TODOs:

  • [x] Arrow actually has to release the updated ArrowArrayStream importer / exporter for Java for this to be useful (though because duckdb-jdbc doesn't actually depend on it, it doesn't matter).
  • [ ] determine if it is worth figuring out how to include the arrow IPC pieces in the JDBC driver
  • [ ] testing (somewhat challenging given the lack of a bundled arrow c-data interface, but anything that fills an ArrowArrayStream could do)
  • [ ] finding a home for JavaArrowStreamFactory that isn't just in the JNI/duckdb_java.cpp file.
  • [ ] determine what batch size makes sense (I copied what the python one was doing -- I think)
  • [x] not all timestamp vectors are supported. After ingesting a TimeStampMilli vector, extracting data via JDBC fails to deserialize the timestamp vector.

Maybe you could say that this would solve https://github.com/duckdb/duckdb/issues/3489, but it's currently closed 🙂

jonathanswenson avatar Jul 08 '22 21:07 jonathanswenson

Oops, shouldn't really have been a draft.

jonathanswenson avatar Jul 08 '22 22:07 jonathanswenson

Example implementation of DuckDBArrayStreamWrapper (in Kotlin):

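// assumes arrowReader: ArrowReader and allocator: BufferAllocator are already in scope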
val vectorSchemaRoot = arrowReader.vectorSchemaRoot
val schema = vectorSchemaRoot.schema

val wrapper = object : DuckDBArrayStreamWrapper {
    override fun exportArrayStream(arrayStreamPointer: Long) {
        val arrowArrayStream = ArrowArrayStream.wrap(arrayStreamPointer)
        Data.exportArrayStream(allocator, arrowReader, arrowArrayStream)
    }

    override fun exportSchema(schemaPointer: Long) {
        val arrowSchema = ArrowSchema.wrap(schemaPointer)
        Data.exportSchema(allocator, schema, null, arrowSchema)
    }
}

duckDBConnection.ingestArrowTable("main", tableName, wrapper)

jonathanswenson avatar Jul 08 '22 22:07 jonathanswenson

Regarding the timestamp milli vector issue mentioned above: the default JDBC -> Arrow mapping (from the Java arrow-jdbc package) produces timestamp milli vectors which, when queried, fail to deserialize in the JDBC layer.

Implementing the opposite direction (ResultSet -> Arrow) will likely solve this problem, and should be pretty straightforward given the changes from https://issues.apache.org/jira/browse/ARROW-16913, but until then querying such a table fails with the following exception:

No enum constant org.duckdb.DuckDBColumnType.TIMESTAMP_MS
java.lang.IllegalArgumentException: No enum constant org.duckdb.DuckDBColumnType.TIMESTAMP_MS
	at java.base/java.lang.Enum.valueOf(Enum.java:273)
	at org.duckdb.DuckDBColumnType.valueOf(DuckDBColumnType.java:3)
	at org.duckdb.DuckDBResultSetMetaData.TypeNameToType(DuckDBResultSetMetaData.java:51)
	at org.duckdb.DuckDBResultSetMetaData.<init>(DuckDBResultSetMetaData.java:29)
	at org.duckdb.DuckDBNative.duckdb_jdbc_meta(Native Method)
	at org.duckdb.DuckDBPreparedStatement.prepare(DuckDBPreparedStatement.java:97)
	at org.duckdb.DuckDBPreparedStatement.executeQuery(DuckDBPreparedStatement.java:174)

Edit: I got a little further in my implementation of the ResultSet -> ArrowArrayStream (see comment below) and it doesn't actually fix this, as the failure appears to be in the creation of the ResultSet rather than in access through the ResultSet metadata. In any case, this might be something that needs addressing in case someone were to, for some reason, use JDBC after importing an Arrow dataset.
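For anyone trying to reproduce without Arrow in the loop, something along these lines should trip the same enum lookup (a sketch; TIMESTAMP_MS is the DuckDB column type that an ingested TimeStampMilliVector maps to, and affected driver builds throw during executeQuery):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
     Statement stmt = conn.createStatement()) {
    // The CAST yields a TIMESTAMP_MS column, which the enum lookup rejects.
    ResultSet rs = stmt.executeQuery(
        "SELECT CAST('2022-07-09 00:00:00' AS TIMESTAMP_MS) AS ts");
    rs.next();
    System.out.println(rs.getTimestamp("ts"));
}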

jonathanswenson avatar Jul 09 '22 02:07 jonathanswenson

Can you also have a look here: https://github.com/hannes/duckdb/commit/7ebc91a276162246f3299f64a40a249f9f5c1642

hannes avatar Jul 10 '22 04:07 hannes

@hannes This does seem like a reasonable direction -- I had started going down this route myself before noticing that the folks over at arrow were going to take on the ArrowArrayStream work.

I had started working on a similar implementation for the ArrayStreamExporter as well, I'm not quite finished, but it runs into some slightly more complicated problems if you don't want to include the arrow libraries as dependencies. In particular the method native method returns an object that would then need to be cast to ArrowReader.

Very much open to other ideas, but using the ArrowArrayStream readers (and ArrowReader) is a lot easier from the consumption side.

More or less following the approach of this PR, it looked something like this; a WIP commit can be found here: https://github.com/duckdb/duckdb/commit/0c2ce230e117dc11e76d081972d5dead5b018474

The result could be consumed as follows (more Kotlin code):

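// assumes a resultSet from an executed query and an allocator: BufferAllocator are in scope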
val duckResultSet = resultSet as DuckDBResultSet

val importer = DuckDBArrayStreamImporter { arrayStreamPointer ->
    val stream = ArrowArrayStream.wrap(arrayStreamPointer)
    Data.importArrayStream(allocator, stream)
}
val arrowReader = duckResultSet.arrowExportArrayStream(importer) as ArrowReader

val vsr = arrowReader.vectorSchemaRoot
val field = vsr.schema.fields.first()
println("name: ${field.name}, type: ${field.type}, fieldType${field.fieldType}")
while (arrowReader.loadNextBatch()) {
    // TODO iterate through results in this vector
    val fieldVector = vsr.getVector(field)
    println(fieldVector.getObject(0))
}

jonathanswenson avatar Jul 10 '22 23:07 jonathanswenson

Hey Jonathan, thank you for your PR and for taking the initial step on consuming arrow through the jdbc.

I understand it is experimental and challenging to add tests for it, but I don't think we should merge untested code. (I understand it also depends on unreleased third-party code; for the R and Python clients we test against the Arrow dev builds, so maybe something similar can be done here?)

In general, I would like to see round-trip tests with all supported DuckDB/Arrow types. It is fine if not all types work yet, if that depends on the Arrow implementation not being finished, but this should also be explicit in the tests. Something similar to https://github.com/duckdb/duckdb/blob/64f33ecd2d08e330e14a7d52b9fa6c82ba59fcff/tools/pythonpkg/tests/fast/test_all_types.py#L87, plus tests with more than one batch of data.

pdet avatar Jul 11 '22 08:07 pdet

Totally understand the desire for this to be more tested 🙂 Testing against the arrow dev builds seems like a reasonable approach (and I'm also ok with punting on this until the next version is released).

I know there is a desire to keep the duckdb library free of external dependencies (such as pulling in the arrow IPC libraries), but I'm not sure if this applies only to the core duckdb library or also to the tools (such as Python, R, and Java). It seems like the more straightforward way to test this would require pulling in the arrow libraries (and implementing something similar to the Wrapper interface I have above).

Pulling in these dependencies would also make the implementation easier (consumers would not have to implement their own DuckDBArrayStreamWrapper or DuckDBArrayStreamImporter, which are only required due to this restriction).

I'm unsure how exactly one might go about pulling in the dependency (in test or prod) -- prod is probably a simple modification of the pom that is deployed to Maven, while test would require pulling it down / adding it to the classpath.

jonathanswenson avatar Jul 11 '22 08:07 jonathanswenson

Yes, there is a very strict policy on what can be a dependency. Arrow is not and should not be a dependency of any API. However, the tests and their respective CIs can have dependencies that are unrelated to the actual client. For example, in both Python and R we do install Arrow on the CI, and the respective tests check if they can load Arrow and otherwise skip the test. We also add Arrow to our dev-requirements for both of these APIs. I'm not entirely sure how to nicely replicate this behavior in the Java client, but that would be the desired goal.

pdet avatar Jul 11 '22 08:07 pdet

I'm curious what counts as a dependency here. Would adding arrow-c-data as a dependency in the generated pom file (https://github.com/jonathanswenson/duckdb/blob/9b0c6048e405b0e23f2c21aaf386b24322428320/scripts/jdbc_maven_deploy.py#L49-L99) be OK? It seems like this is more or less what the wasm implementation is doing: https://github.com/duckdb/duckdb-wasm/blob/master/packages/duckdb-wasm/package.json#L26

Or is this an exception to the no-dependencies rule?

I can see a way to run the tests (pulling the arrow-c-data jar from somewhere in the sky and adding it to the classpath when running the tests). But you'd likely need a swath of other transitive dependencies to build vectors and the like. I'll play with what is required here. I'm not very familiar with running Java compilation / execution from CMake, but maybe @hannes has some ideas (or already has a plan).

I can also theoretically see some sort of optional, additional Maven artifact that provides some of these capabilities but is not required to run the Java system. Something like distributing a duckdb-jdbc-arrow artifact that, when paired with the duckdb-jdbc Maven artifact, provides the additional ingestion / export capabilities and actually depends on Arrow. I'm not sure how feasible this is (or whether it aligns with the policies in place).

Edit: for reference I found the nightly java jars here: https://github.com/ursacomputing/crossbow/releases/tag/nightly-packaging-2022-07-11-0-github-java-jars.

jonathanswenson avatar Jul 11 '22 16:07 jonathanswenson

I think the difference is that duckdb-wasm uses Arrow as the data protocol for data import and all query results, so it's intrinsic to the functionality of the integration. That's not the same for the Java/Python/R APIs, where Arrow support is extra functionality (e.g., scanning and producing Arrow objects).

pdet avatar Jul 11 '22 17:07 pdet

It doesn't look like Maven is being used for testing yet (unsure if that is desired). However, it seems pretty easy to pull in nightlies from the Apache nightlies Maven repo: https://arrow.apache.org/docs/dev/java/install.html?highlight=nightly#installing-from-apache-nightlies

Edit: the transitive dependencies of arrow-c-data are big enough that doing this manually will be a bit of a pain.

However, for the time being, I could see something along the lines of a script that would pull a nightly (or, in the future, stable) version of Arrow from the cloud, add it to the classpath, and execute something similar to what is currently being run in CI:

arrow java jar nightlies are here: https://github.com/ursacomputing/crossbow/releases/tag/nightly-packaging-2022-07-10-0-github-java-jars

wget ... -O arrow-vector.jar
wget ... -O arrow-c-data.jar
java -cp build/release/tools/jdbc/duckdb_jdbc.jar:arrow-vector.jar:arrow-c-data.jar org.duckdb.test.TestDuckDBJDBC

I think this would also require splitting the tests out of the distributed jar though (well, maybe not require, but it would definitely be possible to get ClassNotFoundException errors if you tried to run the tests without the jars on the path). Unless this is done, the compilation step will also have to have these jars handy.

jonathanswenson avatar Jul 11 '22 18:07 jonathanswenson

Playing with a simple ingestion of an arrow stream and validation of results makes it pretty clear that testing is a bit of a PITA.

Was able to make it work by adding compile-time dependencies:

arrow-c-data-9.0.0-SNAPSHOT.jar
arrow-vector-9.0.0-SNAPSHOT.jar
arrow-memory-core-9.0.0-SNAPSHOT.jar
jackson-annotations-2.13.2.jar

But I needed a bunch more in order to make the test actually execute (I just picked some random versions for the time being):

arrow-c-data-9.0.0-SNAPSHOT.jar
arrow-vector-9.0.0-SNAPSHOT.jar
arrow-memory-core-9.0.0-SNAPSHOT.jar
slf4j-api-1.7.7.jar
arrow-memory-netty-9.0.0-SNAPSHOT.jar
netty-common-4.1.72.Final.jar
netty-buffer-4.1.72.Final.jar
flatbuffers-java-1.12.0.jar
arrow-format-9.0.0-SNAPSHOT.jar
jackson-core-2.13.2.jar
jackson-databind-2.13.2.2.jar
jackson-annotations-2.13.2.jar

I have some round-trip tests written, but am stuck on the compile step for CI. I haven't used cmake in a while... Also running into some complications around unsigned types (and huge int types), as I haven't quite figured out how to express them in Arrow Java.

Edit: Can do something like this to keep the distributed jar pure (compiled without these dependencies -- and excluding the test class files now). It generates a duckdb_jdbc_test.jar file.

file(GLOB JAVA_TEST_FILES src/test/java/org/duckdb/test/*.java)

set(ARROW_LOCATION /Users/me/src/arrow) # different place for this
set(MAVEN_LOCATION /Users/me/.m2/repository) # different place for this 

set(JAVA_DEPS
    ${ARROW_LOCATION}/java/c/target/arrow-c-data-9.0.0-SNAPSHOT.jar
    ${ARROW_LOCATION}/java/vector/target/arrow-vector-9.0.0-SNAPSHOT.jar
    ${ARROW_LOCATION}/java/memory/memory-core/target/arrow-memory-core-9.0.0-SNAPSHOT.jar
    ${MAVEN_LOCATION}/com/fasterxml/jackson/core/jackson-annotations/2.13.2/jackson-annotations-2.13.2.jar)

add_jar(duckdb_jdbc_test ${JAVA_TEST_FILES}
        ENTRY_POINT org.duckdb.test.TestDuckDBJDBC
        INCLUDE_JARS
        $<TARGET_FILE_DIR:duckdb_java>/duckdb_jdbc.jar
        ${JAVA_DEPS})

With this generated jar file one can run the tests like:

java -cp "build/release/tools/jdbc/duckdb_jdbc_test.jar:build/release/tools/jdbc/duckdb_jdbc.jar:<OTHER DEPENDENCIES>" org.duckdb.test.TestDuckDBJDBC

jonathanswenson avatar Jul 12 '22 20:07 jonathanswenson

I'd certainly be open to seeing Maven used for building the Java artifacts (replacing some of the handwritten scripts currently used) - I wonder if it's possible to have Arrow pulled in purely as a compile-time dep? Then at runtime, if it's not provided on the classpath, you'd get the standard ClassNotFoundException.

I've experimented a bit on my fork as well, though getting cmake to play nicely with maven is interesting: https://github.com/duckdb/duckdb/compare/master...Mause:duckdb:feature/maven

Mause avatar Jul 13 '22 06:07 Mause

@pdet can you have a look at my branch which manages without a dependency and see if this can be reconciled? Thanks!

hannes avatar Jul 13 '22 07:07 hannes

@hannes the dependency is not required by the implementation, but rather by the testing that @pdet would like to see. Theoretically, it can be achieved without any dependency -- even in the tests -- but that would require constructing the C data ArrowArrayStream / ArrowArray / ArrowSchema pieces via some other mechanism (like calling back into the C++ code). This wouldn't really reflect how this library would be used, but might satisfy @pdet's testing request.

jonathanswenson avatar Jul 13 '22 16:07 jonathanswenson

Doing the following in the CMakeLists file will cause the build to compile the test jar (split out from the primary duckdb_jdbc jar) with the arrow dependencies, allowing the DuckDB JDBC jar to be built without any dependencies (though it will no longer include the test class files).

set(ARROW_VERSION 9.0.0.dev385)
set(JAR_DIR "${CMAKE_BINARY_DIR}/tools/jdbc")

# download compile dependencies
file(DOWNLOAD https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-c-data/${ARROW_VERSION}/arrow-c-data-${ARROW_VERSION}.jar
     ${JAR_DIR}/arrow-c-data.jar)
file(DOWNLOAD https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory-core/${ARROW_VERSION}/arrow-memory-core-${ARROW_VERSION}.jar
     ${JAR_DIR}/arrow-memory-core.jar)
file(DOWNLOAD https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-vector/${ARROW_VERSION}/arrow-vector-${ARROW_VERSION}.jar
     ${JAR_DIR}/arrow-vector.jar)
file(DOWNLOAD https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.13.3/jackson-annotations-2.13.3.jar
     ${JAR_DIR}/jackson-annotations.jar)

file(GLOB JAVA_TEST_FILES src/test/java/org/duckdb/test/*.java)

set(JAVA_DEPS
    ${JAR_DIR}/arrow-c-data.jar
    ${JAR_DIR}/arrow-memory-core.jar
    ${JAR_DIR}/arrow-vector.jar
    ${JAR_DIR}/jackson-annotations.jar)

add_jar(duckdb_jdbc_test ${JAVA_TEST_FILES}
        ENTRY_POINT org.duckdb.test.TestDuckDBJDBC
        INCLUDE_JARS
        $<TARGET_FILE_DIR:duckdb_java>/duckdb_jdbc.jar
        ${JAVA_DEPS})

Doing so allows a test class (such as the following) to be written / compiled / used in tests. With this, in a test, you can create an ArrowReader in Java, wrap it with this stream wrapper, ingest it into DuckDB, and test it thoroughly (a rough sketch follows the helper below).

private static DuckDBArrayStreamWrapper wrapStream(ArrowReader passedReader, BufferAllocator passedAllocator) throws Exception {
    final ArrowReader arrowReader = passedReader;
    final BufferAllocator allocator = passedAllocator;
    final Schema schema = arrowReader.getVectorSchemaRoot().getSchema();

    return new DuckDBArrayStreamWrapper() {
        @Override
        public void exportArrayStream(long arrowArrayStreamPointer) {
            ArrowArrayStream arrowArrayStream = ArrowArrayStream.wrap(arrowArrayStreamPointer);
            Data.exportArrayStream(allocator, arrowReader, arrowArrayStream);
        }

        @Override
        public void exportSchema(long arrowSchemaPointer) {
            ArrowSchema arrowSchema = ArrowSchema.wrap(arrowSchemaPointer);
            Data.exportSchema(allocator, schema, null, arrowSchema);
        }
    };
}
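For example, a round-trip test built on this helper could look roughly like the following (a sketch: buildReaderWithAllTypes is a hypothetical test helper, and ingestArrowTable is the API proposed in this PR):

import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.duckdb.DuckDBConnection;

try (BufferAllocator allocator = new RootAllocator();
     DuckDBConnection conn =
         (DuckDBConnection) DriverManager.getConnection("jdbc:duckdb:")) {
    // buildReaderWithAllTypes is hypothetical: it should produce one vector
    // per supported type, ideally spanning more than one batch.
    ArrowReader reader = buildReaderWithAllTypes(allocator);
    conn.ingestArrowTable("main", "all_types", wrapStream(reader, allocator));
    try (ResultSet rs = conn.createStatement()
             .executeQuery("SELECT * FROM all_types")) {
        // walk every row / column and compare against the original vectors
    }
}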

The test jar won't be executable unless the other dependencies are pulled down and added to the classpath (which I still haven't figured out how to do in CI yet).

jonathanswenson avatar Jul 15 '22 01:07 jonathanswenson

This seems to work (at least on linux): https://github.com/jonathanswenson/duckdb/commit/77ec47eb367755e3412aa5a3f2b3fb0b354d84f9#diff-82aedb94b8a051102f611bcc51b8027756d657ba5e371439bbd22f389e6e745eR52-R98

See LinuxRelease test here: https://github.com/jonathanswenson/duckdb/runs/7350669316?check_suite_focus=true

There are some warnings about the way I'm doing things; they should be solvable though.

CMake Warning (dev):
  Policy CMP0058 is not set: Ninja requires custom command byproducts to be
  explicit.  Run "cmake --help-policy CMP0058" for policy details.  Use the
  cmake_policy command to set the policy and suppress this warning.
  This project specifies custom command DEPENDS on files in the build tree
  that are not specified as the OUTPUT or BYPRODUCTS of any
  add_custom_command or add_custom_target:
   tools/jdbc/java_dependencies/arrow-c-data.jar
   tools/jdbc/java_dependencies/arrow-memory-core.jar
   tools/jdbc/java_dependencies/arrow-vector.jar
   tools/jdbc/java_dependencies/jackson-annotations.jar
  For compatibility with versions of CMake that did not have the BYPRODUCTS
  option, CMake is generating phony rules for such files to convince 'ninja'
  to build.
  Project authors should add the missing BYPRODUCTS or OUTPUT options to the
  custom commands that produce these files.
This warning is for project developers.  Use -Wno-dev to suppress it.

~Also the tests seem to compile, but not run, on Windows; I think the classpath needs to be separated by ; instead of : on Windows. Could be that the paths themselves are incorrect (using / instead of \ or something).~

Edit: Tests don't even compile on Windows (https://github.com/jonathanswenson/duckdb/runs/7350669227?check_suite_focus=true). It seems like this might only be the DuckDB jar (rather than the dependencies?) but it's a little difficult to tell.

cl : command line warning D9002: ignoring unknown option '-fexceptions' [D:\a\duckdb\duckdb\tools\jdbc\duckdb_java.vcxproj]
  duckdb_java.cpp
D:\a\duckdb\duckdb\tools\jdbc\src\jni\duckdb_java.cpp(635,1): warning C4297: 'Java_org_duckdb_DuckDBNative_duckdb_1jdbc_1fetch': function assumed not to throw an exception but does [D:\a\duckdb\duckdb\tools\jdbc\duckdb_java.vcxproj]
D:\a\duckdb\duckdb\tools\jdbc\src\jni\duckdb_java.cpp(635,1): message : __declspec(nothrow), throw(), noexcept(true), or noexcept was specified on the function [D:\a\duckdb\duckdb\tools\jdbc\duckdb_java.vcxproj]
D:\a\duckdb\duckdb\tools\jdbc\src\jni\duckdb_java.cpp(928,56): warning C4311: 'type cast': pointer truncation from 'ArrowArrayStream *' to 'long' [D:\a\duckdb\duckdb\tools\jdbc\duckdb_java.vcxproj]
D:\a\duckdb\duckdb\tools\jdbc\src\jni\duckdb_java.cpp(944,113): warning C4311: 'type cast': pointer truncation from 'ArrowSchema *' to 'long' [D:\a\duckdb\duckdb\tools\jdbc\duckdb_java.vcxproj]
D:\a\duckdb\duckdb\tools\jdbc\src\jni\duckdb_java.cpp(999,75): warning C4311: 'type cast': pointer truncation from 'ArrowArrayStream *' to 'long' [D:\a\duckdb\duckdb\tools\jdbc\duckdb_java.vcxproj]
     Creating library D:/a/duckdb/duckdb/tools/jdbc/Release/duckdb_java.lib and object D:/a/duckdb/duckdb/tools/jdbc/Release/duckdb_java.exp
  duckdb_java.vcxproj -> D:\a\duckdb\duckdb\tools\jdbc\Release\libduckdb_java.so_windows_amd64
  Building Java objects for duckdb_jdbc_test.jar
CUSTOMBUILD : warning : [options] bootstrap class path not set in conjunction with -source 1.7 [D:\a\duckdb\duckdb\tools\jdbc\duckdb_jdbc_test.vcxproj]
  D:\a\duckdb\duckdb\tools\jdbc\src\test\java\org\duckdb\test\TestDuckDBJDBC.java:39: error: cannot find symbol
  import org.duckdb.DuckDBAppender;
                   ^
    symbol:   class DuckDBAppender
    location: package org.duckdb

jonathanswenson avatar Jul 15 '22 02:07 jonathanswenson

@jonathanswenson that does look classpath related, yeah; I would put money on the classpath separator being at fault. My preference would still be that we use Maven to orchestrate all this.

Mause avatar Jul 15 '22 05:07 Mause

@Mause yeah, I think Maven would certainly help here; I'm unsure how one would go about calling into it from CMake, or ensuring that it is installed.

I can look into it though — I’m woefully unfamiliar with cmake.

jonathanswenson avatar Jul 15 '22 05:07 jonathanswenson

Potentially worth noting: this is now a little easier, as Arrow 9.0.0 has been released (there's no need to depend on a pre-release build).

However the build / testing challenges are still present.

jonathanswenson avatar Aug 15 '22 04:08 jonathanswenson

Hey @jonathanswenson, we recently got out-of-tree extensions working; these allow us to develop extensions in a separate repository, without limitations on dependencies.

Yesterday I started working on an Arrow extension for C++, since for C++ we actually have to link things and take Arrow as a dependency there. Maybe it is worth exploring whether adding the Java one as an out-of-tree extension would be a better solution.

pdet avatar Sep 01 '22 10:09 pdet

@pdet I'm certainly interested in something like this. Is there a framework for out-of-tree extensions? Is the idea that this is something that is loaded similarly to the parquet extension?

Or is the concept more general -- just a separate library that can depend on / link / etc the duckdb library?

I could theoretically see the entire JDBC driver extracted and written as its own library that depends on the duckdb library, also pulling in arrow (and maybe some test dependencies like junit), but I'm unsure if that fits into what you are describing.

jonathanswenson avatar Sep 01 '22 16:09 jonathanswenson

I think so too. In theory, any of our current extensions could be out-of-tree, and I do think this brings many advantages: it reduces the necessity of inlining dependencies, reduces the code size of the main repo, and can also reduce the number of tests we run in the main repo CI, since those can be executed automatically on the out-of-tree repo :)

With that being said, I don't think there has been any internal discussion on what will become an out-of-tree extension, besides moving some statistical functions, which would also reduce the binary size.

pdet avatar Sep 02 '22 07:09 pdet

@jonathanswenson should we close this now that we have something in?

Mause avatar Nov 14 '22 02:11 Mause

The data ingestion still hasn't been addressed; @hannes’s PR only implements the export of arrow data from a JDBC result set.

But it doesn’t look like we’re moving forward with this approach, so we can close it.

jonathanswenson avatar Nov 14 '22 03:11 jonathanswenson

> The data ingestion still hasn't been addressed; @hannes’s PR only implements the export of arrow data from a JDBC result set.
>
> But it doesn’t look like we’re moving forward with this approach, so we can close it.

Ah, my mistake. We can keep this open then

Mause avatar Nov 14 '22 03:11 Mause

Hey all. We have an Arrow Flight SQL (AFS) service that can query various data warehouse backends (currently through JDBC, but hopefully soon through either ADBC or Arrow streams from their respective JDBC drivers, if available) and stream the results back through AFS. We're considering adding an in-between layer for some additional manipulation, and DuckDB seems like a perfect fit. The current registerArrowStream doesn't work for us because we expect some result sets to be too big to fit in memory. If we could instead stream Arrow results onto disk using ingestArrowTable, execute our SQL against DuckDB, and stream the results back out directly through AFS, this would seem a lot more feasible.

Reading through this thread, it seems like the issue right now is testing? Has anything changed in the past year that would enable this?

aiguofer avatar Aug 06 '23 22:08 aiguofer

@aiguofer the approach here was largely the same as the final implementation by @hannes (with some API changes that avoid loading the Arrow Java libraries at compile time).

As far as I know, you should be able to write Arrow "files" to disk, then open an ArrowFileReader (or ArrowStreamReader, if they were written using the stream format) and use it to load those results into DuckDB without holding them entirely in memory. After registering the Arrow streams, you should be able to create a persistent table that (as long as DuckDB is not running entirely in memory) shouldn't require keeping the tables in memory. A rough sketch of that flow follows below.
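A sketch only, assuming the registerArrowStream API mentioned above plus the Arrow Java c-data module on the classpath (the file name, table names, and database path are illustrative):

import java.io.FileInputStream;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.arrow.c.ArrowArrayStream;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.duckdb.DuckDBConnection;

try (BufferAllocator allocator = new RootAllocator();
     ArrowStreamReader reader =
         new ArrowStreamReader(new FileInputStream("results.arrows"), allocator);
     ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator);
     DuckDBConnection conn = (DuckDBConnection)
         DriverManager.getConnection("jdbc:duckdb:/tmp/afs_cache.db")) {
    // Expose the on-disk Arrow stream to DuckDB through the C data interface.
    Data.exportArrayStream(allocator, reader, stream);
    conn.registerArrowStream("staged_results", stream);
    // Materialize into a persistent table so the data lives in the DuckDB
    // file rather than in memory.
    try (Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE TABLE materialized AS SELECT * FROM staged_results");
    }
}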

jonathanswenson avatar Aug 07 '23 02:08 jonathanswenson