arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17661: [C++][Python] Add Flight SQL ADBC driver

Open lidavidm opened this issue 3 years ago • 7 comments

Add an ADBC driver based on Flight SQL to the Arrow libraries (C++ and PyArrow). This makes the driver available to any application linking to libarrow_flight_sql.so, as well as to any application that imports pyarrow.flight_sql. The Python driver manager must be installed for the Python bindings; this is treated as an optional dependency.

On top of that, the sample SQLite Flight SQL server was improved to use as a reference implementation. The ADBC test suite is downloaded from the ADBC repository and run against this server.

lidavidm avatar Sep 09 '22 15:09 lidavidm

  • [x] Improve the Flight SQL driver and test suite (and figure out how exactly we're going to bridge this with the generic test suite in the ADBC repo - possibly as a CMake FetchContent dependency?)
  • [x] Add basic Python tests
  • [x] Enable Flight SQL on builds
  • [x] Implement/test Substrait support
  • [x] Implement Connection.GetObjects
  • ~~[ ] Add tests for OpenTelemetry~~ (OTel middleware is already tested)
  • [x] How will OTel work in Python? (Add support for arbitrary call headers, which can be used to manually propagate OTel headers in Python)
  • [x] Make sure header support works on the connection too
  • [x] Consider a naming scheme for custom options
  • [x] Add support for configuring timeouts
  • ~~[ ] Add documentation page (maybe in apache/arrow-adbc instead)~~ https://github.com/apache/arrow-adbc/pull/138
  • [x] Fix how partitions work and add an explicit test
  • ~~[ ] Document how partitions work and document how to poke at the Flight bits underneath as needed~~ https://github.com/apache/arrow-adbc/pull/138
  • [x] Update FetchContent
  • [x] Get the driver manager published so this can be used easily (nightly wheels added now)
  • [ ] Enable source-specific connection properties
  • [x] Ensure this builds on macOS
  • [x] Update the ADBC reference

lidavidm avatar Sep 09 '22 15:09 lidavidm

>>> import pyarrow.flight_sql
>>> conn = pyarrow.flight_sql.connect("grpc://localhost:31337")
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1, 'hi', 2.0")
>>> cur.fetch_arrow_table()
pyarrow.Table
1: int64
'hi': string
2.0: double
----
1: [[1]]
'hi': [["hi"]]
2.0: [[2]]

lidavidm avatar Sep 09 '22 15:09 lidavidm

https://issues.apache.org/jira/browse/ARROW-17661

github-actions[bot] avatar Sep 09 '22 17:09 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Sep 09 '22 17:09 github-actions[bot]

Dependent on https://github.com/apache/arrow/pull/14079.

lidavidm avatar Sep 20 '22 17:09 lidavidm

There's several improvements I plan to make here soon…so this should be WIP

lidavidm avatar Sep 23 '22 14:09 lidavidm

Let's add the ability to pass in source-specific connection properties as headers.

jduo avatar Oct 21 '22 23:10 jduo

I tried to build this branch with the next cmake options:

-DCMAKE_BUILD_TYPE=Debug
-DCMAKE_INSTALL_LIBDIR=lib
-DARROW_PYTHON=ON
-DARROW_INSTALL_NAME_RPATH=OFF
-DARROW_EXTRA_ERROR_CONTEXT=ON
-DARROW_CUDA=OFF
-DARROW_NO_DEPRECATED_API=OFF
-DARROW_BUILD_STATIC=OFF
-DARROW_INSTALL_NAME_RPATH=ON
-DARROW_USE_LD_GOLD=ON
-DARROW_USE_CCACHE=ON
-DARROW_USE_ASAN=OFF
-DARROW_USE_TSAN=OFF
-DARROW_USE_UBSAN=OFF
-DARROW_BUILD_EXAMPLES=ON
-DARROW_BUILD_TESTS=ON
-DARROW_TEST_MEMCHECK=OFF
-DARROW_ENABLE_TIMING_TESTS=ON
-DARROW_BUILD_INTEGRATION=ON
-DARROW_BUILD_BENCHMARKS=OFF
-DARROW_BUILD_BENCHMARKS_REFERENCE=OFF
-DARROW_BUILD_OPENMP_BENCHMARKS=OFF
-DARROW_BUILD_DETAILED_BENCHMARKS=OFF
-DARROW_TEST_LINKAGE=shared
-DARROW_FUZZING=OFF
-DARROW_LARGE_MEMORY_TESTS=OFF
-DARROW_PARQUET=ON
-DPARQUET_REQUIRE_ENCRYPTION=ON
-DPARQUET_BUILD_EXAMPLES=ON
-DPARQUET_BUILD_EXECUTABLES=ON
-DARROW_ORC=ON
-DARROW_COMPUTE=ON
-DARROW_CSV=ON
-DARROW_DATASET=ON
-DARROW_EXTRA_ERROR_CONTEXT=ON
-DARROW_FILESYSTEM=ON
-DARROW_JEMALLOC=ON
-DARROW_JEMALLOC_USE_SHARED=ON
-DARROW_JSON=ON
-DARROW_S3=OFF
-DARROW_HDFS=ON
-DARROW_GCS=OFF
-DARROW_FLIGHT=ON
–DARROW_FLIGHT_SQL=ON
-DARROW_GANDIVA=OFF
-DARROW_PLASMA=ON
-DARROW_SUBSTRAIT=ON
-DARROW_BUILD_UTILITIES=ON
-DARROW_EXTRA_ERROR_CONTEXT=ON
-DARROW_USE_CCACHE=ON
-DCMAKE_UNITY_BUILD=OFF
-DARROW_USE_PRECOMPILED_HEADERS=OFF
-DBUILD_WARNING_LEVEL=CHECKIN
-DARROW_WITH_OPENTELEMETRY=OFF
-DARROW_WITH_BZ2=ON
-DARROW_WITH_ZLIB=ON
-DARROW_WITH_ZSTD=ON
-DARROW_WITH_LZ4=ON
-DARROW_WITH_THRIFT=ON
-DARROW_WITH_SNAPPY=ON
-DARROW_WITH_BROTLI=ON
-DARROW_DEPENDENCY_SOURCE=CONDA
-DGTest_SOURCE=BUNDLED
-DCMAKE_OSX_ARCHITECTURES=arm64
-DCMAKE_C_COMPILER=/usr/bin/clang
-DCMAKE_CXX_COMPILER=/usr/bin/clang++
-DCMAKE_C_STANDARD=17
-DCMAKE_CXX_STANDARD=17

And after I installed the C++ arrow libs (in the conda env) I got this error (when I was trying to build the python bindings):

        Could not find a package configuration file provided by "ArrowFlightSql"
        with any of the following names:

          ArrowFlightSqlConfig.cmake
          arrowflightsql-config.cmake

I exported these env vars before trying to install the python binding:

export PYARROW_WITH_FLIGHT=ON
export PYARROW_WITH_DATASET=ON
export PYARROW_WITH_PARQUET=ON
export PYARROW_WITH_FLIGHT=ON
export PYARROW_WITH_FLIGHT_SQL=ON

These are the arrow libs that were installed:

ls $CONDA_PREFIX/lib | grep arrow
libarrow.1000.0.0.dylib
libarrow.1000.dylib
libarrow.dylib
libarrow_dataset.1000.0.0.dylib
libarrow_dataset.1000.dylib
libarrow_dataset.dylib
libarrow_flight.1000.0.0.dylib
libarrow_flight.1000.dylib
libarrow_flight.dylib
libarrow_flight_testing.1000.0.0.dylib
libarrow_flight_testing.1000.dylib
libarrow_flight_testing.dylib
libarrow_substrait.1000.0.0.dylib
libarrow_substrait.1000.dylib
libarrow_substrait.dylib
libarrow_testing.1000.0.0.dylib
libarrow_testing.1000.dylib
libarrow_testing.dylib

I did check the output for the cmake summary and it seems the option (–DARROW_FLIGHT_SQL=ON) is not being captured:

--   ARROW_FLIGHT=ON [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_FLIGHT_SQL=OFF [default=OFF]
--       Build the Arrow Flight SQL extension

aucahuasi avatar Oct 28 '22 23:10 aucahuasi

@jduo the last commit adds a low-level option to specify headers for every call. We can update the driver once we specify the URL format.

lidavidm avatar Dec 05 '22 21:12 lidavidm

@pitrou would you be able to look at this at some point?

The main thing is the Python bindings have an optional dependency on the adbc_driver_manager package, which is not yet released. This also complicates writing tests for it (maybe we can have a Crossbow job to run tests against the nightly packages?)

lidavidm avatar Dec 06 '22 14:12 lidavidm

The main thing is the Python bindings have an optional dependency on the adbc_driver_manager package, which is not yet released.

Is it possible to do a 0.1 release quickly?

pitrou avatar Dec 06 '22 14:12 pitrou

In progress :slightly_smiling_face: I'm hoping to have the scripts ready this week, at which point either I need to get my GPG key into KEYS or I need to find another PMC to help.

lidavidm avatar Dec 06 '22 14:12 lidavidm

The main thing is the Python bindings have an optional dependency on the adbc_driver_manager package, which is not yet released. This also complicates writing tests for it (maybe we can have a Crossbow job to run tests against the nightly packages?)

You could simply install it from an arbitrary git changeset, e.g. git+https://github.com/apache/arrow-adbc.git@f1d84a3ba659f0c00fbc4ea5d8ae806ebab3bf28#subdirectory=python/adbc_driver_manager.

pitrou avatar Dec 06 '22 21:12 pitrou

Hi @lidavidm - I've built your branch from source and am attempting to test the Python ADBC connectivity to a Flight SQL server I'm running. I am using TLS in my testing, but the certificate is self-signed, so it is failing certificate validation when I connect.

Here is the Python testing code:

import pyarrow.flight_sql
from dotenv import load_dotenv

# Load our environment for the password...
load_dotenv(dotenv_path=".env")

uri = ("grpc+tls://localhost:31337?"
       "useEncryption=true"
       f"&user=flight_username&password={os.environ['FLIGHT_PASSWORD']}"
       "&disable_server_verification=True"
       )
conn = pyarrow.flight_sql.connect(uri=uri)
cur = conn.cursor()
cur.execute("SELECT 1, 'hi', 2.0")
cur.fetch_arrow_table()

Please note - I'm attempting to use the uri format per the Arrow Flight SQL JDBC driver specification, which is likely incorrect.

Can you let me know the way to disable certificate validation to test the functionality?

prmoore77 avatar Dec 07 '22 21:12 prmoore77

hi @lidavidm - in addition to disabling certificate validation (details in my comment above) - I wanted to know how best to handle authentication.

While testing the Arrow Flight SQL JDBC driver - I noticed that the handshake process seems as follows:

  1. Client sends authentication "basic" header with id/password to server
  2. Server middleware validates id/password per custom code.
  3. Server sends base64-encoded JWT back to Client
  4. Client sends authentication "bearer" header with JWT for subsequent calls.

I realize I can likely set an auth header with the server-signed JWT, but it would be nice if the Python ADBC client did the same authentication steps as the JDBC driver for consistency.

prmoore77 avatar Dec 07 '22 21:12 prmoore77

The JDBC URI format is not standardized. I can map the C++ options as individual connection options, but I don't want to parse them out of the URI.

lidavidm avatar Dec 08 '22 14:12 lidavidm

After the latest changes to flight_sql.py - I attempted to test with:

import pyarrow.flight_sql
from dotenv import load_dotenv
import base64

# Load our environment for the password...
load_dotenv(dotenv_path=".env")

flight_password = os.environ["FLIGHT_PASSWORD"]
authorization_header = f"Basic {str(base64.b64encode(bytes(f'flight_username:{flight_password}', encoding='utf-8')), encoding='utf-8')}"
conn = pyarrow.flight_sql.connect(uri="grpc+tls://localhost:31337",
                                  db_kwargs={"adbc.flight.sql.authorization_header": authorization_header,
                                             "adbc.flight.sql.client_option.disable_server_verification": "true"
                                             },
                                  conn_kwargs=dict()
                                  )

Getting:

Ssl handshake failed: SSL_ERROR_SSL: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed.

Can you advise? I'm not sure what to put in conn_kwargs - it is a required arg. I attempted moving the server verification key/value into that arg, but to no avail. Should we default it to any empty dict?

prmoore77 avatar Dec 09 '22 13:12 prmoore77

@prmoore77 it should be "arrow..." not "adbc..."

I'll fix the kwargs - I'll address Python side next (as best as I can, since we won't have a test server to run against)

The C++ feedback should all be addressed.

lidavidm avatar Dec 09 '22 14:12 lidavidm

I added some more Python smoke tests.

Hopefully, I'll have a formal ADBC release out in the next few weeks. Then we can just install it directly in one of the CI configurations, and avoid going through pip install from Git.

lidavidm avatar Dec 09 '22 18:12 lidavidm

Hi @lidavidm - all is working now. Great work! Here is the Python and output:

import os

from dotenv import load_dotenv

import pyarrow.flight_sql

# Load our environment for the password...
load_dotenv(dotenv_path=".env")

flight_password = os.environ["FLIGHT_PASSWORD"]
authorization_header = f"Basic {str(base64.b64encode(bytes(f'flight_username:{flight_password}', encoding='utf-8')), encoding='utf-8')}"

with pyarrow.flight_sql.connect(uri="grpc+tls://localhost:31337",
                                db_kwargs={"arrow.flight.sql.authorization_header": authorization_header,
                                           "arrow.flight.sql.client_option.disable_server_verification": "true"
                                           }
                                ) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1, 'hi', 2.0")
        x = cur.fetch_arrow_table()
        print(x)

Output:

pyarrow.Table
1: int64
'hi': string
2.0: double
----
1: [[1]]
'hi': [["hi"]]
2.0: [[2]]

prmoore77 avatar Dec 09 '22 21:12 prmoore77

@pitrou thanks for your review, I believe I've addressed everything here

lidavidm avatar Dec 13 '22 17:12 lidavidm

@github-actions crossbow submit -g wheel

lidavidm avatar Jan 03 '23 16:01 lidavidm

Revision: 5c17714f7ee588ad18ad11f3749014b1b208b503

Submitted crossbow builds: ursacomputing/crossbow @ actions-081694e550

Task Status
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp311-arm64 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-mojave-cp310-amd64 Github Actions
wheel-macos-mojave-cp311-amd64 Github Actions
wheel-macos-mojave-cp37-amd64 Github Actions
wheel-macos-mojave-cp38-amd64 Github Actions
wheel-macos-mojave-cp39-amd64 Github Actions
wheel-manylinux2014-cp310-amd64 Github Actions
wheel-manylinux2014-cp310-arm64 Travis CI
wheel-manylinux2014-cp311-amd64 Github Actions
wheel-manylinux2014-cp311-arm64 Travis CI
wheel-manylinux2014-cp37-amd64 Github Actions
wheel-manylinux2014-cp37-arm64 Travis CI
wheel-manylinux2014-cp38-amd64 Github Actions
wheel-manylinux2014-cp38-arm64 Travis CI
wheel-manylinux2014-cp39-amd64 Github Actions
wheel-manylinux2014-cp39-arm64 Travis CI
wheel-windows-cp310-amd64 Github Actions
wheel-windows-cp311-amd64 Github Actions
wheel-windows-cp37-amd64 Github Actions
wheel-windows-cp38-amd64 Github Actions
wheel-windows-cp39-amd64 Github Actions

github-actions[bot] avatar Jan 03 '23 16:01 github-actions[bot]

@github-actions crossbow submit -g wheel

lidavidm avatar Jan 10 '23 19:01 lidavidm

Revision: 2dbfe9af075cb12f726cfa79854a5d812db45d08

Submitted crossbow builds: ursacomputing/crossbow @ actions-50a9b677b1

Task Status
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp311-arm64 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-mojave-cp310-amd64 Github Actions
wheel-macos-mojave-cp311-amd64 Github Actions
wheel-macos-mojave-cp37-amd64 Github Actions
wheel-macos-mojave-cp38-amd64 Github Actions
wheel-macos-mojave-cp39-amd64 Github Actions
wheel-manylinux2014-cp310-amd64 Github Actions
wheel-manylinux2014-cp310-arm64 Travis CI
wheel-manylinux2014-cp311-amd64 Github Actions
wheel-manylinux2014-cp311-arm64 Travis CI
wheel-manylinux2014-cp37-amd64 Github Actions
wheel-manylinux2014-cp37-arm64 Travis CI
wheel-manylinux2014-cp38-amd64 Github Actions
wheel-manylinux2014-cp38-arm64 Travis CI
wheel-manylinux2014-cp39-amd64 Github Actions
wheel-manylinux2014-cp39-arm64 Travis CI
wheel-windows-cp310-amd64 Github Actions
wheel-windows-cp311-amd64 Github Actions
wheel-windows-cp37-amd64 Github Actions
wheel-windows-cp38-amd64 Github Actions
wheel-windows-cp39-amd64 Github Actions

github-actions[bot] avatar Jan 10 '23 19:01 github-actions[bot]

@github-actions crossbow submit wheel-windows-cp311-amd64

lidavidm avatar Jan 10 '23 21:01 lidavidm

Revision: 3156ca33d322e116d742e4f469ce55ae58eb10bc

Submitted crossbow builds: ursacomputing/crossbow @ actions-9b624ae977

Task Status
wheel-windows-cp311-amd64 Github Actions

github-actions[bot] avatar Jan 10 '23 21:01 github-actions[bot]

@github-actions crossbow submit -g wheel

lidavidm avatar Jan 10 '23 22:01 lidavidm

Revision: 3156ca33d322e116d742e4f469ce55ae58eb10bc

Submitted crossbow builds: ursacomputing/crossbow @ actions-427a62a820

Task Status
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp311-arm64 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-mojave-cp310-amd64 Github Actions
wheel-macos-mojave-cp311-amd64 Github Actions
wheel-macos-mojave-cp37-amd64 Github Actions
wheel-macos-mojave-cp38-amd64 Github Actions
wheel-macos-mojave-cp39-amd64 Github Actions
wheel-manylinux2014-cp310-amd64 Github Actions
wheel-manylinux2014-cp310-arm64 Travis CI
wheel-manylinux2014-cp311-amd64 Github Actions
wheel-manylinux2014-cp311-arm64 Travis CI
wheel-manylinux2014-cp37-amd64 Github Actions
wheel-manylinux2014-cp37-arm64 Travis CI
wheel-manylinux2014-cp38-amd64 Github Actions
wheel-manylinux2014-cp38-arm64 Travis CI
wheel-manylinux2014-cp39-amd64 Github Actions
wheel-manylinux2014-cp39-arm64 Travis CI
wheel-windows-cp310-amd64 Github Actions
wheel-windows-cp311-amd64 Github Actions
wheel-windows-cp37-amd64 Github Actions
wheel-windows-cp38-amd64 Github Actions
wheel-windows-cp39-amd64 Github Actions

github-actions[bot] avatar Jan 10 '23 22:01 github-actions[bot]

  • Closes: #32901

github-actions[bot] avatar Jan 11 '23 13:01 github-actions[bot]