PyHive
PyHive copied to clipboard
Produce rows as slices of pyarrow.Table
I've been experimenting how to retrieve Hive resultsets in a more efficient way than a list of python tuples. For efficient computation, we need columnar representation and preferrably in contiguous memory locations.
This pull request is just an experiment with several ideas in mind:
- to exploit the fact that Hive Thrift protocol already sends the data in columnar format
- that numeric values are sent in binary (albeit big-endian order...)
- use arrow to represent the rows retrieved
- deliver chunks of rows as slices of pyarrow.Table
This minimizes data copying and parsing around, as the data will mostly be consumed by pandas/numpy, that should need no conversion whatsoever.
I know is a crude hack, mostly the edits on TCLIService.py, but any comments are welcome :)
As an example
from pyhive import hive
conn = hive.connect(...)
cursor = conn.cursor()
sql = """\
SELECT
CAST(id AS TINYINT) AS i1,
CAST(id AS SMALLINT) AS i2,
CAST(id AS INT) AS i4,
CAST(IF(id%2=0, NULL, id*id) AS BIGINT) AS i8,
CAST(ra AS FLOAT) AS f4,
CAST(ra AS DOUBLE) AS f8,
CAST(IF(id%2=0, True, False) AS BOOLEAN) AS b,
CAST(id AS STRING) AS s,
CAST(id AS VARCHAR(4)) AS v,
CAST(id AS CHAR(4)) AS c,
CAST(CAST(id AS STRING) AS BINARY) AS o,
CAST(id AS TIMESTAMP) AS ts,
CAST(CAST(id AS TIMESTAMP) AS DATE) AS d,
(INTERVAL (id) MONTHS) AS iym,
(INTERVAL (id) SECONDS) AS ids,
CAST(ra AS DECIMAL(6,3)) AS n,
ARRAY(g_des_true, r_des_true, NULL, i_des_true, z_des_true, y_des_true) AS a,
NAMED_STRUCT('id', id, 'ra', ra, 'dec', CAST(`dec` AS STRING)) AS ns,
MAP('ra', ra, 'dec', `dec`, flag, id) AS m
FROM test
LIMIT 29
"""
cursor.execute("{0}".format(sql))
print(cursor.description)
[('i1', 'TINYINT_TYPE', None, None, None, None, True),
('i2', 'SMALLINT_TYPE', None, None, None, None, True),
('i4', 'INT_TYPE', None, None, None, None, True),
('i8', 'BIGINT_TYPE', None, None, None, None, True),
('f4', 'FLOAT_TYPE', None, None, None, None, True),
('f8', 'DOUBLE_TYPE', None, None, None, None, True),
('b', 'BOOLEAN_TYPE', None, None, None, None, True),
('s', 'STRING_TYPE', None, None, None, None, True),
('v', 'VARCHAR_TYPE', None, None, None, None, True),
('c', 'CHAR_TYPE', None, None, None, None, True),
('o', 'BINARY_TYPE', None, None, None, None, True),
('ts', 'TIMESTAMP_TYPE', None, None, None, None, True),
('d', 'DATE_TYPE', None, None, None, None, True),
('iym', 'INTERVAL_YEAR_MONTH_TYPE', None, None, None, None, True),
('ids', 'INTERVAL_DAY_TIME_TYPE', None, None, None, None, True),
('n', 'DECIMAL_TYPE', None, None, None, None, True),
('a', 'ARRAY_TYPE', None, None, None, None, True),
('ns', 'STRUCT_TYPE', None, None, None, None, True),
('m', 'MAP_TYPE', None, None, None, None, True)]
And the resulting dataset
cursor.fetchmany(10)
pyarrow.Table
i1: int8
i2: int16
i4: int32
i8: int64
f4: float
f8: double
b: bool
s: string
v: string
c: string
o: binary
ts: timestamp[us]
d: date32[day]
iym: string
ids: duration[us]
n: decimal128(5, 3)
a: list<item: double>
child 0, item: double
ns: struct<id: int64, ra: double, dec: string>
child 0, id: int64
child 1, ra: double
child 2, dec: string
m: struct<ra: double, dec: double, 0: double, 1: double>
child 0, ra: double
child 1, dec: double
child 2, 0: double
child 3, 1: double
----
i1: [[-78,-78,-78,-78,-78,-78,-78,-78,-78,-78]]
i2: [[-17742,-24654,-23374,12466,16050,-25422,-2894,17074,-20558,17586]]
i4: [[169327282,145465266,94479538,43135154,139935410,59874482,142537906,204161714,30977970,4080818]]
i8: [[null,null,null,null,null,null,null,null,null,null]]
f4: [[73.38003,8.149847,25.408705,78.80194,22.99389,2.215195,37.569447,39.51681,82.01245,54.17024]]
f8: [[73.380028,8.149847,25.408705,78.801941,22.993891,2.215195,37.569447,39.516811,82.012451,54.170238]]
b: [[true,true,true,true,true,true,true,true,true,true]]
s: [["169327282","145465266","94479538","43135154","139935410","59874482","142537906","204161714","30977970","4080818"]]
v: [["1693","1454","9447","4313","1399","5987","1425","2041","3097","4080"]]
c: [["1693","1454","9447","4313","1399","5987","1425","2041","3097","4080"]]
The only limitation is that MAP columns are read as STRUCTs, because how pyarrow parses the internal JSON representation that Hive is sending :(