PyHive icon indicating copy to clipboard operation
PyHive copied to clipboard

Produce rows as slices of pyarrow.Table

Open ptallada opened this issue 3 years ago • 2 comments

I've been experimenting how to retrieve Hive resultsets in a more efficient way than a list of python tuples. For efficient computation, we need columnar representation and preferrably in contiguous memory locations.

This pull request is just an experiment with several ideas in mind:

  • to exploit the fact that Hive Thrift protocol already sends the data in columnar format
  • that numeric values are sent in binary (albeit big-endian order...)
  • use arrow to represent the rows retrieved
  • deliver chunks of rows as slices of pyarrow.Table

This minimizes data copying and parsing around, as the data will mostly be consumed by pandas/numpy, that should need no conversion whatsoever.

I know is a crude hack, mostly the edits on TCLIService.py, but any comments are welcome :)

ptallada avatar Feb 08 '22 17:02 ptallada

As an example

from pyhive import hive

conn = hive.connect(...)
cursor = conn.cursor()

sql = """\
SELECT
  CAST(id AS TINYINT) AS i1,
  CAST(id AS SMALLINT) AS i2,
  CAST(id AS INT) AS i4,
  CAST(IF(id%2=0, NULL, id*id) AS BIGINT) AS i8,
  CAST(ra AS FLOAT) AS f4,
  CAST(ra AS DOUBLE) AS f8,
  CAST(IF(id%2=0, True, False) AS BOOLEAN) AS b,
  CAST(id AS STRING) AS s,
  CAST(id AS VARCHAR(4)) AS v,
  CAST(id AS CHAR(4)) AS c,
  CAST(CAST(id AS STRING) AS BINARY) AS o,
  CAST(id AS TIMESTAMP) AS ts,
  CAST(CAST(id AS TIMESTAMP) AS DATE) AS d,
  (INTERVAL (id) MONTHS) AS iym,
  (INTERVAL (id) SECONDS) AS ids,
  CAST(ra AS DECIMAL(6,3)) AS n,
  ARRAY(g_des_true, r_des_true, NULL, i_des_true, z_des_true, y_des_true) AS a,
  NAMED_STRUCT('id', id, 'ra', ra, 'dec', CAST(`dec` AS STRING)) AS ns,
  MAP('ra', ra, 'dec', `dec`, flag, id) AS m
FROM test
LIMIT 29
"""
cursor.execute("{0}".format(sql))

print(cursor.description)
[('i1', 'TINYINT_TYPE', None, None, None, None, True),
 ('i2', 'SMALLINT_TYPE', None, None, None, None, True),
 ('i4', 'INT_TYPE', None, None, None, None, True),
 ('i8', 'BIGINT_TYPE', None, None, None, None, True),
 ('f4', 'FLOAT_TYPE', None, None, None, None, True),
 ('f8', 'DOUBLE_TYPE', None, None, None, None, True),
 ('b', 'BOOLEAN_TYPE', None, None, None, None, True),
 ('s', 'STRING_TYPE', None, None, None, None, True),
 ('v', 'VARCHAR_TYPE', None, None, None, None, True),
 ('c', 'CHAR_TYPE', None, None, None, None, True),
 ('o', 'BINARY_TYPE', None, None, None, None, True),
 ('ts', 'TIMESTAMP_TYPE', None, None, None, None, True),
 ('d', 'DATE_TYPE', None, None, None, None, True),
 ('iym', 'INTERVAL_YEAR_MONTH_TYPE', None, None, None, None, True),
 ('ids', 'INTERVAL_DAY_TIME_TYPE', None, None, None, None, True),
 ('n', 'DECIMAL_TYPE', None, None, None, None, True),
 ('a', 'ARRAY_TYPE', None, None, None, None, True),
 ('ns', 'STRUCT_TYPE', None, None, None, None, True),
 ('m', 'MAP_TYPE', None, None, None, None, True)]

And the resulting dataset

cursor.fetchmany(10)
pyarrow.Table
i1: int8
i2: int16
i4: int32
i8: int64
f4: float
f8: double
b: bool
s: string
v: string
c: string
o: binary
ts: timestamp[us]
d: date32[day]
iym: string
ids: duration[us]
n: decimal128(5, 3)
a: list<item: double>
  child 0, item: double
ns: struct<id: int64, ra: double, dec: string>
  child 0, id: int64
  child 1, ra: double
  child 2, dec: string
m: struct<ra: double, dec: double, 0: double, 1: double>
  child 0, ra: double
  child 1, dec: double
  child 2, 0: double
  child 3, 1: double
----
i1: [[-78,-78,-78,-78,-78,-78,-78,-78,-78,-78]]
i2: [[-17742,-24654,-23374,12466,16050,-25422,-2894,17074,-20558,17586]]
i4: [[169327282,145465266,94479538,43135154,139935410,59874482,142537906,204161714,30977970,4080818]]
i8: [[null,null,null,null,null,null,null,null,null,null]]
f4: [[73.38003,8.149847,25.408705,78.80194,22.99389,2.215195,37.569447,39.51681,82.01245,54.17024]]
f8: [[73.380028,8.149847,25.408705,78.801941,22.993891,2.215195,37.569447,39.516811,82.012451,54.170238]]
b: [[true,true,true,true,true,true,true,true,true,true]]
s: [["169327282","145465266","94479538","43135154","139935410","59874482","142537906","204161714","30977970","4080818"]]
v: [["1693","1454","9447","4313","1399","5987","1425","2041","3097","4080"]]
c: [["1693","1454","9447","4313","1399","5987","1425","2041","3097","4080"]]

The only limitation is that MAP columns are read as STRUCTs, because how pyarrow parses the internal JSON representation that Hive is sending :(

ptallada avatar Feb 08 '22 17:02 ptallada

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Apr 16 '22 21:04 CLAassistant