PyHive
INSERT/executemany fails
I am trying to use pandas to insert a batch of data into a Hive table, and it fails after the first insert. PyHive seems to try to fetch a result set after each insert, does not get one, and that breaks executemany:
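For reference, the failing call is roughly a sketch like this (the connection string, table name, and data here are placeholders, not the real job):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection and data; the real table and host differ.
engine = create_engine('hive://localhost:10000/default')
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# pandas hands this off to SQLAlchemy, which calls cursor.executemany(),
# and PyHive raises "No result set" after the first INSERT.
df.to_sql('my_table', engine, index=False, if_exists='append')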
File "/usr/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1160, in to_sql
chunksize=chunksize, dtype=dtype)
File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 571, in to_sql
chunksize=chunksize, dtype=dtype)
File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 1250, in to_sql
table.insert(chunksize)
File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 770, in insert
self._execute_insert(conn, keys, chunk_iter)
File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 745, in _execute_insert
conn.execute(self.insert_statement(), data)
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
return meth(self, multiparams, params)
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
context)
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
exc_info
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _execute_context
context)
File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 447, in do_executemany
cursor.executemany(statement, parameters)
File "/usr/anaconda2/lib/python2.7/site-packages/pyhive/common.py", line 84, in executemany
self._fetch_more()
File "/usr/anaconda2/lib/python2.7/site-packages/pyhive/hive.py", line 228, in _fetch_more
raise ProgrammingError("No result set")
sqlalchemy.exc.ProgrammingError: (pyhive.exc.ProgrammingError) No result set [SQL: u'INSERT INTO TABLE
There is another issue as well: the way the batch insert is generated is not performant at all. It emits a separate INSERT per row, which causes Hive to create an MR job for each row. Is there a better way to handle a batch insert like this?
Yeah this is a bug, though I'm more inclined to just delete executemany since it's pointlessly inefficient as you mention.
You can insert a batch of rows using this syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL
It would be neat to handle that automatically, but it can't be done cleanly in executemany
because that would require parsing and rewriting the input query. Maybe there's some SQLAlchemy flag to generate a single INSERT ... VALUES
statement instead of using executemany
. I couldn't find anything obvious from Googling that.
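As a rough sketch of that suggestion, a single multi-row INSERT ... VALUES statement can be built by hand and run through a plain PyHive cursor. The table, columns, and data below are made up; it relies on PyHive's pyformat-style parameter substitution:

from pyhive import hive

# Hypothetical table and rows -- not from the report above.
rows = [(1, 'alice'), (2, 'bob'), (3, 'carol')]

conn = hive.connect(host='localhost', port=10000, database='default')
cursor = conn.cursor()

# Build one INSERT ... VALUES (...), (...), (...) statement covering all rows,
# so Hive runs a single job instead of one per row.
placeholders = ', '.join(
    '(%({i}_id)s, %({i}_name)s)'.format(i=i) for i in range(len(rows))
)
params = {}
for i, (row_id, name) in enumerate(rows):
    params['{}_id'.format(i)] = row_id
    params['{}_name'.format(i)] = name

cursor.execute('INSERT INTO TABLE my_table VALUES ' + placeholders, params)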
I have the same problem on Python 3.7: executemany fails.
/usr/local/python37/lib/python3.7/site-packages/sqlalchemy/engine/default.py in do_executemany(self, cursor, statement, parameters, context)
    531
    532     def do_executemany(self, cursor, statement, parameters, context=None):
--> 533         cursor.executemany(statement, parameters)
    534
    535     def do_execute(self, cursor, statement, parameters, context=None):
/usr/local/python37/lib/python3.7/site-packages/pyhive/common.py in executemany(self, operation, seq_of_parameters)
     88             self.execute(operation, parameters)
     89             while self._state != self._STATE_FINISHED:
---> 90                 self._fetch_more()
     91         if seq_of_parameters:
     92             self.execute(operation, seq_of_parameters[-1])
/usr/local/python37/lib/python3.7/site-packages/pyhive/hive.py in _fetch_more(self)
    378         assert(self._operationHandle is not None), "Should have an op handle in _fetch_more"
    379         if not self._operationHandle.hasResultSet:
--> 380             raise ProgrammingError("No result set")
    381         req = ttypes.TFetchResultsReq(
    382             operationHandle=self._operationHandle,
ProgrammingError: (pyhive.exc.ProgrammingError) No result set [SQL: 'INSERT INTO TABLE bi_middles.tem_st_regions_by_kh VALUES (%(stncode)s, %(stnname1)s, %(region1)s, %(sm_num1)s)'] [parameters: ({'..........})] (Background on this error at: http://sqlalche.me/e/f405)
In the end I got only one row in the Hive table.
See https://github.com/dropbox/PyHive/issues/250 for a monkey patch, which works, but is hopelessly inefficient, as mentioned.
You can also generate a multi-row SQL statement and execute it using the cursor, which is what I ended up doing (though I couldn't find a way to parameterise the string). Finally, probably the best option is to write your CSV into HDFS and use Hive's batch import, as sketched below.
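A sketch of the HDFS route, for what it's worth. The paths, table name, and delimiter are illustrative, and the CSV is assumed to already be in HDFS:

from pyhive import hive

conn = hive.connect(host='localhost', port=10000, database='default')
cursor = conn.cursor()

# Assumes my_table is a text-format table whose field delimiter matches the CSV,
# e.g. created with ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
# LOAD DATA moves the file into the table's warehouse directory in one step,
# so no per-row INSERT statements (and no per-row jobs) are generated.
cursor.execute("LOAD DATA INPATH '/tmp/staging/my_data.csv' INTO TABLE my_table")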
Is there any fix or workaround for inserting a batch of data into Hive, other than uploading to HDFS or an S3 bucket, or monkey patching pyhive.hive.Cursor's _fetch_more() method?
Here's a workaround: chunk the CSV with pandas' chunksize option and write it piece by piece.
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
https://medium.com/towards-artificial-intelligence/efficient-pandas-using-chunksize-for-large-data-sets-c66bf3037f93
https://github.com/dropbox/PyHive/issues/55
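A minimal sketch of that chunking approach (file name, chunk size, and table are placeholders); on its own it only bounds memory per write, so it is usually combined with the method="multi" option mentioned below:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('hive://localhost:10000/default')

# Read the CSV in fixed-size chunks instead of loading it all at once,
# writing each chunk to Hive as it is read.
for chunk in pd.read_csv('my_data.csv', chunksize=10000):
    chunk.to_sql('my_table', engine, index=False, if_exists='append')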
Another workaround using to_sql's method="multi" option is described here: https://github.com/dropbox/PyHive/issues/250#issuecomment-658778041
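Roughly, that option makes pandas emit one multi-row INSERT per chunk instead of calling executemany row by row (the connection string and table name below are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('hive://localhost:10000/default')
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# method='multi' binds all rows of each chunk into a single
# INSERT ... VALUES (...), (...) statement, sidestepping PyHive's broken
# executemany and the one-job-per-row overhead.
df.to_sql('my_table', engine, index=False, if_exists='append',
          method='multi', chunksize=1000)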
It works! Thanks!