
INSERT/executemany fails

Open mschmill opened this issue 8 years ago • 7 comments

I am trying to use pandas to insert a batch of data into a Hive table, and it bombs after the first insert. PyHive seems to try to fetch a result set after each insert, doesn't get one, and executemany breaks:

File "/usr/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1160, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 571, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 1250, in to_sql
    table.insert(chunksize)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 770, in insert
    self._execute_insert(conn, keys, chunk_iter)
  File "/usr/anaconda2/lib/python2.7/site-packages/pandas/io/sql.py", line 745, in _execute_insert
    conn.execute(self.insert_statement(), data)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _execute_context
    context)
  File "/usr/anaconda2/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 447, in do_executemany
    cursor.executemany(statement, parameters)
  File "/usr/anaconda2/lib/python2.7/site-packages/pyhive/common.py", line 84, in executemany
    self._fetch_more()
  File "/usr/anaconda2/lib/python2.7/site-packages/pyhive/hive.py", line 228, in _fetch_more
    raise ProgrammingError("No result set")
sqlalchemy.exc.ProgrammingError: (pyhive.exc.ProgrammingError) No result set [SQL: u'INSERT INTO TABLE

But there is another issue: the way the batch insert is generated is not performant at all. It generates a separate INSERT per row, which causes Hive to create a MapReduce job for each row. Is there a better way to handle a batch insert like this?

mschmill avatar May 27 '16 19:05 mschmill

Yeah, this is a bug, though I'm more inclined to just delete executemany since it's pointlessly inefficient, as you mention.

You can insert a batch of rows using this syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL

It would be neat to handle that automatically, but it can't be done cleanly in executemany because that would require parsing and rewriting the input query. Maybe there's some SQLAlchemy flag to generate a single INSERT ... VALUES statement instead of using executemany. I couldn't find anything obvious from Googling that.
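
For reference, here is a minimal sketch of issuing one multi-row INSERT ... VALUES statement through a PyHive cursor instead of executemany; the host, table, and column names below are hypothetical, for illustration only:

```python
from pyhive import hive

# Hypothetical connection details and table name, for illustration only.
conn = hive.connect(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

rows = [(1, "alice"), (2, "bob"), (3, "carol")]

# Build a single INSERT ... VALUES (...), (...), ... statement so Hive runs
# one job for the whole batch instead of one per row. PyHive uses pyformat
# parameters, so each row gets its own named placeholders.
placeholders = ", ".join(
    "(%({i}_id)s, %({i}_name)s)".format(i=i) for i in range(len(rows))
)
params = {}
for i, (row_id, row_name) in enumerate(rows):
    params["{}_id".format(i)] = row_id
    params["{}_name".format(i)] = row_name

cursor.execute("INSERT INTO TABLE example_table VALUES " + placeholders, params)
```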

jingw avatar May 31 '16 20:05 jingw

I have the same problem on Python 3.7: executemany fails.

/usr/local/python37/lib/python3.7/site-packages/sqlalchemy/engine/default.py in do_executemany(self, cursor, statement, parameters, context)
    531
    532     def do_executemany(self, cursor, statement, parameters, context=None):
--> 533         cursor.executemany(statement, parameters)
    534
    535     def do_execute(self, cursor, statement, parameters, context=None):

/usr/local/python37/lib/python3.7/site-packages/pyhive/common.py in executemany(self, operation, seq_of_parameters)
     88             self.execute(operation, parameters)
     89             while self._state != self._STATE_FINISHED:
---> 90                 self._fetch_more()
     91         if seq_of_parameters:
     92             self.execute(operation, seq_of_parameters[-1])

/usr/local/python37/lib/python3.7/site-packages/pyhive/hive.py in _fetch_more(self)
    378         assert(self._operationHandle is not None), "Should have an op handle in _fetch_more"
    379         if not self._operationHandle.hasResultSet:
--> 380             raise ProgrammingError("No result set")
    381         req = ttypes.TFetchResultsReq(
    382             operationHandle=self._operationHandle,

ProgrammingError: (pyhive.exc.ProgrammingError) No result set [SQL: 'INSERT INTO TABLE bi_middles.tem_st_regions_by_kh VALUES (%(stncode)s, %(stnname1)s, %(region1)s, %(sm_num1)s)'] [parameters: ({'..........})] (Background on this error at: http://sqlalche.me/e/f405)

In the end, only one row made it into the Hive table.

aa3222119 avatar May 09 '19 08:05 aa3222119

See https://github.com/dropbox/PyHive/issues/250 for a monkey patch, which works, but is hopelessly inefficient, as mentioned.

You can also generate a multi-row SQL statement and execute it using the cursor, which is what I ended up doing (though I couldn't find a way to parameterise the string). Finally, probably the best option is to write your CSV into HDFS and use Hive's batch CSV import.
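
As a rough sketch of that last option, assuming hypothetical connection details, file path, and table name, and that the file's delimiter and column order match the table definition:

```python
from pyhive import hive

# Hypothetical connection details, staging path, and table name.
conn = hive.connect(host="hive.example.com", port=10000)
cursor = conn.cursor()

# The file has already been copied into HDFS (e.g. with `hdfs dfs -put`).
# LOAD DATA moves it into the table's storage location in one operation,
# with no per-row INSERT statements at all.
cursor.execute("LOAD DATA INPATH '/tmp/staging/my_data.csv' INTO TABLE example_table")
```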

morganics avatar Nov 06 '19 20:11 morganics

Any fix or workaround for inserting a batch of data into Hive? Other than uploading to HDFS or an S3 bucket, or monkey patching pyhive.hive.Cursor's _fetch_more() method.

wilberh avatar Jul 01 '20 16:07 wilberh

Here's a workaround: chunking CSV files using a pandas DataFrame.

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
https://medium.com/towards-artificial-intelligence/efficient-pandas-using-chunksize-for-large-data-sets-c66bf3037f93
https://github.com/dropbox/PyHive/issues/55
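
A rough sketch of the chunking approach, with a hypothetical connection string, CSV file, and table name; note that chunking controls memory per batch, while the insert mechanics still depend on one of the other workarounds in this thread:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, CSV file, and table name.
engine = create_engine("hive://hive.example.com:10000/default")

# Read the CSV in fixed-size chunks rather than all at once, appending each
# chunk to the Hive table as it is read.
for chunk in pd.read_csv("my_data.csv", chunksize=10000):
    chunk.to_sql("example_table", engine, if_exists="append", index=False)
```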

wilberh avatar Jul 02 '20 02:07 wilberh

Another workaround using to_sql's method="multi" option is described here:

https://github.com/dropbox/PyHive/issues/250#issuecomment-658778041
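
For illustration, a minimal sketch of that option, with a hypothetical connection string, table name, and DataFrame:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, table name, and data.
engine = create_engine("hive://hive.example.com:10000/default")
df = pd.DataFrame({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})

# method="multi" makes pandas emit a single INSERT ... VALUES (...), (...), ...
# statement per chunk, so SQLAlchemy calls execute() instead of executemany()
# and the "No result set" error above is avoided.
df.to_sql(
    "example_table",
    engine,
    if_exists="append",
    index=False,
    method="multi",
    chunksize=1000,
)
```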

gordthompson avatar Jul 17 '20 22:07 gordthompson

Another workaround using to_sql's method="multi" option is described here:

https://github.com/dropbox/PyHive/issues/250#issuecomment-658778041

It works! Thanks!

zhujinqiu avatar Mar 04 '24 16:03 zhujinqiu