SW stopping on too many HBase retries
My strategy worker (SW) is stopping due to an unhandled exception:
```
Unhandled error in Deferred:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/task.py", line 213, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/media/sf_distributed_llshrimp/debug/frontera/frontera/worker/strategy.py", line 80, in work
    self.states.fetch(fingerprints)
  File "/media/sf_distributed_llshrimp/debug/frontera/frontera/contrib/backends/hbase.py", line 306, in fetch
    records = table.rows(keys, columns=['s:state'])
  File "/usr/local/lib/python2.7/dist-packages/happybase/table.py", line 155, in rows
    self.name, rows, columns, {})
  File "/usr/local/lib/python2.7/dist-packages/happybase/hbase/Hbase.py", line 1358, in getRowsWithColumns
    return self.recv_getRowsWithColumns()
  File "/usr/local/lib/python2.7/dist-packages/happybase/hbase/Hbase.py", line 1384, in recv_getRowsWithColumns
    raise result.io
happybase.hbase.ttypes.IOError: IOError(_message='org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 27766 actions: IOException: 27766 times, \n\tat org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:228)\n\tat org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:208)\n\tat org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1594)\n\tat org.apache.hadoop.hbase.client.HTable.batch(HTable.java:936)\n\tat org.apache.hadoop.hbase.client.HTable.batch(HTable.java:950)\n\tat org.apache.hadoop.hbase.client.HTable.get(HTable.java:911)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.getRowsWithColumnsTs(ThriftServerRunner.java:1107)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.getRowsWithColumns(ThriftServerRunner.java:1063)\n\tat sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:606)\n\tat org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)\n\tat com.sun.proxy.$Proxy9.getRowsWithColumns(Unknown Source)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$getRowsWithColumns.getResult(Hbase.java:4262)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$getRowsWithColumns.getResult(Hbase.java:4246)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:745)\n')
```
After restarting the SW, it works again (though consumption seems to proceed at a much lower rate).
@lljrsr Please check the HBase logs; the exception reports an IOException on the HBase side. I'm not sure whether it's related to the SW code, but it would be nice to investigate.
Yes, HBase is the root of the problem. The way I see it, HBase fails, which in turn makes the strategy worker fail. The strategy worker should then exit with an error instead of just stopping silently (otherwise it will not get automatically restarted by upstart or similar).
Oh, that's interesting. What if we re-establish the connection to HBase after some time and continue operation?
I think handling this error would solve the problem. I would handle it by just exiting.
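For illustration, a minimal sketch of the "just exit" option around the `states.fetch()` call from the traceback above. The wrapper function and exit code are hypothetical, not frontera's actual code; the import assumes the happybase version shown in the traceback, where Thrift errors surface as `happybase.hbase.ttypes.IOError`:
```python
# Hypothetical sketch, not frontera's actual code: catch the happybase
# error and exit so a process manager (upstart, systemd, ...) can restart us.
import sys
import logging

# Import path matches the happybase version from the traceback above.
from happybase.hbase.ttypes import IOError as HBaseIOError

logger = logging.getLogger(__name__)


def fetch_states_or_exit(states, fingerprints):
    """Wrap the states.fetch() call from strategy.py (see traceback)."""
    try:
        states.fetch(fingerprints)
    except HBaseIOError:
        logger.exception("HBase retries exhausted, giving up")
        # A non-zero exit code signals failure to the supervisor; inside the
        # Twisted reactor this would likely also require stopping the reactor.
        sys.exit(1)
```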
@lljrsr Losing the connection to HBase is unfortunately common, not only because of physical network problems but also because of HBase's internal complexity. We could solve it by exiting, that's for sure. But think of the broader picture:
- What if people don't use any process manager like upstart? Constantly exiting workers would cause them a lot of maintenance effort.
- Exiting and restarting a process is computationally expensive: loading the Python interpreter, allocating memory and ports, establishing connections (to all Kafka brokers, BTW), and flushing the state cache. In my opinion, we should exit only in case of a fatal error, where continuing isn't possible.
You are right. Reconnecting to HBase would probably be the better idea here.
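As a rough sketch of that idea (retry counts, backoff, and the wrapper itself are assumptions for illustration, not frontera's implementation), the worker could re-open the happybase Thrift connection and retry the read before giving up:
```python
# Rough sketch of reconnect-and-retry; retry count and backoff are assumed.
# Note: time.sleep() blocks, so inside the Twisted reactor a real
# implementation would use a deferred delay instead of sleeping.
import time
import logging

from happybase.hbase.ttypes import IOError as HBaseIOError
from thrift.transport.TTransport import TTransportException

logger = logging.getLogger(__name__)


def fetch_states_with_reconnect(connection, table_name, keys,
                                retries=3, backoff=5.0):
    """Retry table.rows(), re-opening the Thrift transport between attempts."""
    for attempt in range(1, retries + 1):
        try:
            table = connection.table(table_name)
            return table.rows(keys, columns=['s:state'])
        except (HBaseIOError, TTransportException):
            logger.exception("HBase fetch failed (attempt %d/%d)",
                             attempt, retries)
            time.sleep(backoff * attempt)
            try:
                connection.close()
            except Exception:
                pass
            connection.open()  # re-establish the Thrift connection
    raise RuntimeError("HBase still unavailable after %d attempts" % retries)
```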