SW stopping on too many HBase retries
My strategy worker (SW) is stopping due to an unhandled exception:
```
Unhandled error in Deferred:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/task.py", line 213, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/media/sf_distributed_llshrimp/debug/frontera/frontera/worker/strategy.py", line 80, in work
    self.states.fetch(fingerprints)
  File "/media/sf_distributed_llshrimp/debug/frontera/frontera/contrib/backends/hbase.py", line 306, in fetch
    records = table.rows(keys, columns=['s:state'])
  File "/usr/local/lib/python2.7/dist-packages/happybase/table.py", line 155, in rows
    self.name, rows, columns, {})
  File "/usr/local/lib/python2.7/dist-packages/happybase/hbase/Hbase.py", line 1358, in getRowsWithColumns
    return self.recv_getRowsWithColumns()
  File "/usr/local/lib/python2.7/dist-packages/happybase/hbase/Hbase.py", line 1384, in recv_getRowsWithColumns
    raise result.io
happybase.hbase.ttypes.IOError: IOError(_message='org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 27766 actions: IOException: 27766 times, \n\tat org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:228)\n\tat org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:208)\n\tat org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1594)\n\tat org.apache.hadoop.hbase.client.HTable.batch(HTable.java:936)\n\tat org.apache.hadoop.hbase.client.HTable.batch(HTable.java:950)\n\tat org.apache.hadoop.hbase.client.HTable.get(HTable.java:911)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.getRowsWithColumnsTs(ThriftServerRunner.java:1107)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.getRowsWithColumns(ThriftServerRunner.java:1063)\n\tat sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:606)\n\tat org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)\n\tat com.sun.proxy.$Proxy9.getRowsWithColumns(Unknown Source)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$getRowsWithColumns.getResult(Hbase.java:4262)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$getRowsWithColumns.getResult(Hbase.java:4246)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:745)\n')
```
After restarting the SW, it works again (though consumption seems to proceed at a much lower rate).
@lljrsr Please check the HBase logs; the exception reports an IOException on the HBase side. I'm not sure whether it's related to the SW code, but it would be nice to investigate.
Yes, HBase is the root of the problem. The way I see it, HBase fails, which in turn makes the strategy worker fail. The strategy worker should then exit with an error instead of just stopping silently (otherwise it will not get automatically restarted by upstart or similar).
Oh, that's interesting. What if we re-establish the connection to HBase after some time and continue operation?
I think handling this error would solve the problem. I would handle it by just exiting.
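For illustration, a minimal sketch of the "just exit" option around the `states.fetch()` call from the traceback above. The wrapper function and exit code are hypothetical, not frontera's actual code; the import assumes the happybase version shown in the traceback, where Thrift errors surface as `happybase.hbase.ttypes.IOError`:
```python
# Hypothetical sketch, not frontera's actual code: catch the happybase
# error and exit so a process manager (upstart, systemd, ...) can restart us.
import sys
import logging

# Import path matches the happybase version from the traceback above.
from happybase.hbase.ttypes import IOError as HBaseIOError

logger = logging.getLogger(__name__)


def fetch_states_or_exit(states, fingerprints):
    """Wrap the states.fetch() call from strategy.py (see traceback)."""
    try:
        states.fetch(fingerprints)
    except HBaseIOError:
        logger.exception("HBase retries exhausted, giving up")
        # A non-zero exit code signals failure to the supervisor; inside the
        # Twisted reactor this would likely also require stopping the reactor.
        sys.exit(1)
```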
@lljrsr Losing the connection to HBase is unfortunately common, not only because of physical network problems but also because of HBase's internal complexity. We could solve it by exiting, that's for sure. But think of the broader picture:
- What if people don't use any process manager like upstart? Constantly exiting workers would cause them a lot of maintenance effort.
- Exiting and restarting a process is computationally expensive: loading the Python interpreter, allocating memory and ports, establishing connections (to all Kafka brokers, BTW), and flushing the state cache. In my opinion, we should exit only in case of a fatal error, where continuing isn't possible.
You are right. Reconnecting to HBase would probably be the better idea here.
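As a rough sketch of that idea (retry counts, backoff, and the wrapper itself are assumptions for illustration, not frontera's implementation), the worker could re-open the happybase Thrift connection and retry the read before giving up:
```python
# Rough sketch of reconnect-and-retry; retry count and backoff are assumed.
# Note: time.sleep() blocks, so inside the Twisted reactor a real
# implementation would use a deferred delay instead of sleeping.
import time
import logging

from happybase.hbase.ttypes import IOError as HBaseIOError
from thrift.transport.TTransport import TTransportException

logger = logging.getLogger(__name__)


def fetch_states_with_reconnect(connection, table_name, keys,
                                retries=3, backoff=5.0):
    """Retry table.rows(), re-opening the Thrift transport between attempts."""
    for attempt in range(1, retries + 1):
        try:
            table = connection.table(table_name)
            return table.rows(keys, columns=['s:state'])
        except (HBaseIOError, TTransportException):
            logger.exception("HBase fetch failed (attempt %d/%d)",
                             attempt, retries)
            time.sleep(backoff * attempt)
            try:
                connection.close()
            except Exception:
                pass
            connection.open()  # re-establish the Thrift connection
    raise RuntimeError("HBase still unavailable after %d attempts" % retries)
```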