gohbase
Issue while scanning large datasets
I am using the new scanner API like so:

func ScanCh(scanner hrpc.Scanner, errCallbacks ...func(row *hrpc.Result, err error)) <-chan *hrpc.Result {
	ch := make(chan *hrpc.Result)
	go func() {
		defer close(ch)
		defer scanner.Close()
		var row *hrpc.Result
		var err error
		for err != io.EOF {
			if row, err = scanner.Next(); err == nil {
				ch <- row
			} else if err != io.EOF {
				for _, f := range errCallbacks {
					f(row, err)
				}
			}
		}
	}()
	return ch
}
After some minutes and several hundred rows being scanned, I am seeing the following error from HBase:
ERRO[0249] failed to close scanner err="HBase Java exception org.apache.hadoop.hbase.UnknownScannerException:
org.apache.hadoop.hbase.UnknownScannerException: Name: 1282974, already closed?
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
" scannerID=1282974
It should also be noted that this error occurs well before all of the rows are scanned, so I never get to scan all rows.
@timoha this also occurs when using the scanner API directly, not through a channel. Any idea what is going on here? For context, for each cell in the scan I am doing a CockroachDB database insertion. Is it possible that this operation is taking too long and the client is timing out?
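If the per-cell database insert is indeed what delays the next Next() call, one possible mitigation is to decouple scanning from inserting: drain the scanner quickly into a buffered channel and do the slow inserts in a pool of workers, so the gap between Next() calls stays short. A minimal sketch (not gohbase API; insertRow and processRows are illustrative stand-ins):

```go
package main

import (
	"fmt"
	"sync"
)

// insertRow is a hypothetical stand-in for the slow CockroachDB insert.
func insertRow(row string) {}

// processRows fans rows out to a pool of insert workers and returns the
// number of rows inserted. In real code the rows slice would instead be a
// loop over scanner.Next(), which then only performs a cheap channel send.
func processRows(rows []string, workers int) int {
	ch := make(chan string, len(rows)) // buffer absorbs insert latency
	var wg sync.WaitGroup
	var mu sync.Mutex
	inserted := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for row := range ch {
				insertRow(row)
				mu.Lock()
				inserted++
				mu.Unlock()
			}
		}()
	}
	for _, r := range rows {
		ch <- r // the scan loop only does channel sends, never slow inserts
	}
	close(ch)
	wg.Wait()
	return inserted
}

func main() {
	fmt.Println(processRows([]string{"r1", "r2", "r3"}, 2))
}
```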
I'll try to investigate soon; I've been a bit occupied. The error that you are seeing should not affect anything in your code, as it happens in the background. It would be more useful if you provided the error that is returned by the call to Next().
@timoha Interesting... I thought that was the error that was being returned by the Next() call. Unless, maybe, an io.EOF error is being returned. I'll investigate more once I get the chance.
@timoha Sorry for the (very) late reply here.
So basically, this is the error that is being returned from the Next() call, and then immediately following that, Next() returns io.EOF.
@timoha the docs for UnknownScannerException seem to suggest the client isn't "checking in" with the server and the server is closing the connection.
Thrown if a region server is passed an unknown scanner id. Usually means the client has take too long between checkins and so the scanner lease on the serverside has expired OR the serverside is closing down and has cancelled all leases.
More information: it looks like when I see this UnknownScannerException, only that one RPC fails. If there are more RPCs to make, they will continue on, but I will have lost the data for the failed RPC (save some partial data that is returned, if any). Not really sure how to recover from this.
So my statement a couple comments up is not entirely correct. The immediately following error is not necessarily io.EOF; it just so happened that during that particular scan there were no more RPCs to make, so the next call to Next() did return io.EOF.
Yeah, sounds like you either take a long time between calls to Next() or your regionserver died in the process of scanning.
For the first case, we could implement periodic scanner lease renewal: https://github.com/tsuna/gohbase/blob/master/pb/Client.proto#L285
For the second case, we need to spend some time to better handle scanner error cases and retry gracefully when we get this exception. I might have some time soon to take a stab at it.
@timoha I was able to confirm that it is the former. In this event, will it miss data? If so, I am not sure what I can do here: we have gigs of data being processed concurrently, distributed across nodes, but in order to not destroy our memory, we throttle the number of goroutines with a semaphore. If we crank this semaphore up too high, we end up pegging the CPU and stalling the system, so we have to strike a balance.
How difficult would it be to implement lease renewal?
@timoha We are encountering this again. This time we are not doing anything heavy, but we are doing a lot of scanning and inserting, and I am worried we are overloading the cluster. What happens in the event that the HBase cluster is overloaded and a request to the cluster takes a long time? Will the lease still expire because Next() hasn't been called in a bit? If so, what can we do here? (Aside from throwing more resources at the HBase cluster.)
Also, @timoha, do you think you could give me a rundown of how scanner lease renewal should work? I'd be happy to submit a PR for this, but I fear I don't have the necessary understanding. Is there any documentation on the HBase protocol that I could use to discern such information?
UnknownScannerException is handled gracefully both in AsyncHBase and in the standard HBase client, so maybe we should do that too here.
Haven't checked AsyncHBase code or standard client code in a while, but previously neither actually handled the case of partial row scanners. The problem is that if the scanner times out in the middle of the row, there's no way to safely (preserving row atomicity) restart scanning from the middle of the row. The best option has always been for clients to explicitly keep track of the last row they've scanned and restart scanning from the beginning of the row when exception happens. That way each client can take care of duplicates their own way.
That being said, maybe there's an API that somehow uses MVCC to properly address this now, but scanners can blow up with OOM if you don't rely on the partial row scanning feature.
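The restart strategy described above can be sketched as follows. The scanner type and open function are stand-ins for gohbase's scanner and for building a fresh scan request with an adjusted start row; a real implementation would also cap retries and, if partial rows are in play, re-scan from the beginning of the last row and deduplicate:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errScannerExpired simulates an UnknownScannerException surfacing from Next().
var errScannerExpired = errors.New("org.apache.hadoop.hbase.UnknownScannerException")

// rowScanner yields rows from pos onward and fails exactly once when it
// reaches failAt, to simulate the scanner lease expiring mid-scan.
type rowScanner struct {
	rows   []string
	pos    int
	failed *bool
	failAt int
}

func (s *rowScanner) Next() (string, error) {
	if !*s.failed && s.pos == s.failAt {
		*s.failed = true
		return "", errScannerExpired
	}
	if s.pos >= len(s.rows) {
		return "", io.EOF
	}
	r := s.rows[s.pos]
	s.pos++
	return r, nil
}

// scanAll remembers the last fully scanned row and, when the scanner blows
// up, reopens a new scan just after that row — the client-side bookkeeping
// described in the comment above.
func scanAll(open func(startAfter string) *rowScanner) []string {
	var out []string
	last := ""
	for {
		sc := open(last)
		for {
			row, err := sc.Next()
			if err == io.EOF {
				return out
			}
			if err != nil {
				break // lease expired: reopen from the last good row
			}
			out = append(out, row)
			last = row
		}
	}
}

func main() {
	all := []string{"a", "b", "c", "d"}
	failed := false
	open := func(startAfter string) *rowScanner {
		pos := 0
		for i, r := range all {
			if r == startAfter {
				pos = i + 1
			}
		}
		return &rowScanner{rows: all, pos: pos, failed: &failed, failAt: 2}
	}
	fmt.Println(scanAll(open)) // all four rows despite the mid-scan failure
}
```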
Closing in favour of #91