
Issue while scanning large datasets

Open jonbonazza opened this issue 6 years ago • 12 comments

I am using the new scanner api like so:

import (
	"io"

	"github.com/tsuna/gohbase/hrpc"
)

// ScanCh drains a scanner into a channel, invoking errCallbacks for any
// non-EOF error. The channel is closed once the scan is exhausted, and the
// scanner is always closed before the goroutine exits.
func ScanCh(scanner hrpc.Scanner, errCallbacks ...func(row *hrpc.Result, err error)) <-chan *hrpc.Result {
	ch := make(chan *hrpc.Result)
	go func() {
		defer close(ch)
		defer scanner.Close()
		var row *hrpc.Result
		var err error
		for err != io.EOF {
			if row, err = scanner.Next(); err == nil {
				ch <- row
			} else if err != io.EOF {
				// io.EOF just means the scan is done; report anything else.
				for _, f := range errCallbacks {
					f(row, err)
				}
			}
		}
	}()
	return ch
}
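
For reference, I consume the channel roughly like this (client is a gohbase.Client created elsewhere; the table name and the per-cell process function are placeholders):

scan, err := hrpc.NewScanStr(context.Background(), "mytable")
if err != nil {
	log.Fatal(err)
}
for row := range ScanCh(client.Scan(scan), func(row *hrpc.Result, err error) {
	log.Printf("scan error: %v", err)
}) {
	for _, cell := range row.Cells {
		process(cell) // placeholder for the real per-cell work
	}
}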

After some minutes and several hundred rows being scanned, I am seeing the following error from HBase:

ERRO[0249] failed to close scanner                       err="HBase Java exception org.apache.hadoop.hbase.UnknownScannerException:
org.apache.hadoop.hbase.UnknownScannerException: Name: 1282974, already closed?
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
" scannerID=1282974

It should also be noted that this error occurs well before all of the rows are scanned, so I never get to scan all rows.

jonbonazza avatar Jul 27 '17 23:07 jonbonazza

@timoha this also occurs when using the scanner API directly, not through a channel. Any idea what is going on here? For context, for each cell in the scan I am doing an insert into CockroachDB. Is it possible that this operation is taking too long and the client is timing out?

jonbonazza avatar Jul 31 '17 18:07 jonbonazza

I'll try to investigate soon; I've been a bit occupied. The error that you are seeing should not affect anything in your code, as it happens in the background. It would be more useful if you provided the error that is returned by the call to Next().

timoha avatar Aug 09 '17 23:08 timoha

@timoha Interesting... I thought that was the error that was being returned by the Next() call. Unless, maybe, an io.EOF error is being returned. I'll investigate more once I get the chance.

jonbonazza avatar Aug 17 '17 20:08 jonbonazza

@timoha Sorry for the (very) late reply here. So basically, this is the error that is being returned from the Next() call, and immediately following that, Next() returns io.EOF.

jonbonazza avatar Oct 25 '17 20:10 jonbonazza

@timoha the docs for UnknownScannerException seem to suggest the client isn't "checking in" with the server and the server is closing the connection.

Thrown if a region server is passed an unknown scanner id. Usually means the client has taken too long between checkins and so the scanner lease on the serverside has expired OR the serverside is closing down and has cancelled all leases.

jonbonazza avatar Oct 25 '17 21:10 jonbonazza

More information: it looks like when I see this UnknownScannerException, only that one RPC fails. If there are more RPCs to make, they will continue on, but I will have lost the data for the failed RPC (save some partial data that is returned, if any). Not really sure how to recover from this.

So my statement a couple of comments up is not entirely correct. The error immediately following is not necessarily io.EOF; it just so happened that during that particular scan there were no more RPCs to make, so the next call to Next() did return io.EOF.

jonbonazza avatar Oct 25 '17 23:10 jonbonazza

Yeah, sounds like you are either taking a long time between calls to Next() or your regionserver died in the process of scanning.

For the first case, we could implement periodic scanner lease renewal: https://github.com/tsuna/gohbase/blob/master/pb/Client.proto#L285

For the second case, we need to spend some time better handling error cases for the scanner and retrying gracefully when we get this exception. I might have some time soon to take a stab at it.
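
Very roughly, and assuming the generated pb.ScanRequest exposes the renew field from that proto, a renewal loop might look like the sketch below. The sendToRegionServer helper is made up and stands in for however the client would route the RPC to the right regionserver:

import (
	"time"

	"github.com/golang/protobuf/proto"
	"github.com/tsuna/gohbase/pb"
)

// renewLease is a hypothetical sketch: periodically send a ScanRequest with
// Renew set so the server extends the scanner lease without returning rows.
func renewLease(scannerID uint64, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			req := &pb.ScanRequest{
				ScannerId: proto.Uint64(scannerID),
				Renew:     proto.Bool(true), // ask only for a lease extension
			}
			sendToRegionServer(req) // made-up transport call
		}
	}
}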

timoha avatar Nov 06 '17 19:11 timoha

@timoha I was able to confirm that it is the former. In this event, will it miss data? If so, I am not sure what I can do here, as we have gigabytes of data being processed concurrently, distributed across nodes; in order not to destroy our memory, we throttle the number of goroutines with a semaphore. If we crank this semaphore up too high, we end up pegging the CPU and stalling the system, so we have to strike a balance.

How difficult would it be to implement lease renewal?

jonbonazza avatar Nov 06 '17 21:11 jonbonazza

@timoha We are again encountering this. This time we are not doing anything heavy, but we are doing a lot of scanning and inserting. I am worried we are overloading the cluster, and I was wondering: what happens in the event that the HBase cluster is overloaded and a request to the cluster takes a long time? Will the lease still expire because Next() hasn't been called in a while? If so, what can we do here (aside from throwing more resources at the HBase cluster)?

jonbonazza avatar May 13 '18 18:05 jonbonazza

Also, @timoha, do you think you could give me a rundown of how scanner lease renewal should work? I'd be happy to submit a PR for this, but I fear I don't have the necessary understanding. Is there any documentation on the HBase protocol that I could use to discern such information?

jonbonazza avatar May 15 '18 17:05 jonbonazza

UnknownScannerException is handled gracefully both in AsyncHBase and in the standard HBase client, so maybe we should do that here too.

tsuna avatar Nov 12 '21 14:11 tsuna

Haven't checked the AsyncHBase code or the standard client code in a while, but previously neither actually handled the case of partial-row scanners. The problem is that if the scanner times out in the middle of a row, there's no way to safely (preserving row atomicity) restart scanning from the middle of that row. The best option has always been for clients to explicitly keep track of the last row they've scanned and restart scanning from the beginning of that row when the exception happens (roughly as sketched below). That way each client can deal with duplicates in its own way.

That being said, maybe there's an API that somehow uses MVCC to properly address this now, but scanners can blow up with OOM if you don't rely on the partial-row scanning feature.
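
A minimal sketch of that restart strategy, assuming gohbase's hrpc.NewScanRangeStr; the process function is made up, and the caller must tolerate re-seeing the restart row:

func scanWithRestart(ctx context.Context, client gohbase.Client, table string) error {
	startRow := "" // empty start row means "from the beginning of the table"
	for {
		scan, err := hrpc.NewScanRangeStr(ctx, table, startRow, "")
		if err != nil {
			return err
		}
		scanner := client.Scan(scan)
		for {
			row, err := scanner.Next()
			if err == io.EOF {
				scanner.Close()
				return nil // scan finished normally
			}
			if err != nil {
				// Scanner blew up (e.g. UnknownScannerException); restart
				// from the beginning of the last row we fully received.
				scanner.Close()
				log.Printf("scanner failed (%v); restarting at row %q", err, startRow)
				break
			}
			if len(row.Cells) > 0 {
				startRow = string(row.Cells[0].Row)
			}
			process(row) // made-up per-row work; must tolerate duplicates
		}
	}
}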

timoha avatar Nov 13 '21 02:11 timoha

Closing in favour of #91

dethi avatar Mar 28 '23 19:03 dethi