GET requests sometimes result in 500 Internal error responses
I use an Ambry instance for saving and serving media files (images and videos), and have started having problems with some requests failing with status 500 Internal Server Error and the following exception trace in the log:
[2017-10-10 16:02:52,366] ERROR Internal error handling request /AAEAAQAAAAAAAAAAAAAAJGExMDlhYWFjLWNjNGYtNDVjOS04ZjliLTUxMDY0MTdhZWY5OQ with method GET (com.github.ambry.rest.NettyResponseChannel)
com.github.ambry.rest.RestServiceException: com.github.ambry.router.RouterException: Timed out waiting for a response Error: OperationTimedOut
at com.github.ambry.frontend.AmbryBlobStorageService.submitResponse(AmbryBlobStorageService.java:322)
at com.github.ambry.frontend.AmbryBlobStorageService$GetCallback.onCompletion(AmbryBlobStorageService.java:786)
at com.github.ambry.frontend.AmbryBlobStorageService$GetCallback.onCompletion(AmbryBlobStorageService.java:698)
at com.github.ambry.router.NonBlockingRouter$1.onCompletion(NonBlockingRouter.java:155)
at com.github.ambry.router.NonBlockingRouter$1.onCompletion(NonBlockingRouter.java:149)
at com.github.ambry.router.NonBlockingRouter.completeOperation(NonBlockingRouter.java:392)
at com.github.ambry.router.GetBlobOperation.abort(GetBlobOperation.java:146)
at com.github.ambry.router.GetBlobOperation.poll(GetBlobOperation.java:263)
at com.github.ambry.router.GetManager.poll(GetManager.java:155)
at com.github.ambry.router.NonBlockingRouter$OperationController.pollForRequests(NonBlockingRouter.java:522)
at com.github.ambry.router.NonBlockingRouter$OperationController.run(NonBlockingRouter.java:575)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.github.ambry.router.RouterException: Timed out waiting for a response Error: OperationTimedOut
at com.github.ambry.router.GetBlobOperation$GetChunk.cleanupExpiredInFlightRequests(GetBlobOperation.java:553)
at com.github.ambry.router.GetBlobOperation$GetChunk.poll(GetBlobOperation.java:530)
at com.github.ambry.router.GetBlobOperation.poll(GetBlobOperation.java:232)
... 4 more
- Ambry version: recent build from commit e1a688eccb040934c3e6e94281da6c2b9826522d
- JVM version: 1.8.0_131
@ilyai : Does this happen regularly, or only in one-off cases? Was it running in good condition before this patch?
On behalf of @ilyai I shall reply: it happens regularly. No, the error was the same on the previous patch too.
@rabzu @ilyai as the error message says, this happens when the router sends out a request to a datanode but does not receive a response within the preconfigured expiry time. This could happen if requests are blocked waiting for connections, or if the datanode gets the request but does not send a response in time, etc. A few questions and tips for debugging this:
- What is your cluster topology? How many frontend/router nodes do you have? How many datanodes? How many datacenters?
- What is the typical size of the blobs on which GET is being done?
- Look at the datanode (ambry-server) logs to see if it is getting the requests that are being sent by the router and whether there are any errors (you may need to enable trace logging; see the sketch after this list).
- Try increasing router.request.timeout.ms in the RouterConfig (the default of 2 seconds should actually be plenty, but it will give you more information on the problem).
- Check that you have enough connections in the router.scaling.unit.max.connections.per.port.{plain.text,ssl} configs to process your requests. This should be set to a reasonable value based on the number of datanodes you have and the QPS.
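For the trace-logging tip above, here is a minimal sketch of the kind of override you could add on the datanode side. It assumes the server is started with a stock log4j.properties; the package names below (com.github.ambry.server, com.github.ambry.network) are only a starting point, and the exact file location depends on your deployment.
# Enable TRACE logging for the request-handling and network layers on the ambry-server
# (append to the log4j.properties used by the datanode and restart it)
log4j.logger.com.github.ambry.server=TRACE
log4j.logger.com.github.ambry.network=TRACE
With that in place you can correlate the blob id from the frontend error (e.g. the GET path in the trace above) with the server-side log to see whether the request arrived at the datanode and how long it took to respond.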
@pnarayanan Thanks a lot for your detailed response. Currently we have one Ambry node for development and one for production. They're used to serve media files: images (~200 KB) and videos (up to ~100 MB).
I added the following settings:
router.request.timeout.ms=10000
router.scaling.unit.max.connections.per.port.plain.text=20
router.scaling.unit.max.connections.per.port.ssl=20
Also, our Ambry nodes are accessible over SSL, but we use an Apache reverse proxy for that. If this might cause issues, I think we could try using Ambry's built-in SSL support.
@ilyai was this particular issue resolved? Can this be closed?