
Increase grpc max message size

Open fhoering opened this issue 6 years ago • 8 comments

  • Currently the limit in Python is 4 MB, which fails when retrieving large application logs

fhoering avatar Apr 02 '20 17:04 fhoering

@jcrist Any news on that? What about making it configurable?

grpc_max_receive_message_length = int(os.environ.get("SKEIN_MAX_MSG_SIZE", 50 * 1024 * 1024))
..
('grpc.max_receive_message_length', grpc_max_receive_message_length)
..

fhoering avatar May 18 '20 16:05 fhoering
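As a hedged sketch of how that override could be wired into gRPC channel options (`grpc_log_options` is a hypothetical helper name; `SKEIN_MAX_MSG_SIZE` is the environment variable proposed above):

```python
import os

def grpc_log_options(default=50 * 1024 * 1024):
    """Build gRPC channel options honoring the proposed size override.

    Note: os.environ.get returns a string when SKEIN_MAX_MSG_SIZE is set,
    so cast to int before handing the value to gRPC.
    """
    max_msg = int(os.environ.get("SKEIN_MAX_MSG_SIZE", default))
    return [
        ("grpc.max_receive_message_length", max_msg),
        ("grpc.max_send_message_length", max_msg),
    ]

# The list would then be passed as grpc.insecure_channel(address, options=...).
```

The `int(...)` cast matters because environment variables always come back as strings, and gRPC expects an integer for these options.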

Apologies, things have been a bit chaotic. Thinking about this a bit more, I think the method returning logs should be a streaming output method, so the message size matters less. We might stream one response per container, which would hopefully stay below this limit. Thoughts?

jcrist avatar May 18 '20 16:05 jcrist

Why not, but I think one response per container could still easily go above 4 MB. It turns out even 50 MB is not enough in some cases. We have jobs that produce logs for 24 hours, and each spawned container produces the same amount of logs.

In that case we would even need to stream the response for each container in chunks. But that also means we have to merge it back on the client side to have a cleaner API, right? I'm not sure a list of strings would be a nice API to use.

fhoering avatar May 18 '20 16:05 fhoering

Hmmm, yeah. In that case we may want to stream in fixed-bytesize chunks (line-by-line seems like too much to me). To the end user this should keep the same Python API, so we'd have to reassemble in the client before returning.

I'm fine merging this as a stop-gap for now, but I think we should redesign the logs API to avoid this issue entirely.

jcrist avatar May 18 '20 17:05 jcrist
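To illustrate the fixed-bytesize chunking idea, here is a minimal Python sketch (`chunk_logs` and `reassemble` are hypothetical names; the 1 MiB chunk size is an assumption, safely under gRPC's 4 MB default):

```python
from collections import defaultdict

CHUNK_SIZE = 1024 * 1024  # assumed 1 MiB per streamed message, well under 4 MB

def chunk_logs(logs):
    """Server side: yield (container_id, chunk) pairs in fixed-size chunks."""
    for container_id, text in logs.items():
        for i in range(0, len(text), CHUNK_SIZE):
            yield container_id, text[i:i + CHUNK_SIZE]

def reassemble(stream):
    """Client side: merge streamed chunks back into one string per container."""
    parts = defaultdict(list)
    for container_id, chunk in stream:
        parts[container_id].append(chunk)
    return {cid: "".join(chunks) for cid, chunks in parts.items()}
```

Round-tripping through these two functions preserves the dict-of-strings shape on the client, so the visible Python API could stay the same.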

OK. I agree it would be nice not to have to specify this setting.

Having had a quick look, we could change the proto like this:

rpc getLogs (LogsRequest) returns (stream LogsResponse);

message LogsResponse {
  string key = 1;  // the container_id
  string data = 2; // one chunk of the log
}

Then send n LogsResponse messages and aggregate them back on the client. It seems easy on the Python side to just iterate over the messages and merge them. The only complication is propagating this all the way down to the YARN API on the Java side.

fhoering avatar May 19 '20 17:05 fhoering
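Following that proto sketch, the client-side merge could look like this (`LogsResponse` here is a NamedTuple stand-in for the generated protobuf message; `merge_log_stream` is a hypothetical helper):

```python
from typing import Iterable, NamedTuple

class LogsResponse(NamedTuple):
    # Stand-in for the message generated from the proto sketch.
    key: str   # the container_id
    data: str  # one chunk of the log

def merge_log_stream(responses: Iterable[LogsResponse]) -> dict:
    """Aggregate n LogsResponse messages into {container_id: full_log}."""
    merged = {}
    for msg in responses:
        merged[msg.key] = merged.get(msg.key, "") + msg.data
    return merged
```

Chunks for different containers may interleave in the stream, which is why the merge keys on `msg.key` rather than assuming contiguous runs.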

👍 That rpc design makes sense to me.

jcrist avatar May 19 '20 18:05 jcrist

OK. I also had a look at LogClient, and it could just produce a stream with the Java 8 Stream API. I will work on it and propose something in another PR.

In the meantime, if you have time, please have a look at https://github.com/jcrist/skein/pull/213, as it also touches the log API and could otherwise produce conflicts.

fhoering avatar May 26 '20 11:05 fhoering

Hi folks,

I've been hitting this issue consistently, with my YARN jobs generating around 20 MB of logs.

It looks like there was agreement on merging the stop-gap measure proposed in this PR. If that is still the case, can we please merge?

xabriel avatar Jul 01 '22 16:07 xabriel