
Word count not counting the whole data

Open estebarb opened this issue 3 years ago • 3 comments

I'm running the word count program over an 86 GB dataset. The data is UTF-8, already sanitized with newlines and spaces. I already know that the total is around 29,000M words, but the output of the word count program sums to just 86M words. Also, the logs are full of "too many requests" errors.

How can I debug why the program is not reading the whole input? Is it caused by those "too many requests" errors? Is there any workaround? Thanks

estebarb avatar Nov 28 '21 17:11 estebarb

This project is not actively being maintained. That being said, what types of request errors are you seeing in the logs? I'm not sure that worker retry was ever implemented, so that may be the cause of the undercounting.

bcongdon avatar Nov 29 '21 21:11 bcongdon

Thanks! The errors are similar to:

time="2021-11-27T23:07:16-06:00" level=warning msg="Function invocation failed. (Attempt 2 of 3)"
...
time="2021-11-27T23:07:18-06:00" level=error msg="unexpected end of JSON input"
time="2021-11-27T23:07:18-06:00" level=error msg="Error when running mapper 99: TooManyRequestsException: Rate Exceeded.\n{\n  RespMetadata: {\n    StatusCode: 429,\n    RequestID: \"3bffa6f5-1a44-4749-8408-165c4d8881da\"\n  },\n  Message_: \"Rate Exceeded.\",\n  Reason: \"ConcurrentInvocationLimitExceeded\",\n  Type: \"User\"\n}"
time="2021-11-27T23:07:19-06:00" level=warning msg="Function invocation failed. (Attempt 3 of 3)"
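
To get a rough idea of how much input was dropped, the log can be scanned for the mapper IDs that ended in an error. The following is only a sketch: it assumes the log lines keep the "Error when running mapper N" wording shown above, and the helper name failedMappers is made up for illustration. If failed mappers are not retried, each ID it reports would correspond to an input split that never reached the reducers.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// mapperErr matches the error wording seen in the log excerpt above.
// The pattern is an assumption based on that excerpt only.
var mapperErr = regexp.MustCompile(`Error when running mapper (\d+)`)

// failedMappers (hypothetical helper) returns the sorted, de-duplicated
// IDs of mappers that logged an error.
func failedMappers(log string) []int {
	seen := map[int]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	// Allow long log lines (up to 1 MiB) instead of the 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		if m := mapperErr.FindStringSubmatch(sc.Text()); m != nil {
			if id, err := strconv.Atoi(m[1]); err == nil {
				seen[id] = true
			}
		}
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	// Shortened sample lines in the same shape as the real log.
	sample := `level=warning msg="Function invocation failed. (Attempt 2 of 3)"
level=error msg="Error when running mapper 99: TooManyRequestsException: Rate Exceeded."
level=error msg="Error when running mapper 7: TooManyRequestsException: Rate Exceeded."
level=error msg="Error when running mapper 99: TooManyRequestsException: Rate Exceeded."`
	fmt.Println("failed mappers:", failedMappers(sample))
}
```

Comparing the number of failed mappers with the total number of mappers gives a first estimate of the fraction of input that was never counted.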

estebarb avatar Nov 30 '21 00:11 estebarb

It may be worth trying to configure "maxConcurrency" (ref) to a lower value. It looks like the default is 500, which may be too high, in retrospect.

As for "unexpected end of JSON input": it's hard to know for sure, but if that's being emitted from the framework, I'm guessing it'd be from here, where the reducers decode the intermediate map output.
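
As a point of reference, Go's encoding/json returns exactly this message when it is handed empty or truncated input. The sketch below only demonstrates that behaviour; the keyValue type and decodeRecord helper are hypothetical and are not corral's actual intermediate format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// keyValue is a hypothetical record shape used only for illustration.
type keyValue struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// decodeRecord (hypothetical helper) decodes one JSON-encoded record.
func decodeRecord(line []byte) (keyValue, error) {
	var kv keyValue
	err := json.Unmarshal(line, &kv)
	return kv, err
}

func main() {
	inputs := []string{
		`{"key":"the","value":"1"}`, // complete record
		`{"key":"the","val`,         // record cut off mid-write
		``,                          // empty line or empty file
	}
	for _, in := range inputs {
		_, err := decodeRecord([]byte(in))
		fmt.Printf("%q -> %v\n", in, err)
	}
}
```

So one plausible explanation, consistent with the guess above, is that a reducer read an empty or partially written intermediate record left behind by a failed mapper.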

bcongdon avatar Nov 30 '21 00:11 bcongdon