application error triggers log parsing quota issue on Dataproc
When running on Google Dataproc (1.3), if an error occurs, MRJob (0.6.6) parses the logs, searches for the container that caused the error, and bubbles that error log up to the user.
However, it appears there is a quota on how many Cloud Logging read requests can be made per minute (in my case, 60 per minute, which cannot be increased). The MRJob requests are not throttled and easily exceed that limit.
It is then the quota error (not the initial error) that gets reported to the user. Something like:
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'logging.googleapis.com/read_requests' and limit 'ReadRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:...'.
if authenticated using a service account with $GOOGLE_APPLICATION_CREDENTIALS,
or:
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'logging.googleapis.com/read_requests' and limit 'ReadRequestsPerMinutePerUser' of service 'logging.googleapis.com' for consumer 'project_number:...'.
if authenticated as a regular user with "gcloud auth login".
The call stack is:
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/runner.py", line 499, in run
self._run()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 464, in _run
self._run_steps()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 727, in _run_steps
job_id, step_num=step_num, num_steps=total_steps)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 808, in _wait_for_step_to_complete
error = self._pick_error(log_interpretation, step_type)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/logs/mixin.py", line 131, in _pick_error
log_interpretation, step_type, error_attempt_ids)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 936, in _interpret_task_logs
application_id, step_type, partial)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 947, in _task_log_interpretation
application_id, container_id, step_type))
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/logs/task.py", line 444, in _parse_task_syslog_records
for record in records:
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 1376, in _log_entries_to_log4j
for entry in entries:
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/page_iterator.py", line 199, in _items_iter
for page in self._page_iter(increment=False):
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/page_iterator.py", line 230, in _page_iter
page = self._next_page()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/page_iterator.py", line 512, in _next_page
response = self._method(self._request)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/gapic_v1/method.py", line 139, in __call__
return wrapped_func(*args, **kwargs)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
on_error=on_error,
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 177, in retry_target
return target()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/timeout.py", line 206, in func_with_timeout
return func(*args, **kwargs)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/grpc_helpers.py", line 61, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'logging.googleapis.com/read_requests' and limit 'ReadRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:1001190150715'.
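For context, the failing call is the paged iteration over Cloud Logging entries: each page fetched lazily from the iterator is one read request counted against the ReadRequestsPerMinute quota. Roughly this kind of call (a minimal sketch assuming the v1 google-cloud-logging client, not mrjob's exact code):

# Sketch only (assumption, not mrjob's exact code): each page pulled
# from this iterator is one read request against the per-minute quota.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
entries = client.list_entries(
    filter_='resource.type="cloud_dataproc_cluster"')
for entry in entries:  # paging happens lazily, one API call per page
    print(entry.payload)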
I tried to introduce some naive throttling with:
import time
time.sleep(0.1) # cap frequency to 10/s
near lines 993 and 1376 of mrjob/dataproc.py, but that did not solve the issue (is this code multithreaded?)
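If it is multithreaded, a per-call sleep would not bound the aggregate request rate; a shared, lock-protected limiter would (a hypothetical sketch, all names mine):

import threading
import time

class RateLimiter:
    """Cap the aggregate call rate across threads."""

    def __init__(self, max_per_minute):
        self._interval = 60.0 / max_per_minute
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next time slot under the lock, then sleep outside
        # it so waiting threads don't hold the lock while sleeping.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        time.sleep(max(0.0, slot - now))

# e.g. limiter = RateLimiter(40); call limiter.wait() before each log read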
I just realized the Google quota is 60 per minute (not 60 per second), and I should have tried instead:
import time
time.sleep(1.5) # cap frequency to 60/1.5 = 40/minute
I'm not sure why, but waiting 1.5 s does not solve the issue either: it still runs out of logging quota (same call stack), and the primary error that caused the job to fail is masked...
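Another option, rather than pacing individual requests, would be to catch the 429 and retry the whole read once the quota window resets (a hedged sketch; fetch_entries is a placeholder for whatever re-creates the entry iterator):

import time
from google.api_core.exceptions import ResourceExhausted

def read_all_entries(fetch_entries, max_tries=5):
    """fetch_entries: zero-argument callable returning a fresh
    log-entry iterator. Retries when the per-minute quota is hit."""
    for attempt in range(max_tries):
        try:
            return list(fetch_entries())
        except ResourceExhausted:
            if attempt == max_tries - 1:
                raise
            time.sleep(60)  # quota is per minute; wait for the window to reset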
Another error, probably unrelated, looked like this:
INFO:mrjob.logs.mixin:Scanning logs for probable cause of failure...
INFO:mrjob.dataproc:Scanning node manager logs for IDs of failed tasks...
INFO:mrjob.dataproc: reading syslog...
INFO:mrjob.dataproc: reading syslog...
INFO:mrjob.dataproc: reading syslog...
INFO:mrjob.dataproc: reading syslog...
...
INFO:mrjob.dataproc: reading syslog...
INFO:mrjob.dataproc: reading syslog...
INFO:mrjob.dataproc: reading stderr log...
ERROR:mrjob.dataproc:Probable cause of failure:
Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
caused by:
+ /tmp/myvenv/bin/python3 create_snippets.py --step-num=0 --mapper
while reading input from lines 1-7236056 of gs://mybucket/myfolder/part-29441.gz
INFO:mrjob.dataproc:Attempting to terminate cluster
ERROR:mrjob.dataproc:404 Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8
Traceback (most recent call last):
File "create_snippets_run.py", line 13, in <module>
runner.run()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/runner.py", line 499, in run
self._run()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 464, in _run
self._run_steps()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 727, in _run_steps
job_id, step_num=step_num, num_steps=total_steps)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 818, in _wait_for_step_to_complete
step_num=step_num, num_steps=num_steps)
mrjob.step.StepFailedException: Step 1 of 1 failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
return callable_(*args, **kwargs)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/grpc/_channel.py", line 533, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.NOT_FOUND
details = "Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8"
debug_error_string = "{"created":"@1542898706.834910588","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8","grpc_status":5}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 613, in _cleanup_cluster
self._delete_cluster(self._cluster_id)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 1255, in _delete_cluster
**self._project_id_and_region()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/cloud/dataproc_v1beta2/gapic/cluster_controller_client.py", line 523, in delete_cluster
request, retry=retry, timeout=timeout, metadata=metadata)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/gapic_v1/method.py", line 139, in __call__
return wrapped_func(*args, **kwargs)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
on_error=on_error,
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 177, in retry_target
return target()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/timeout.py", line 206, in func_with_timeout
return func(*args, **kwargs)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/grpc_helpers.py", line 61, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.NotFound: 404 Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8
Traceback (most recent call last):
File "create_snippets_run.py", line 13, in <module>
runner.run()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/runner.py", line 499, in run
self._run()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 464, in _run
self._run_steps()
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 727, in _run_steps
job_id, step_num=step_num, num_steps=total_steps)
File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 818, in _wait_for_step_to_complete
step_num=step_num, num_steps=num_steps)
mrjob.step.StepFailedException: Step 1 of 1 failed
It looks like fetching the error logs took several minutes, and the cluster had been torn down during that time, making the logs inaccessible and causing yet another error to bubble up to the user instead of the original failure.
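If that teardown race can't be avoided, the cleanup path could at least tolerate the missing cluster so the original StepFailedException isn't masked (hypothetical sketch; delete_cluster stands in for the real call):

from google.api_core.exceptions import NotFound

def cleanup_cluster(delete_cluster, cluster_id):
    try:
        delete_cluster(cluster_id)
    except NotFound:
        # Cluster already gone (e.g. torn down before cleanup ran);
        # nothing to delete, and the original job error stays visible.
        pass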
Thanks for letting me know. Will get in touch with folks at Google about the best way to handle this.