application error triggers log parsing quota issue on Dataproc

Open erpic opened this issue 6 years ago • 4 comments

When running on Google Dataproc (1.3), if an error occurs, mrjob (0.6.6) parses the logs, searches for the container that caused the error, and bubbles that error log up to the user.

However, it appears there is a quota on how many log read requests can be placed per minute on Dataproc (in my case, 60 per minute, which cannot be increased). mrjob's requests are not throttled and easily exceed that limit.

It is then the quota error (not the initial error) that gets reported to the user, something like:

google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'logging.googleapis.com/read_requests' and limit 'ReadRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:...'.

if authenticated using a service account via $GOOGLE_APPLICATION_CREDENTIALS

or:

google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'logging.googleapis.com/read_requests' and limit 'ReadRequestsPerMinutePerUser' of service 'logging.googleapis.com' for consumer 'project_number:...'.

if authenticated as a regular user with "gcloud auth login"

The call stack is:

  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/runner.py", line 499, in run
    self._run()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 464, in _run
    self._run_steps()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 727, in _run_steps
    job_id, step_num=step_num, num_steps=total_steps)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 808, in _wait_for_step_to_complete
    error = self._pick_error(log_interpretation, step_type)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/logs/mixin.py", line 131, in _pick_error
    log_interpretation, step_type, error_attempt_ids)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 936, in _interpret_task_logs
    application_id, step_type, partial)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 947, in _task_log_interpretation
    application_id, container_id, step_type))
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/logs/task.py", line 444, in _parse_task_syslog_records
    for record in records:
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 1376, in _log_entries_to_log4j
    for entry in entries:
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/page_iterator.py", line 199, in _items_iter
    for page in self._page_iter(increment=False):
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/page_iterator.py", line 230, in _page_iter
    page = self._next_page()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/page_iterator.py", line 512, in _next_page
    response = self._method(self._request)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/gapic_v1/method.py", line 139, in __call__
    return wrapped_func(*args, **kwargs)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
    on_error=on_error,
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 177, in retry_target
    return target()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/timeout.py", line 206, in func_with_timeout
    return func(*args, **kwargs)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/grpc_helpers.py", line 61, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for quota metric 'logging.googleapis.com/read_requests' and limit 'ReadRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:1001190150715'.

I tried to introduce some naive throttling with:

import time
time.sleep(0.1) # cap frequency to 10/s

near lines 993 and 1376 of mrjob/dataproc.py, but that did not solve the issue (is this code multithreaded?)
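
Another idea I haven't tried yet: instead of sleeping unconditionally, catch the 429 and retry with exponential backoff. A rough, untested sketch; fetch_page is a hypothetical stand-in for whatever request mrjob actually issues, but ResourceExhausted is the real exception class from google-api-core:

import time

from google.api_core.exceptions import ResourceExhausted

def call_with_backoff(fetch_page, max_tries=5, initial_delay=2.0):
    """Retry fetch_page() on 429 quota errors, doubling the delay each time."""
    delay = initial_delay
    for attempt in range(max_tries):
        try:
            return fetch_page()
        except ResourceExhausted:
            if attempt == max_tries - 1:
                raise  # out of retries, surface the quota error
            time.sleep(delay)
            delay *= 2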

erpic avatar Nov 22 '18 08:11 erpic

Just realized the Google quota is 60 per minute (not 60 per second), so capping at 10 requests per second was still far above it. I should have tried instead:

import time
time.sleep(1.5) # cap frequency to 60/1.5 = 40/minute

erpic avatar Nov 22 '18 08:11 erpic

I'm not sure why, but waiting 1.5s between requests does not solve the issue either: the job still runs out of logging quota (same call stack), and the primary error that caused the job to fail is masked...
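
If the log-fetching code is multithreaded, a bare time.sleep() at one call site would not serialize requests across threads, which could explain why the delay has no effect. A shared, lock-protected limiter would be needed instead. An untested sketch (names are mine, stdlib only):

import threading
import time

class RateLimiter:
    """Allow at most max_per_minute calls per minute, across all threads."""

    def __init__(self, max_per_minute=40):
        self._interval = 60.0 / max_per_minute
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # reserve the next time slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        if slot > now:
            time.sleep(slot - now)

A single shared instance, with limiter.wait() called before every log request, would cap the aggregate rate no matter how many threads are reading.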

erpic avatar Nov 22 '18 10:11 erpic

Another error, probably unrelated, looked like this:

INFO:mrjob.logs.mixin:Scanning logs for probable cause of failure...
INFO:mrjob.dataproc:Scanning node manager logs for IDs of failed tasks...
INFO:mrjob.dataproc:    reading syslog...
INFO:mrjob.dataproc:    reading syslog...
INFO:mrjob.dataproc:    reading syslog...
INFO:mrjob.dataproc:    reading syslog...

...


INFO:mrjob.dataproc:    reading syslog...
INFO:mrjob.dataproc:    reading syslog...
INFO:mrjob.dataproc:    reading stderr log...
ERROR:mrjob.dataproc:Probable cause of failure:

Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:326)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)

caused by:

+ /tmp/myvenv/bin/python3 create_snippets.py --step-num=0 --mapper

while reading input from lines 1-7236056 of gs://mybucket/myfolder/part-29441.gz


INFO:mrjob.dataproc:Attempting to terminate cluster
ERROR:mrjob.dataproc:404 Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8
Traceback (most recent call last):
  File "create_snippets_run.py", line 13, in <module>
    runner.run()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/runner.py", line 499, in run
    self._run()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 464, in _run
    self._run_steps()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 727, in _run_steps
    job_id, step_num=step_num, num_steps=total_steps)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 818, in _wait_for_step_to_complete
    step_num=step_num, num_steps=num_steps)
mrjob.step.StepFailedException: Step 1 of 1 failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/grpc/_channel.py", line 533, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.NOT_FOUND
        details = "Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8"
        debug_error_string = "{"created":"@1542898706.834910588","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8","grpc_status":5}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 613, in _cleanup_cluster
    self._delete_cluster(self._cluster_id)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 1255, in _delete_cluster
    **self._project_id_and_region()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/cloud/dataproc_v1beta2/gapic/cluster_controller_client.py", line 523, in delete_cluster
    request, retry=retry, timeout=timeout, metadata=metadata)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/gapic_v1/method.py", line 139, in __call__
    return wrapped_func(*args, **kwargs)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
    on_error=on_error,
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/retry.py", line 177, in retry_target
    return target()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/timeout.py", line 206, in func_with_timeout
    return func(*args, **kwargs)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/google/api_core/grpc_helpers.py", line 61, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.NotFound: 404 Not found: Cluster projects/gproject-123456/regions/us-west1/clusters/mrjob-us-west1-a19f95931ccbbba8
Traceback (most recent call last):
  File "create_snippets_run.py", line 13, in <module>
    runner.run()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/runner.py", line 499, in run
    self._run()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 464, in _run
    self._run_steps()
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 727, in _run_steps
    job_id, step_num=step_num, num_steps=total_steps)
  File "/home/myuser/myproject/myvenv/lib/python3.5/site-packages/mrjob/dataproc.py", line 818, in _wait_for_step_to_complete
    step_num=step_num, num_steps=num_steps)
mrjob.step.StepFailedException: Step 1 of 1 failed

It looks like fetching the error logs took several minutes, and the cluster had been torn down in the meantime, making the logs inaccessible and causing yet another error to bubble up to the user.
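
A possible mrjob-side mitigation for the cleanup failure (untested sketch; _delete_cluster is the internal method visible in the trace above, the wrapper name is mine) would be to treat a missing cluster as already cleaned up:

from google.api_core.exceptions import NotFound

def delete_cluster_if_present(runner, cluster_id):
    """Delete the Dataproc cluster, ignoring the case where it is already gone."""
    try:
        runner._delete_cluster(cluster_id)
    except NotFound:
        # the cluster was already torn down while we were still paging
        # through the logs, so there is nothing left to clean up
        pass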

erpic avatar Nov 23 '18 08:11 erpic

Thanks for letting me know. Will get in touch with folks at Google about the best way to handle this.

coyotemarin avatar Dec 06 '18 00:12 coyotemarin