mrjob
fetch bootstrap logs via SSH
Currently, we fetch bootstrap logs from S3 only. This means we have to wait an extra few minutes for the cluster to terminate before we can find out why bootstrapping failed.
We could get them via SSH as well, but we'd have to rework our SSH logic: rather than fetching from the master node plus possibly all slave nodes, we'd want to look up the relevant nodes' internal IP addresses with the ListInstances API call and then specify which nodes to SSH to.
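A rough sketch of the lookup step, assuming a boto3-style EMR client (mrjob's own EMR wrapper may expose this differently). The helper name internal_ips_by_instance_id() is hypothetical; only the list_instances paginator and the Ec2InstanceId / PrivateIpAddress response fields come from the ListInstances API itself.

```python
def internal_ips_by_instance_id(emr_client, cluster_id,
                                group_types=('MASTER', 'CORE', 'TASK')):
    """Map EC2 instance ID -> internal (private) IP for a cluster's nodes.

    Hypothetical helper, not part of mrjob. emr_client is assumed to
    behave like boto3.client('emr').
    """
    paginator = emr_client.get_paginator('list_instances')
    ips = {}
    for page in paginator.paginate(ClusterId=cluster_id,
                                   InstanceGroupTypes=list(group_types)):
        for instance in page.get('Instances', []):
            ip = instance.get('PrivateIpAddress')
            # instances that never finished provisioning may lack an IP
            if ip:
                ips[instance['Ec2InstanceId']] = ip
    return ips
```

With something like this, the SSH code could be handed an explicit list of internal IPs to connect to, and the instance ID would still be known for each host even when the log path itself doesn't include it.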
Also, we won't be able to get the node ID from the files' paths anymore (or even from the full SSH URI, which would contain the hostname). _match_emr_bootstrap_stderr_path() would have to be okay with paths without a node ID (e.g. /mnt/var/log/bootstrap-actions/1/stderr.gz). We may or may not need some hacky way of patching the node ID back in, depending on the rest of the code.
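To illustrate what "okay with paths without a node ID" could look like, here is a minimal sketch of a matcher where the node/<instance id>/ component is optional. This is not mrjob's actual regex or return format; the function name, the pattern, and the assumed S3 layout (.../node/<instance id>/bootstrap-actions/<n>/stderr.gz) are illustrative assumptions.

```python
import re

# Hypothetical pattern: the "node/<instance id>/" part is optional, so
# both S3-style and on-node paths match. The leading prefix is lazy so
# that a node ID, when present, is captured rather than swallowed.
_BOOTSTRAP_STDERR_RE = re.compile(
    r'^(?:.*?/)?'
    r'(?:node/(?P<node_id>i-[0-9a-f]+)/)?'
    r'bootstrap-actions/(?P<action_num>\d+)/stderr(?:\.gz)?$')


def match_bootstrap_stderr_path(path, node_id=None):
    """Return dict(action_num=..., node_id=...) or None if no match.

    node_id is a fallback for paths that don't contain one (e.g. when
    the caller already knows which node it SSHed into).
    """
    m = _BOOTSTRAP_STDERR_RE.match(path)
    if not m:
        return None
    return dict(
        action_num=int(m.group('action_num')),
        node_id=m.group('node_id') or node_id,
    )
```

Passing node_id in from the caller is one way of "patching the node ID back in": the SSH code knows which instance it connected to, so it can supply the ID the path no longer carries.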
This pairs well with #1345 and #1209.
I don't know too much about this part of mrjob, but is this actually still the case? I'm running into another issue where it appears to be trying to pull logs from a killed cluster via SSH and just hangs (ssh_tunnel: True). Seems like it should pull logs from S3 when a cluster terminates.
Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000028_0/syslog
2017-06-26 17:22:07,813 [INFO] mrjob.logs.mixin: Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000028_0/syslog
Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000027_0/syslog
2017-06-26 17:24:18,885 [INFO] mrjob.logs.mixin: Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000027_0/syslog
Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000026_0/syslog
2017-06-26 17:26:29,957 [INFO] mrjob.logs.mixin: Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000026_0/syslog
Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000025_0/syslog
2017-06-26 17:28:41,029 [INFO] mrjob.logs.mixin: Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000025_0/syslog
Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000024_0/syslog
2017-06-26 17:30:52,101 [INFO] mrjob.logs.mixin: Parsing task log: ssh://ec2-184-73-29-5.compute-1.amazonaws.com!10.182.164.109/mnt/var/log/hadoop/userlogs/job_201706260715_0021/attempt_201706260715_0021_m_000024_0/syslog
Probable cause of failure:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:76)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:263)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1140)
at org.apache.hadoop.mapred.Child.main(Child.java:257)
(from lines 9033-9046 of ssh://ec2-184-73-29-5.compute-1.amazonaws.com/mnt/var/log/hadoop/history/done/version-1/ip-10-183-139-22.ec2.internal_1498461344006_/2017/06/26/000000/job_201706260715_0021_1498476306197_hadoop_streamjob79879953537456863.jar)