
runner: some aws instances don't shut down and keep running forever

Open courentin opened this issue 2 years ago • 7 comments

When running cml runner, I've noticed a couple of times that some AWS instances were not properly terminated when jobs were finished, even though the GitHub runner was marked as offline.

I can't reproduce it easily; it seems to happen in a particular scenario that I haven't been able to isolate, and I don't have the bandwidth to investigate further for now.

However, as a broader issue, I'm wondering whether the instance termination is resilient enough to diverse failures. Given that cml is made for ML workflows, we expect users to run big/expensive instances. Thus, cml should ensure that no matter what happens (GitHub Actions being down, a cml bug, a particular edge case with cancelled workflows, etc.) instances are shut down at some point.

Before cml existed, we used to start instances manually and found it was easy to forget about them. To fix that, we developed an EC2 instance garbage collector that terminated instances whose CPU usage had been low for a certain amount of time. If that kind of thing could be integrated into cml, that would be awesome.
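
For illustration, a rough sketch of that kind of garbage collector with the AWS CLI (not our actual script; the tag filter, CPU threshold, and one-hour window are placeholders to adapt):

# Rough sketch only: terminate running cml-tagged instances whose average CPU
# stayed below a threshold over the last hour. The tag filter and threshold are
# placeholders; requires GNU date and the AWS CLI.
THRESHOLD=5   # percent average CPU
for id in $(aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" "Name=tag:CreatedBy,Values=cml" \
    --query "Reservations[].Instances[].InstanceId" --output text); do
  cpu=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$id" \
    --statistics Average --period 3600 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --query "Datapoints[0].Average" --output text)
  # Only terminate if a datapoint exists and it is below the threshold.
  if [ "$cpu" != "None" ] && awk -v c="$cpu" -v t="$THRESHOLD" 'BEGIN{exit !(c<t)}'; then
    echo "Terminating idle instance $id (avg CPU ${cpu}%)"
    aws ec2 terminate-instances --instance-ids "$id"
  fi
done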

courentin avatar Aug 31 '22 18:08 courentin

Thanks, @courentin. I have had two recent occurrences of cml not terminating properly internally as well, and I believe there may be an issue with how we are determining idleness. Additionally (depending on use), there may be instances with expired credentials that are therefore unable to delete themselves (e.g. when using OIDC).

  • Can you share the gist of the flags you are using with cml runner where the instances aren't terminating?
  • I have an internal action that checks AWS for still-present cml instances; with a bit of clean-up, we could probably share this as a composite action. Would something like this be helpful as a simple additional tool/action?

I often add this example line to help link instances back to their CI runs: --cloud-metadata="actions_link=https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
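
For reference, a minimal sketch of what such a check could look like with the AWS CLI (illustrative only, not the internal action itself; it assumes the --cloud-metadata values end up as instance tags):

# List running instances that carry the actions_link tag, so leftovers are easy to spot.
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" "Name=tag-key,Values=actions_link" \
  --query "Reservations[].Instances[].[InstanceId,LaunchTime]" \
  --output table
# Look up which workflow run a given instance belongs to (placeholder instance id).
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=i-0123456789abcdef0" "Name=key,Values=actions_link"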

dacbd avatar Aug 31 '22 19:08 dacbd

Here are the exact cml runner options that we use:

cml runner \
  --name=${{ steps.cml_name.outputs.id }} \
  --cloud=aws \
  --cloud-region=${{ matrix.aws_region }}b \
  --cloud-type=g4dn.xlarge \
  --labels=cml,${{ matrix.aws_region }},${{ steps.cml_name.outputs.id }} \
  --cloud-metadata="Project=speech-models" \
  --cloud-metadata="Service=ml-infra" \
  --cloud-metadata="CreatedBy=cml" \
  --reuse-idle \
  --idle-timeout=600

I have an internal action that checks AWS for still-present cml instances; with a bit of clean-up, we could probably share this as a composite action. Would something like this be helpful as a simple additional tool/action?

That would be awesome!

I often add this example line to help link instances back to their CI runs: --cloud-metadata="actions_link=https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"

Nice, will do the same!

courentin avatar Sep 01 '22 13:09 courentin

@courentin do you by chance have multiple runners created in parallel with the snippet you shared?

  • How many instances/runners are created?
  • What is the general format of ${{ steps.cml_name.outputs.id }} ?
  • Do you only have one stale instance from this type of invocation or multiple?

dacbd avatar Sep 05 '22 18:09 dacbd

@courentin do you by chance have multiple runners created in parallel with the snippet you shared?

Yes, we can if two PRs are open at the same time, but one PR should create one runner.

How many instances/runners are created?

1 per PR

What is the general format of ${{ steps.cml_name.outputs.id }} ?

The same as the default one; we only set it so that we can reference it in later steps.

Do you only have one stale instance from this type of invocation or multiple?

Good question. Since I'm not able to reproduce it now, I can't be 100% sure, but I'd say one.

courentin avatar Sep 07 '22 18:09 courentin

Just figured out I have 8 instances running while GitHub tells me they are offline. They all reached their storage limit and the jobs failed with the following error, so I suspect that is why they became unavailable to the GitHub runner:

System.IO.IOException: No space left on device : '/tmp/tmp.LR2kETHcwX/.cml/cml-by6mft0ge1/_diag/Worker_20220919-124453-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/tmp/tmp.LR2kETHcwX/.cml/cml-by6mft0ge1/_diag/Worker_20220919-124453-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/tmp/tmp.LR2kETHcwX/.cml/cml-by6mft0ge1/_diag/Worker_20220919-124453-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)

courentin avatar Sep 20 '22 09:09 courentin

I wonder if using systemd slices to prevent resource exhaustion as @dacbd suggested could help fix this issue. 🤔
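
A rough sketch of that idea (not something cml does today; the slice name and limits are illustrative): run the user workload in its own slice with resource caps, so a runaway job cannot starve the rest of the host, including the logic that terminates the instance.

# Illustrative only: cap the workload's memory and CPU via a dedicated slice
# so the runner and the self-destruction logic keep some headroom.
sudo systemd-run --slice=cml-workload.slice \
  --property=MemoryMax=90% \
  --property=CPUQuota=90% \
  --wait --collect \
  ./train.sh   # placeholder for the actual job command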

0x2b3bfa0 avatar Sep 20 '22 12:09 0x2b3bfa0

Indeed. From my bits of experimenting, I think the easiest way to handle this on our end would be to create a 500 MB file during provisioning and mount it as a disk for the Terraform directory. The terraform destroy fails because it can't create a new file; having a premade file mounted as a block device for that directory should allow the destroy to go through regardless of the user filling the disk.
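
Something along these lines, roughly (illustrative paths and size, not current cml provisioning code):

# Run as root at provisioning time, before any user workload starts:
# pre-allocate a small image and loop-mount it over the directory that will
# hold the Terraform state, so `terraform destroy` can still write there even
# when the user has filled the main disk.
fallocate -l 500M /opt/cml-tf-reserved.img
mkfs.ext4 -F -q /opt/cml-tf-reserved.img
mkdir -p /opt/cml/terraform
mount -o loop /opt/cml-tf-reserved.img /opt/cml/terraform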

@courentin your input here would be helpful to increase visibility (we are trying to get GitHub to improve some messaging): https://github.com/orgs/community/discussions/30440

dacbd avatar Sep 20 '22 20:09 dacbd

@dacbd Unfortunately I never encountered the message "The self-hosted runner"

courentin avatar Sep 22 '22 15:09 courentin

Sorry for the belated reply, @courentin.

Although there isn't an easy way of making sure that no instance is left running after a crash, we'll try to fix this particular edge case, where instances run out of disk space.

In the meantime, and once you've identified your storage requirements, it might be a good idea to explicitly set cml runner --cloud-hdd-size to a more generous value.
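
For example (the region, instance type, and the 100 GB value below are placeholders; --cloud-hdd-size is the relevant flag):

cml runner \
  --cloud=aws \
  --cloud-region=us-west-2 \
  --cloud-type=g4dn.xlarge \
  --cloud-hdd-size=100 \
  --idle-timeout=600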

0x2b3bfa0 avatar Sep 30 '22 14:09 0x2b3bfa0

Hello! I'm not using cml often, but the few times our team has used it, it left some costly EC2 instances running forever. And since I'm now monitoring the storage space, I can confirm that it's not the cause of the issue.

I totally understand this is not an easy issue, but I still think cml should be more resilient to whatever happens in the instance.

I'd be super happy to provide any logs when an instance keeps running but I don't really know which ones.

[EDIT]: It seems like we reached the RAM limit; that's why the instance became unavailable. Is there a way to make sure the instance still shuts down in this case?

courentin avatar Jan 05 '23 22:01 courentin

We are currently working on an error case that is preventing runners from properly terminating. Any logs and context you can provide will help us determine what is going on.

dacbd avatar Jan 06 '23 00:01 dacbd

@courentin for your edit, there isn't much we can do about resource starvation like that.

How well this gets handled primarily depends on how nicely whatever you are running reacts to running out of memory. From my experience, many of these ML frameworks can crash rather violently when they run out of memory. You could try executing the training script manually with docker run --rm -it -v $PWD:/opt/your_project container_name; I have used this to handle memory crashes in a nicer manner within GitHub Actions workflows (this can still crash in an unrecoverable state, though).
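
For example, a variant of that docker invocation with an explicit memory cap (the --memory values and the train.py entry point are placeholders; leave headroom for the runner and OS), so the container gets OOM-killed instead of taking the whole host down:

docker run --rm -it \
  --memory=12g --memory-swap=12g \
  -v "$PWD":/opt/your_project \
  -w /opt/your_project \
  container_name python train.py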

The best answer is to do some profiling on your training and give yourself maybe 30% headroom for memory.

dacbd avatar Jan 06 '23 00:01 dacbd