Remote debugging on kubernetes
Hi there
I have an idea and some lines of code, which I would like to propose/discuss with you @saikonen
Identified shortcoming
At the moment metaflow cannot easily be used for remote debugging, by which I mean debugging on the pod/container within the Kubernetes cluster. Instead we have to debug locally, which has the obvious disadvantage that hardware constraints can kick in: if we resume a failed step and load the to-be-debugged step with input data from e.g. S3, we need sufficient memory on the client PC (assuming the input data fits into client memory at all). As another example, if the to-be-debugged step requires GPU hardware (or any other scarce resource), we implicitly need those resources on the client PC, where they are not necessarily available.
At the moment I do not see a good way to overcome this problem with the functionality already built in.
Idea
If we could leverage debugpy and VS Code's remote-debugging capabilities, we could attach our local debug session to a remote debug server, as long as we make the appropriate port-forwarding etc. possible.
PoC
For testing purposes I used `@conda_base(libraries={'debugpy': '1.6.7'})`.
In `kubernetes_job.py` at line 150 I added the following to the `V1Container` specification:

```python
ports=[client.V1ContainerPort(container_port=5678, host_ip="0.0.0.0", host_port=5678)],
```
This maps the default debugpy
server ports from container to host. Of course this could be either mapped with default values or custom CLI defined or anything else that would be suitable
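For illustration, the mapping above can be sketched like this. It is shown as a plain dict that mirrors the fields of the kubernetes client's `V1ContainerPort`, so the snippet runs without the kubernetes package installed:

```python
# Sketch only: these keys mirror kubernetes.client.V1ContainerPort.
# In kubernetes_job.py the real object is passed via ports=[...] on the
# V1Container.
DEBUGPY_PORT = 5678  # debugpy's default listen port

debug_port_spec = {
    "container_port": DEBUGPY_PORT,  # port debugpy listens on in the pod
    "host_ip": "0.0.0.0",            # bind on all node interfaces
    "host_port": DEBUGPY_PORT,       # same port exposed on the node
}
```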
In `kubernetes_cli.py` at line 180, where the `step_cli` variable is created, I changed the entrypoint to:

```python
entrypoint="%s -m debugpy --listen 0.0.0.0:5678 --wait-for-client %s" % (executable, os.path.basename(sys.argv[0])),
```
This specifically starts the debugpy session and lets it wait until we attach to the process, ideally with breakpoints already loaded :)
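The resulting entrypoint string can be sketched in isolation like this; `executable` stands in for whatever interpreter Metaflow resolves itself:

```python
import os
import sys

# Sketch of the modified entrypoint: wrap the normal step command in
# debugpy so the container blocks until a debug client attaches.
executable = sys.executable  # assumption: Metaflow supplies this itself

entrypoint = "%s -m debugpy --listen 0.0.0.0:5678 --wait-for-client %s" % (
    executable,
    os.path.basename(sys.argv[0]),
)
```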
Last, using a custom debug launch.json in VS Code:

```json
"configurations": [
    {
        "name": "Python: Remote Attach",
        "type": "python",
        "request": "attach",
        "connect": {
            "host": "<node ip>",
            "port": 5678
        },
        "pathMappings": [
            {
                "localRoot": "<local path of your flow>",
                "remoteRoot": "/metaflow"
            }
        ],
        "justMyCode": true
    }
]
```
Potential future
It would be cool to have the option to start a metaflow `resume` with debugging on Kubernetes, e.g. `resume debug --with kubernetes` or similar.
This debug option could automatically install debugpy in the target container and initiate the step using debugpy. Port mapping might be a bit tricky, particularly when debugging a foreach step. I am currently not sure what would be optimal here. Any ideas?
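One naive scheme for the foreach case would be to derive a distinct port per split. This is a hypothetical sketch, not existing Metaflow API; `split_index` and `base_port` are assumed names:

```python
# Hypothetical helper: give each foreach split its own debug port so
# parallel tasks scheduled on the same node do not collide.
def debug_port_for_split(split_index, base_port=5678):
    return base_port + split_index

# e.g. three foreach splits would listen on 5678, 5679 and 5680
ports = [debug_port_for_split(i) for i in range(3)]
```

An open question with such a scheme is how the client learns which port belongs to which split.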
Drawbacks of the above outlined solution
- It is only tested on Kubernetes, and I have no idea whether this would also be an option for AWS Step Functions, Batch, GCP and all the others.
- debugpy is for VS Code and I only tested it for this specific setup. Other IDEs like PyCharm etc. should work similarly, but I do not know and have not tested that at all.
It's a rough sketch, but I hope I could describe the idea properly. Would you mind considering it for your upcoming work? From my POV it would help many metaflow users a lot.
Happy to discuss.
Just realized that this issue is tangentially related to #739
Hi @saikonen,
by any chance do you have an opinion on the idea?
Hi @saikonen
I just swung by to check whether you have had the chance to look at the topic described above. Looking forward to your response :)
All the best
Sorry for the long delay, finally back from my holidays :) Some immediate questions that came to mind
- on the security considerations with opening the node for direct access to the debugger session. Have you considered alternative ways of accessing the debugger running in the pod if direct access is restricted?
- Same concerns apply to other platforms as well, as direct access to the environment running tasks is by no means guaranteed. Of course the first draft could be limited to Kubernetes only.
- compared to implementing a custom debug decorator as described in the related issue, what are the biggest benefits of debugpy?
Had a quick test with your instructions, but I ran into a wall regarding direct node-access being blocked for our Kube cluster. I see the usage with the vscode plugin (or the CLI) being quite tedious though, requiring determining a node-IP, fiddling with the configuration, and attaching to the debugger. To my understanding debugging multiple steps of a flow would require going through the whole process for each step, as there is no guarantee that they run on the same node.
If you have some specific use cases in mind already for a debugger then these could be a good starting point for fleshing out what a debugging feature would look like feature-wise. I would like to try and outline the problems we're trying to tackle with a debugger before starting any implementations, but if you want to move forward with a PoC that is fine as well.
The recently announced Metaflow Office Hours meeting could also be a good opportunity to demo/have a discussion about the feature if you're interested and can attend. Details at https://outerbounds-community.slack.com/archives/C01TTBG855K/p1691533065990379
Sorry for my delayed response. I have quite some busy days behind me, and the next couple of weeks are not looking any different, but I at least wanted to answer a couple of your questions.
First of all: thanks for getting back to this ticket; I hope you had a nice and relaxing vacation. Second: I have not yet tested any other way, but I fully understand your concerns here. I am pretty sure it should be possible to use a slightly different, much more security-friendly approach. Unfortunately I currently do not have the time to look into different possibilities, but I will certainly do so once I have more time on my hands. Third: due to my business use case I can only test on an on-prem Kubernetes cluster. All other platforms are completely unknown to me, and to be honest I only have very limited knowledge about them as well. Fourth: debugpy works seamlessly with VS Code, but e.g. PyCharm favors a different debugger, and I am certain there are many more. I tested with debugpy because I use VS Code and it has remote-debugging capabilities (though I believe I read that the PyCharm debugger has this feature as well).
I fully support your approach of first understanding the problem and designing a solution rather than jumping to conclusions. I am uncertain how I could support you with that. Is there anything needed from your side where I could bring in my two cents?
Hi @saikonen
just one minor update with respect to your question "Have you considered alternative ways of accessing the debugger running in the pod if direct access is restricted?"
I have tested the following:
- Added `entrypoint="%s -m debugpy --listen 0.0.0.0:5678 --wait-for-client %s" % (executable, os.path.basename(sys.argv[0]))` in `kubernetes_cli.py`
- Added `ports=[client.V1ContainerPort(container_port=5678)],` in `kubernetes_job.py`
- Added a service on Kubernetes mapping the internal port to an external one
- Connected VS Code to the node IP and the external port
The service.yaml looks like this:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: debug-hostname-service
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: metaflow-task
  ports:
    - nodePort: 30163
      port: 5678        # required by the Service spec
      targetPort: 5678  # debugpy's port inside the pod
```
Using dedicated "debug" labels (instead of the current selector), one can ensure that the right pods are exposed via the stable service nodePort. Would this be more suitable from your POV?
Addendum: the approach shown above works best if the Kubernetes labels can be applied. Has this feature been reverted? At the moment I cannot find it in the code base anymore.