
Make torch inductor (ahead of time compiled) .so file debuggable (with gdb/lldb)

Open bsergean opened this issue 1 year ago • 2 comments

🚀 The feature, motivation and pitch

Right now we are seeing some odd behavior, and we'd like to step into the .so file when it is invoked from a C++ program, but since it is compiled without debug symbols we cannot.

Alternatives

Hack pytorch to add the -g option somewhere (I have to say I haven't tried).

Additional context

We're using pytorch 2.2.0.

cc @ezyang @msaroufim @bdhirsh @anijain2305 @zou3519

bsergean avatar Feb 23 '24 03:02 bsergean

Adding triage review because this is a pretty interesting request, maybe not even that hard, and I'm wondering why our prod folks have never asked for this

ezyang avatar Feb 24 '24 22:02 ezyang

Probably once the model has been 'launched' in production you can't always easily attach a debugger to it, as halting things, say for a realtime inference system, isn't really an option.

But right now we are debugging, on our local machines, why dynamic batching doesn't work the way we want, and I thought it would just be easier to step into the guts of the produced .so with a debugger.

A while ago we tried removing the .cubin files on CUDA (we had no clue yet), attached a debugger, caught all exceptions, and caught a filesystem exception (when a kernel cannot be launched). This was helpful, and maybe it could be in other contexts too. Back then we noticed that no symbols were left, so we couldn't see where in the .so we were (this was with a 2.1 dev build).

bsergean avatar Feb 24 '24 23:02 bsergean

cc: @desertfire

Chillee avatar Feb 27 '24 19:02 Chillee

We do have an environment variable for that, AOT_INDUCTOR_DEBUG_COMPILE=1. https://github.com/pytorch/pytorch/blob/65efece3a4acf23fd3c38f5217c545cd989f9cda/torch/_inductor/config.py#L605
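
A minimal sketch of how that flag might be used, assuming the linked config line reads the variable when torch is imported, so it has to be in the environment before the export/compile step starts. The helper names and the export script path are hypothetical; only the variable name comes from the comment above.

```python
import os
import subprocess
import sys


def debug_compile_env(base_env=None):
    """Return a copy of the environment with AOT_INDUCTOR_DEBUG_COMPILE=1 set.

    The caller's mapping (or os.environ) is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["AOT_INDUCTOR_DEBUG_COMPILE"] = "1"
    return env


def run_export_script(script_path):
    """Run an export script (hypothetical path) in a child process.

    Using a child process means the variable is already set when that
    process imports torch, which this sketch assumes is when it is read.
    """
    return subprocess.run(
        [sys.executable, script_path],
        env=debug_compile_env(),
        check=True,
    )
```

Something like `run_export_script("export_model.py")` would then rebuild the .so with the flag on. Whether symbols actually ended up in the result can be checked with standard tools, for example by looking for `.debug_*` sections in the output of `readelf -S` on the produced file, before loading it under gdb or lldb.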

desertfire avatar Feb 27 '24 19:02 desertfire

Thanks !

bsergean avatar Feb 29 '24 22:02 bsergean

For context, we found the problem in our code: we were not passing the correct CUDA stream to the AOT run method.

bsergean avatar Feb 29 '24 22:02 bsergean