Add structured logging for tensor fakeification

Open ezyang opened this issue 1 year ago • 4 comments

Stack from ghstack (oldest at bottom):

  • -> #126879

This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs when they are triggered from Dynamo. The logs look like this:

V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}

The describer_id is used to disambiguate ids. We expect it to be unique per frame id, but a bug could violate this. Note that you will get redundant dumps when evaluation restarts.

tlparse can use this to visualize the input tensors to a model; you could also use this to generate example inputs to run graphs on.
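As a rough illustration of consuming these records, here is a minimal sketch that parses the sample lines above and indexes tensor and source descriptions by (describer_id, id). It assumes the JSON payload follows the first "] " in each line; the helper names parse_structured_line and collect_descriptions are hypothetical and are not part of PyTorch or tlparse.

```python
import json

# The three sample log lines from the description above, copied verbatim.
SAMPLE_LOG = """\
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
"""


def parse_structured_line(line):
    """Return the JSON payload of one log line as a dict, or None if absent.

    Assumption: the payload starts right after the first "] ", which ends
    the "file.py:lineno]" prefix in the sample lines.
    """
    _, sep, payload = line.partition("] ")
    if not sep or not payload.lstrip().startswith("{"):
        return None
    return json.loads(payload)


def collect_descriptions(lines):
    """Index tensor descriptions and source names by (describer_id, id).

    Redundant dumps (for example after a restart) that reuse the same key
    simply overwrite the earlier record.
    """
    tensors, sources = {}, {}
    for line in lines:
        record = parse_structured_line(line)
        if record is None:
            continue
        if "describe_tensor" in record:
            desc = record["describe_tensor"]
            tensors[(desc["describer_id"], desc["id"])] = desc
        elif "describe_source" in record:
            src = record["describe_source"]
            sources[(src["describer_id"], src["id"])] = src["source"]
    return tensors, sources


tensors, sources = collect_descriptions(SAMPLE_LOG.splitlines())
for key, desc in tensors.items():
    # size, stride, and dtype are the fields one would need to allocate
    # a matching example input.
    print(sources.get(key), desc["dtype"], desc["size"], desc["stride"])
```

From here, the size, stride, and dtype fields would be the natural inputs for allocating a stand-in tensor, though recovering views and storage sharing would need the describe_storage records as well.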

Some care is taken to avoid dumping the tensor metadata multiple times, which would ordinarily happen because AOTAutograd refakifies everything after Dynamo to deal with metadata mutation.
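To make the idea concrete, here is a minimal sketch of a dump-once memo. It is not the implementation in this PR: DumpOnceMemo and FakeDesc are hypothetical names. The memo remembers only small (describer_id, id) keys rather than the description objects, so it never keeps the described object alive.

```python
import gc
import weakref


class DumpOnceMemo:
    """Emit a payload at most once per (describer_id, id) key.

    Only the integer key is remembered, never the description object, so
    the memo cannot extend the lifetime of whatever was described.
    """

    def __init__(self):
        self._seen = set()
        self.emitted = []  # stand-in for the structured log sink

    def dump(self, describer_id, desc_id, make_payload):
        key = (describer_id, desc_id)
        if key in self._seen:
            return False  # already dumped; skip the redundant record
        self._seen.add(key)
        # make_payload is called lazily and must return plain data only.
        self.emitted.append(make_payload())
        return True


class FakeDesc:
    """Hypothetical stand-in for a description that references a real tensor."""

    def __init__(self, size):
        self.size = size


memo = DumpOnceMemo()
desc = FakeDesc([8])
desc_ref = weakref.ref(desc)

first = memo.dump(0, 0, lambda: {"id": 0, "size": list(desc.size)})
second = memo.dump(0, 0, lambda: {"id": 0, "size": list(desc.size)})

del desc
gc.collect()
print(first, second, len(memo.emitted), desc_ref() is None)
```

The second dump call is skipped, and once the description is deleted nothing in the memo keeps it reachable.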

Partially fixes https://github.com/pytorch/pytorch/issues/126644

Signed-off-by: Edward Z. Yang [email protected]

ezyang avatar May 22 '24 15:05 ezyang

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/126879

Note: Links to docs will display an error until the docs builds have been completed.

:heavy_exclamation_mark: 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

:x: 2 New Failures, 2 Unrelated Failures

As of commit cfeec3e1a3ff54193056ee1befb65717667d886f with merge base 0910429d7262daf67dc3aa1d4e4aa939752ae675 (image):

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 22 '24 15:05 pytorch-bot[bot]

cc @bhack

ezyang avatar May 22 '24 15:05 ezyang

Thanks. Is this enough to isolate a failing compiled function in a minimal repro format? For example, consider the stack trace in a recent report, https://github.com/pytorch/pytorch/issues/126614#issuecomment-2122567229: that failure may or may not have been triggered by compilation of a decorated parent def. The simplest case is a directly decorated function, as in https://github.com/pytorch/pytorch/issues/121504#issue-2176370853.

As you can see, the compiled forward is def forward(self, q, k, v):.

So if we really want to create a minimal repro in Python and isolate that function from the rest of the code, I suppose we need a way to create fake q, k, v tensors, and also to serialize something from the class for self.

Or do you think there could be another way to quickly supply minimal repros in compiler tickets?

bhack avatar May 22 '24 16:05 bhack

If we are really not interested in the Python source code at all when reporting compilation errors (I am not sure about this point), we could probably just prompt the user to copy or save the specific Inductor-generated code that failed.

In the case mentioned above, for example, the compiled code already knows that it failed at:

 File "/tmp/torchinductor_root/ut/cutmbnzthsr64p23ilpnn2ym54twqj4lwpqj5v3shylgqucshcur.py", line 660

So we could just suggest that the user attach that file to the ticket, along with this tensor fakeification info.

bhack avatar May 22 '24 16:05 bhack

I fixed the memo problem: I can't actually hold on to MetaTensorDesc as it will keep real tensors live lol

ezyang avatar May 28 '24 02:05 ezyang

CI is green here

ezyang avatar May 29 '24 18:05 ezyang

@pytorchbot merge -i

ezyang avatar May 31 '24 01:05 ezyang