nix icon indicating copy to clipboard operation
nix copied to clipboard

Add context to better locate runtime coercions

Open layus opened this issue 2 years ago • 21 comments

Add tons of helpful error messages in locations with poor error reporting. This includes mostly type errors. As these are runtime errors, the precise location of the faulty value is usually lost at that time. This translates textually the type expectations (requirements ?) hard-coded in the sources.

TODOs:

  • [X] Also annotate coerceToString
  • [X] Find and annotate remaining functions bearing expectations on the values. (I guess this is good enough)
  • [X] Wrap some function calls with debug info

Fixes #6200 #6191

layus avatar Mar 04 '22 04:03 layus

Have fun by starting with ./outputs/out/bin/nix-instantiate --eval --expr 'builtins.genericClosure false' and ending with ./outputs/out/bin/nix-instantiate --eval --expr 'builtins.sort (a: b: a.key < b.key) (builtins.genericClosure {startSet = [{key = 1;}]; operator = x: if x.key < 5 then [{key = x.key + 1;} {key = x.key * 2;}] else [];})' by trying to trigger all the possible type errors in between :-)

layus avatar Mar 04 '22 04:03 layus

image

:tada:

layus avatar Mar 05 '22 21:03 layus

More examples: https://maudoux.be/changes.html

layus avatar Mar 05 '22 22:03 layus

Well, it has some impact on performance, but that remains under control:

This PR:
nixos.smallContainer.time     0.762153  0.975931  0.736349  0.775949
nixos.kde.time                1.44574   1.6254    1.48982   1.59174
nixos.lapp.time               1.1987    1.3634    1.19679   1.30787
nix-env.qa.time               6.18346   6.71418   6.32466   6.43977
nix-env.qaDrv.time            44.1625   50.7246   53.6632   46.6902
nix-env.qaAggressive.time     6.31606   6.67477   9.45704   6.60786
nix-env.qaDrvAggressive.time  44.3691   48.075    60.2891   45.648

Baseline:
nixos.smallContainer.time     0.71367  0.705955  0.71822  0.715239
nixos.kde.time                1.37252  1.40253   1.47962  1.38685
nixos.lapp.time               1.12598  1.14188   1.22673  1.12473
nix-env.qa.time               6.12947  6.16216   6.56867  6.22066
nix-env.qaDrv.time            42.1374  42.1028   54.5509  42.0571
nix-env.qaAggressive.time     6.79885  6.00412   9.02374  6.36788
nix-env.qaDrvAggressive.time  45.5595  42.1364   55.0709  42.3918

output

layus avatar Mar 07 '22 21:03 layus

Do you want to update TODOs, mark this draft, or save things for future PRs?

Ericson2314 avatar Mar 07 '22 22:03 Ericson2314

Did you consider using the existing call stack stuff?

Relatedly, it would be nice to use the positions of the subterm that is trying to be casted. But I see that was discussed at https://github.com/NixOS/nix/issues/6200

I would not be opposed to merging this as-is, and circling back to these issues, but others may disagree.

Ericson2314 avatar Mar 07 '22 22:03 Ericson2314

Do you want to update TODOs, mark this draft, or save things for future PRs?

I guess this is as good as I can achieve in reasonable time.

Did you consider using the existing call stack stuff?

Well, first I thought it would not work. Now I am not too sure anymore. It should be feasible, but it requires shifting around several pos and rework how we handle them. It is a bit more involved, and the try/catch syntax adds lots of clutter. Maybe recent C++ lambdas may alleviate this issue. Apparently the try{}catch may have zero cost when not used. That would definitely be the right way forward if it is the case. So, yes, I think it could be superior if well used. It would certainly avoid mismatch between error messages and positions, as we have now.

That being said, the current version has the advantage that it requires any user of these functions to provide an explanation string, which ensures that we do not inadvertently omit them.

Relatedly, it would be nice to use the positions of the subterm that is trying to be casted. But I see that was discussed at #6200

It is extremely complex in nix, as often there is no good position for values. We do keep some positions for the keys of attribute sets, and sometimes for functions, when feasible.

For example

let
  list = [ 1 "a" {attr="value";} ];
  fun = builtins.map (x: x+1);
  afterList = builtins.seq list;
  getThird = l: builtins.elemAt l 2;
in
  afterList (getThird (fun list ++ [3]))

fails with

   1 error: cannot coerce a set to a string
   2 
   3        at /home/layus/projects/nix/complex.nix:4:26:
   4 
   5             3|   list = [ 1 "a" {attr="value";} ];
   6             4|   fun = builtins.map (x: x+1);
   7              |                          ^
   8             5|   afterList = builtins.seq list;
   9 
  10        … while evaluating anonymous lambda
  11 
  12        at /home/layus/projects/nix/complex.nix:4:23:
  13 
  14             3|   list = [ 1 "a" {attr="value";} ];
  15             4|   fun = builtins.map (x: x+1);
  16              |                       ^
  17             5|   afterList = builtins.seq list;
  18 
  19        … from call site
  20 
  21        … while evaluating 'getThird'
  22 
  23        at /home/layus/projects/nix/complex.nix:6:14:
  24 
  25             5|   afterList = builtins.seq list;
  26             6|   getThird = l: builtins.elemAt l 2;
  27              |              ^
  28             7| in
  29 
  30        … from call site
  31 
  32        at /home/layus/projects/nix/complex.nix:8:14:
  33 
  34             7| in
  35             8|   afterList (getThird (fun list ++ [3]))
  36              |              ^
  37             9|

And nowhere is there any mention of the list itself, nor the value that failed to be a proper int. And that would be difficult. Would you point list ++ [3], list = [ 1 "a" {attr="value";} ], [ 1 "a" {attr="value";} ], {attr="value";}, or something else ?

:bulb: One thing we could do is print the fautly values to help users get a ffeling of what happens.

I would not be opposed to merging this as-is, and circling back to these issues, but others may disagree.

I think it is ready to merge. Better merge early and improve often :-).

layus avatar Mar 08 '22 10:03 layus

So, yes, I think it could be superior if well used. It would certainly avoid mismatch between error messages and positions, as we have now.

That being said, the current version has the advantage that it requires any user of these functions to provide an explanation string, which ensures that we do not inadvertently omit them.

Yeah, I think the cast functions should take the argument, and do the try-catch themselves on behalf of the caller. That still ensures one doesn't forget to provide a reason for the cast!

Ericson2314 avatar Mar 09 '22 03:03 Ericson2314

Very nice looking!

Ericson2314 avatar Mar 18 '22 00:03 Ericson2314

Except for painful conflicts... I guess this will have to wait the hackathon tomorrow.

layus avatar Mar 18 '22 00:03 layus

@Ericson2314 Not sure I will be able to join the hacking session, but if you could find some time to review the changes, that would be much appreciated. I am quite happy with the result, but I lack real tests, and there are so many changes that typos can always slip in.

layus avatar Mar 18 '22 09:03 layus

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-26/18252/1

nixos-discourse avatar Mar 18 '22 14:03 nixos-discourse

@Ericson2314 https://github.com/NixOS/nix/pull/6204/commits/963b8aa39b169bf5c054449ddce39d60faacf298 just for you :-). I hope you like it.

layus avatar Mar 18 '22 22:03 layus

would definitely be interesting to have another look at the performance impact now. not having gone over it again completely this doesn't look like it should cause performance regressions, so if it does there may still be room for optimizations.

pennae avatar Mar 31 '22 21:03 pennae

Suggestions from NixUX:

  • '--show-tracecan become overwhelming, even more than today. Perhaps respect-v` verbosity flags?

  • New messages can be a less severe level in order to limit the amount displayed by default.

  • consider a new PR or doing it in the current PR

  • It is hard to keep error message testing up to date.

  • performance review @pennae?

tomberek avatar Apr 07 '22 19:04 tomberek

we've run this PR through our perf benchmarks, and while we do see an increase in instructions executed the real-world impact on walltime is in the noise. turning the string_view refs into plain string_views brings measurements even closer together.

performance-wise this looks pretty good. :)

pennae avatar Apr 07 '22 21:04 pennae

From NixUX: Thanks @pennae for the perf review. This still requires a way to control and limit the severity of the trace.

tomberek avatar Apr 14 '22 19:04 tomberek

Suggestions from NixUX:

* '--show-trace`can become overwhelming, even more than today. Perhaps respect`-v` verbosity flags?

I am not sure this is true. Do we have examples ? The full trace is even bigger, but now it should be readable, with errors pointing at precise locations. See https://gist.github.com/layus/0c4203d41a67326367caf2086f20a2ab#file-basic-diff for an example where the added information allows to understand the error location.

https://gist.github.com/layus/0c4203d41a67326367caf2086f20a2ab#file-poetry-diff shows an example where the trace is made way bigger, and potentially unnecessarily verbose.

I still do not see what this PR should do about it. We had a situation where plain misleading messages were printed. Not it gets easier for newcomers to understand the errors and the required types they need in simple cases. Complex cases become more verbose, but also more easy to understand and to follow, as the increment between the trace steps is smaller. Less wild jumps in the code between two seemingly unrelated parts.

And by default you get the first parts of the trace without needing --show-trace. Meaning that if you really, really want to see the full trace, you should use more or less (pun intended).

I would argue that we should prefer correctness and precision over terseness. This PR cannot reasonably bundle a solution to the very complex issue of printing a meaningful trace in a lazy language.

* New messages can be a less severe level in order to limit the amount displayed by default.

I would argue that the old messages should be less chatty ;-). Just kidding, but as above, the code could benefit from a redesign of the traces to show another projection of the full trace. Maybe expressions trace until the first function scope, and then only the function calls stack ? That is way out of context for this PR I think.

* consider a new PR or doing it in the current PR

:+1: for new PR.

* It is hard to keep error message testing up to date.

It depends on what you want to test. This PR made it sure that more error cases are documented via a mandatory text explanation. Maybe that is the "testing" we need. Any other testing is going to be brittle with respect to format changes.

* performance review @pennae?

Thanks @pennae !

layus avatar Apr 28 '22 10:04 layus

I will merge Eelco's comments, and rebase onto master. Wait for it :-).

layus avatar Apr 28 '22 10:04 layus

That is it for today. I think it should work now @bburdette :tada: :yum:

See https://gist.github.com/layus/e10b4fab9938e9a1777530fe296f0623 for an example on how it behaves with a nixos module error.

layus avatar Apr 28 '22 22:04 layus

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-35/21701/1

nixos-discourse avatar Sep 15 '22 09:09 nixos-discourse

Triaged in Nix team meeting on 2022-11-25:

  • @thufschmitt: decided some time ago that we want to have it. @layus fixed merge conflicts, but a trivial one is still left
  • @edolstra: would this cause a major merge conflict with lazy trees?
    • @thufschmitt: have some some thoughts on that, but should discuss those in a different context

-> assigned to @edolstra

fricklerhandwerk avatar Nov 30 '22 23:11 fricklerhandwerk

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2022-11-25-nix-team-meeting-minutes-11/23601/1

nixos-discourse avatar Dec 01 '22 00:12 nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2022-12-09-nix-team-meeting-minutes-15/23951/1

nixos-discourse avatar Dec 12 '22 10:12 nixos-discourse

@layus Merged, thanks!

edolstra avatar Jan 02 '23 21:01 edolstra

Unfortunately this PR has caused a regression in 2.13.0.

Symptoms include infinite recursions and some sort of regression in the module system. I've tried to find the root cause by

  • bisecting, but the git history of this change isn't conducive to it
  • reading through the change for two potential causes
    • accidentally forcing more evaluation than necessary
    • memory corruption

I haven't managed to find it, so for the time being, I've reverted this change. Any help would be much appreciated, as we'd like to have this change back, both because it is a very welcome change and because other PRs would not to have conflicts resolved if we don't manage to merge this soon.

roberth avatar Jan 18 '23 11:01 roberth

Do we have any test cases for these issues? I think a PR with the "revert of the revert" + some failing tests would be a good starting point, and then hopefully @layus or another person can pick up the baton from there.

Ericson2314 avatar Jan 18 '23 23:01 Ericson2314

Hi, having a look right now. There are plenty of potential fixes. Finding failing tests would be tremendously useful indeed. Any help welcome. For now I am focusing on reproducing the nixpkgs-lib-tests failures.

layus avatar Jan 19 '23 08:01 layus

@layus so far I have not seen a behavioral regression, only a change (and improvement in the error message) in the error message output. I added a reproducer to more quickly try out Nix versions: https://github.com/NixOS/nix/pull/7621#issuecomment-1387718654

tomberek avatar Jan 19 '23 08:01 tomberek

So first there is indeed a change where I introduced "frame" traces (pardon the naming here). Think of them as more important trace points, like function calls and builtins.derivation calls. This explains why the addErrorTrace thing is invisible: it happens after a function call in the trace. I have confirmed that removing this notion of frame traces fixes the nixpkgs tests errors. There are two solutions

  1. disable frame traces
  2. make addErrorTrace a frame trace.

I think 2 is better because we keep the terseness of the output, and it even makes sense, as a message that a human took time to write for debugging purposes deserves to be printed in the end :-). Working on it.

layus avatar Jan 19 '23 08:01 layus