distel icon indicating copy to clipboard operation
distel copied to clipboard

got a corrupted external term

Open zwizwa opened this issue 4 years ago • 25 comments

Hi,

I keep running into this error and I'm a bit lost regarding how to start debugging it. I'm not quite sure how it happens, but in very vague terms it seems to happen when I send too much logging data from Erlang to Emacs. Once this error triggers, the distel communication gets into a bad state, and the only way I've found to fix it is to restart emacs.

=WARNING REPORT==== 12-Apr-2020::10:37:00 === '[email protected]' got a corrupted external term from 'distel_27211@panda' on distribution channel 28280 <<...,131,104,4,97,6,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,188,139,0,0,0,0,0,100,0,0,100,0,3,114,101,120,131,104,2,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,188,139,0,0,0,0,0,104,5,100,0,4,99,97,108,108,100,0,4,99,111,100,101,100,0,13,101,110,115,117,114,101,95,108,111,97,100,101,100,108,0,0,0,1,100,0,6,100,105,115,116,101,108,106,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,0,1,0,0,0,0,0>> ATOM_CACHE_REF translations: none

zwizwa avatar Apr 12 '20 14:04 zwizwa

Taking that apart gives a term

Term = << 131, %% ETF 104,4, %% SMALL_TUPLE_EXT
97,6,%% SMALL_INTEGER_EXT 103, %% PID_EXT 100, %% ATOM_EXT 0,18, 100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97, %% Node: "distel_27211@panda" 0,0,188,139, %% ID 0,0,0,0, %% Serial 0, %% Creation 100, %% ATOM_EXT 0,0, 100, %% ATOM_EXT 0,3, 114,101,120>>.

followed by another term 131, ....

binary_to_term(Term). gives exception error: bad argument, as does binary_to_term(Pid) below, so it seems to be the PID_EXT.

Pid = << 131, 103, %% PID_EXT 100, %% ATOM_EXT 0,18, 100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97, %% Node: "distel_27211@panda" 0,0,188,139, %% ID 0,0,0,0, %% Serial 0>>. %% Creation

The format seems to be ok, so what causes the term parse error?

zwizwa avatar Apr 18 '20 13:04 zwizwa

Based on https://github.com/massemanet/distel/issues/68#issuecomment-615872930:

the ID of the PID is 48267 which is 1011110010001011. ETF says ID "Only 15 bits are significant; the rest are to be 0." Looks like the 16th bit may be causing trouble.

leoliu avatar Apr 19 '20 01:04 leoliu

If I change that Term to something that doesn't have bit 15 set then indeed it does parse with binary_to_term/1. Am I spawning too many processes? I do have a lot of transient processes in the logger I mentioned.

zwizwa avatar Apr 19 '20 21:04 zwizwa

It’s not your fault. I will see if I can send in a fix today. There is likely a bug in distel that does not handle ID wraparound correctly.

Could you try the PR and see if it has any ill effect. Thanks.

leoliu avatar Apr 20 '20 01:04 leoliu

Thanks. I've switched to f9c5067 as my daily driver. I'll let you know how it goes. I can confirm that in the buggy version the problems indeed started to happen when erl-pid-counter went over 32k.

Just FYI: Maybe I have github misconfigured, but it seems to not send notifications for message edits. I just saw your fix when I checked back to see if I could do some debugging myself... It does seem to send notifications for new messages.

zwizwa avatar Apr 22 '20 14:04 zwizwa

My bad. I was not fully aware of how gh notifications work. I might amend the PR with some doc fix as follows.

--- a/elisp/erl.el
+++ b/elisp/erl.el
@@ -58,7 +58,7 @@ (defun make-erl-local-pid ()
     pid))
 
 (defun erl-get-counter (counter)
-  "Get and then advance counter."
+  "Get and then advance COUNTER."
   (cl-macrolet ((modf (place y) `(cl-callf mod ,place ,y))
                 (roundf (place) `(cl-callf round ,place)))
     (cl-labels ((increment (cells)
@@ -83,7 +83,7 @@ (defun erl-determine-hostname ()
 
 ;; (NUM SERIAL)
 (defvar erl-pid-counter '((0 . #x8000) (0 . #x2000))
-  "Counters for PIDs.")
+  "Counter for PIDs.")
 
 (defvar erl-node-name
   (intern (format "distel_%S@%s" (emacs-pid) (erl-determine-hostname)))

leoliu avatar Apr 23 '20 01:04 leoliu

I'll keep an eye on the counter :)

Currently at: erl-pid-counter ((8917 . 32768) (0 . 8192))

zwizwa avatar Apr 23 '20 01:04 zwizwa

Things started to misbehave once it wrapped. Currently at ((9 . 32768) (1 . 8192))

My code is a horrible rube goldberg machine so I'm not really clear what exactly happens. I'll look into this some more to see if I can get you a decent error log or a fix.

zwizwa avatar May 01 '20 22:05 zwizwa

I ran a test until the next wrap around and something went wrong again.

In the minibuffer, I get "Buffer "derl [email protected]" has a running process; kill it? (yes or no) "

The beam says: =ERROR REPORT==== 1-May-2020::18:38:55 === ** Node distel_13251@panda not responding ** ** Removing (timedout) connection **

I get an error message in the emacs Messages log: erl-pid->buffer: No buffer for pid [TYPE erl-pid distel_13251@panda 34 1 0]

And the erl-pid-counter is at ((35 . 32768) (2 . 8192))

So it seems that the 34 1 0 in the error message is supposed to be 34 2 0.

zwizwa avatar May 01 '20 22:05 zwizwa

I think I found the issue in:

(defun erl-pid->buffer (pid) "Get PID's buffer." (or (cdr (assoc (erl-pid-id pid) erl-process-buffer-alist)) (error "No buffer for pid %S" pid)))

That only uses erl-pid-id and not erl-pid-serial.

My code fails because right after wraparound 1->2 this no longer finds a registered process due to erl-pid->buffer failing for [TYPE erl-pid distel_13251@panda 34 1 0]

zwizwa avatar May 01 '20 23:05 zwizwa

Also erl-unenroll-process removes processes if just the id matches. Seems like the fix would be to change the keying for erl-process-buffer-alist.

zwizwa avatar May 01 '20 23:05 zwizwa

@zwizwa thanks for catching this. I have added another commit to the PR which hopefully should fix this issue.

leoliu avatar May 02 '20 03:05 leoliu

Thanks I've switched to baaa46bc4af425ae906a914a5c821dc74a25a9cf for daily use.

zwizwa avatar May 04 '20 18:05 zwizwa

It looks ok now. I've not run into any of these issues again, and currently erl-pid-counter is at ((14075 . 32768) (4 . 8192)) Thanks again.

zwizwa avatar May 10 '20 16:05 zwizwa

@zwizwa Excellent news. I added another commit just now to make sure make-erl-local-pid never overrides existing pids. This bug is much harder to trigger but is possible. Are you able to give it a brief test to make sure everything is still working? Thanks.

leoliu avatar May 10 '20 22:05 leoliu

I switched to df6d495c6c55c4900c2bc4282f762d8fd0ebed04

zwizwa avatar May 11 '20 16:05 zwizwa

I think this broke something. Starting to run into errors like this: EXIT: <0.16.0> [emacs-error "(error "erl-pid-node accessing a non-erl-pid")"] I'll see if I can trace it down.

zwizwa avatar May 11 '20 16:05 zwizwa

It may be a fault of erl-null-pid-p. I have amended last commit to make it behave the same as before. Could you check again and see if the problem persists? Thanks a lot.

leoliu avatar May 11 '20 22:05 leoliu

@zwizwa are you able to try 3b04b664c03670a43080bfd5958313ff6cdafc2d? If there are no issues we can suggest to the maintainers to merge the PR. Thanks.

leoliu avatar May 19 '20 00:05 leoliu

Sorry, busy week. I'll give it a try tomorrow after work.

zwizwa avatar May 19 '20 01:05 zwizwa

Tried 3b04b664c03670a43080bfd5958313ff6cdafc2d but it broke something. Reverted to previous but still broken so this is likely my bug. I'll look again tomorrow with a fresh eye.

zwizwa avatar May 20 '20 05:05 zwizwa

@zwizwa Thanks a lot for testing. BTW there is no hurry we can take as much time as needed.

leoliu avatar May 20 '20 07:05 leoliu

You're welcome. I really appreciate your maintenance effort. Distel is an integral part of my setup. What would I do without it :)

The bug was indeed mine. I'll switch back to the new head after work today.

zwizwa avatar May 20 '20 17:05 zwizwa

It's been running without issues the whole week.

zwizwa avatar May 27 '20 12:05 zwizwa

@zwizwa excellent news!

@massemanet PR #69 is ready to merge.

leoliu avatar May 27 '20 14:05 leoliu