distel
distel copied to clipboard
got a corrupted external term
Hi,
I keep running into this error and I'm a bit lost regarding how to start debugging it. I'm not quite sure how it happens, but in very vague terms it seems to happen when I send too much logging data from Erlang to Emacs. Once this error triggers, the distel communication gets into a bad state, and the only way I've found to fix it is to restart emacs.
=WARNING REPORT==== 12-Apr-2020::10:37:00 === '[email protected]' got a corrupted external term from 'distel_27211@panda' on distribution channel 28280 <<...,131,104,4,97,6,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,188,139,0,0,0,0,0,100,0,0,100,0,3,114,101,120,131,104,2,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,188,139,0,0,0,0,0,104,5,100,0,4,99,97,108,108,100,0,4,99,111,100,101,100,0,13,101,110,115,117,114,101,95,108,111,97,100,101,100,108,0,0,0,1,100,0,6,100,105,115,116,101,108,106,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,0,1,0,0,0,0,0>> ATOM_CACHE_REF translations: none
Taking that apart gives a term
Term = <<
131, %% ETF
104,4, %% SMALL_TUPLE_EXT
97,6,%% SMALL_INTEGER_EXT
103, %% PID_EXT
100, %% ATOM_EXT
0,18,
100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97, %% Node: "distel_27211@panda"
0,0,188,139, %% ID
0,0,0,0, %% Serial
0, %% Creation
100, %% ATOM_EXT
0,0,
100, %% ATOM_EXT
0,3,
114,101,120>>.
followed by another term 131, ....
binary_to_term(Term). gives exception error: bad argument, as does binary_to_term(Pid) below, so it seems to be the PID_EXT.
Pid = << 131, 103, %% PID_EXT 100, %% ATOM_EXT 0,18, 100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97, %% Node: "distel_27211@panda" 0,0,188,139, %% ID 0,0,0,0, %% Serial 0>>. %% Creation
The format seems to be ok, so what causes the term parse error?
Based on https://github.com/massemanet/distel/issues/68#issuecomment-615872930:
the ID of the PID is 48267 which is 1011110010001011. ETF says ID "Only 15 bits are significant; the rest are to be 0." Looks like the 16th bit may be causing trouble.
If I change that Term to something that doesn't have bit 15 set then indeed it does parse with binary_to_term/1. Am I spawning too many processes? I do have a lot of transient processes in the logger I mentioned.
It’s not your fault. I will see if I can send in a fix today. There is likely a bug in distel that does not handle ID wraparound correctly.
Could you try the PR and see if it has any ill effect. Thanks.
Thanks. I've switched to f9c5067 as my daily driver. I'll let you know how it goes. I can confirm that in the buggy version the problems indeed started to happen when erl-pid-counter went over 32k.
Just FYI: Maybe I have github misconfigured, but it seems to not send notifications for message edits. I just saw your fix when I checked back to see if I could do some debugging myself... It does seem to send notifications for new messages.
My bad. I was not fully aware of how gh notifications work. I might amend the PR with some doc fix as follows.
--- a/elisp/erl.el
+++ b/elisp/erl.el
@@ -58,7 +58,7 @@ (defun make-erl-local-pid ()
pid))
(defun erl-get-counter (counter)
- "Get and then advance counter."
+ "Get and then advance COUNTER."
(cl-macrolet ((modf (place y) `(cl-callf mod ,place ,y))
(roundf (place) `(cl-callf round ,place)))
(cl-labels ((increment (cells)
@@ -83,7 +83,7 @@ (defun erl-determine-hostname ()
;; (NUM SERIAL)
(defvar erl-pid-counter '((0 . #x8000) (0 . #x2000))
- "Counters for PIDs.")
+ "Counter for PIDs.")
(defvar erl-node-name
(intern (format "distel_%S@%s" (emacs-pid) (erl-determine-hostname)))
I'll keep an eye on the counter :)
Currently at: erl-pid-counter ((8917 . 32768) (0 . 8192))
Things started to misbehave once it wrapped. Currently at ((9 . 32768) (1 . 8192))
My code is a horrible rube goldberg machine so I'm not really clear what exactly happens. I'll look into this some more to see if I can get you a decent error log or a fix.
I ran a test until the next wrap around and something went wrong again.
In the minibuffer, I get "Buffer "derl [email protected]" has a running process; kill it? (yes or no) "
The beam says: =ERROR REPORT==== 1-May-2020::18:38:55 === ** Node distel_13251@panda not responding ** ** Removing (timedout) connection **
I get an error message in the emacs Messages log: erl-pid->buffer: No buffer for pid [TYPE erl-pid distel_13251@panda 34 1 0]
And the erl-pid-counter is at ((35 . 32768) (2 . 8192))
So it seems that the 34 1 0 in the error message is supposed to be 34 2 0.
I think I found the issue in:
(defun erl-pid->buffer (pid) "Get PID's buffer." (or (cdr (assoc (erl-pid-id pid) erl-process-buffer-alist)) (error "No buffer for pid %S" pid)))
That only uses erl-pid-id and not erl-pid-serial.
My code fails because right after wraparound 1->2 this no longer finds a registered process due to erl-pid->buffer failing for [TYPE erl-pid distel_13251@panda 34 1 0]
Also erl-unenroll-process removes processes if just the id matches. Seems like the fix would be to change the keying for erl-process-buffer-alist.
@zwizwa thanks for catching this. I have added another commit to the PR which hopefully should fix this issue.
Thanks I've switched to baaa46bc4af425ae906a914a5c821dc74a25a9cf for daily use.
It looks ok now. I've not run into any of these issues again, and currently erl-pid-counter is at ((14075 . 32768) (4 . 8192)) Thanks again.
@zwizwa Excellent news. I added another commit just now to make sure make-erl-local-pid
never overrides existing pids. This bug is much harder to trigger but is possible. Are you able to give it a brief test to make sure everything is still working? Thanks.
I switched to df6d495c6c55c4900c2bc4282f762d8fd0ebed04
I think this broke something. Starting to run into errors like this: EXIT: <0.16.0> [emacs-error "(error "erl-pid-node accessing a non-erl-pid")"] I'll see if I can trace it down.
It may be a fault of erl-null-pid-p
. I have amended last commit to make it behave the same as before. Could you check again and see if the problem persists? Thanks a lot.
@zwizwa are you able to try 3b04b664c03670a43080bfd5958313ff6cdafc2d? If there are no issues we can suggest to the maintainers to merge the PR. Thanks.
Sorry, busy week. I'll give it a try tomorrow after work.
Tried 3b04b664c03670a43080bfd5958313ff6cdafc2d but it broke something. Reverted to previous but still broken so this is likely my bug. I'll look again tomorrow with a fresh eye.
@zwizwa Thanks a lot for testing. BTW there is no hurry we can take as much time as needed.
You're welcome. I really appreciate your maintenance effort. Distel is an integral part of my setup. What would I do without it :)
The bug was indeed mine. I'll switch back to the new head after work today.
It's been running without issues the whole week.
@zwizwa excellent news!
@massemanet PR #69 is ready to merge.