erlangpl icon indicating copy to clipboard operation
erlangpl copied to clipboard

Can't analyze Canillita

Open elbrujohalcon opened this issue 7 years ago • 15 comments

When I try to analyze inaka/canillita, I found an error. These are the steps I follow:

$ git clone https://github.com/inaka/canillita.git
$ cd canillita
$ make rel
$ _rel/canillita/bin/canillita console

…then, in another shell…

$ ./erlangpl --node [email protected] --cookie canillita -vvv

…where I get the following output…

DEBUG: Plain args: ["./erlangpl","--node","[email protected]","--cookie",
                    "canillita","-vvv"]
DEBUG: Args: [{node,"[email protected]"},
              {cookie,"canillita"},
              {verbose,3},
              "./erlangpl"]
DEBUG: Loading: epl/priv/.gitignore
DEBUG: Loading: epl/priv/htdocs/asset-manifest.json
DEBUG: Loading: epl/priv/htdocs/favicon.ico
DEBUG: Loading: epl/priv/htdocs/index.html
DEBUG: Loading: epl/priv/htdocs/static/css/main.b813327a.css
DEBUG: Loading: epl/priv/htdocs/static/css/main.b813327a.css.map
DEBUG: Loading: epl/priv/htdocs/static/js/main.1ce390f5.js
DEBUG: Loading: epl/priv/htdocs/static/media/fontawesome-webfont.674f50d2.eot
DEBUG: Loading: epl/priv/htdocs/static/media/fontawesome-webfont.912ec66d.svg
DEBUG: Loading: epl/priv/htdocs/static/media/fontawesome-webfont.af7ae505.woff2
DEBUG: Loading: epl/priv/htdocs/static/media/fontawesome-webfont.b06871f2.ttf
DEBUG: Loading: epl/priv/htdocs/static/media/fontawesome-webfont.fee66e71.woff
DEBUG: Loading: epl/priv/htdocs/static/media/glyphicons-halflings-regular.448c34a5.woff2
DEBUG: Loading: epl/priv/htdocs/static/media/glyphicons-halflings-regular.89889688.svg
DEBUG: Loading: epl/priv/htdocs/static/media/glyphicons-halflings-regular.e18bbf61.ttf
DEBUG: Loading: epl/priv/htdocs/static/media/glyphicons-halflings-regular.f4769f9b.eot
DEBUG: Loading: epl/priv/htdocs/static/media/glyphicons-halflings-regular.fa277232.woff
DEBUG: Loading: epl/priv/htdocs/static/media/header_copy.4bb83c59.png
INFO:  Plugins []
INFO:  Started plugins: [{epl_dashboard_EPL,{ok,<0.90.0>}},
                         {epl_st_EPL,{ok,<0.91.0>}},
                         {epl_version_EPL,{ok,<0.92.0>}},
                         {epl_traffic_EPL,{ok,<0.93.0>}}]
Visit http://localhost:8000/

but once I open my browser at localhost:8000, I get the following error report in that console:

ERROR: timed out while collecting data from node

=ERROR REPORT==== 3-Apr-2017::10:53:07 ===
** Generic server epl_st terminating
** Last message in was {data,
                           {'[email protected]',{1491,227582,456213}},
                           [{process_count,135},
                            {memory_total,37707288},
                            {spawn,[]},
                            {exit,[]},
                            {send,
                                [{{#Port<10020.7107>,<10020.254.0>},1,0},
                                 {{<10020.254.0>,<10020.259.0>},1,0}]},
                            {send_self,[]},
                            {'receive',
                                [{#Port<10020.7107>,1,12},
                                 {<10020.158.0>,5,0},
                                 {<10020.254.0>,1,3}]},
                            {trace,[]}]}
** When Server state == {state,[]}
** Reason for termination ==
** {{timeout,{gen_server,call,
                         [epl_tracer,
                          {command,#Fun<supervisor.which_children.1>,
                                   [<10020.184.0>]}]}},
    [{gen_server,call,2,[{file,"gen_server.erl"},{line,204}]},
     {epl_st,command,2,[{file,"src/epl_st.erl"},{line,120}]},
     {epl_st,generate_sup_tree,1,[{file,"src/epl_st.erl"},{line,104}]},
     {epl_st,'-handle_info/2-fun-0-',2,[{file,"src/epl_st.erl"},{line,76}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
     {epl_st,handle_info,2,[{file,"src/epl_st.erl"},{line,75}]},
     {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,601}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,667}]}]}

=INFO REPORT==== 3-Apr-2017::10:53:07 ===
    application: epl_st
    exited: shutdown
    type: temporary
ERROR: timed out while collecting data from node
ERROR: timed out while collecting data from node
ERROR: timed out while collecting data from node
ERROR: timed out while collecting data from node
ERROR: timed out while collecting data from node

elbrujohalcon avatar Apr 03 '17 13:04 elbrujohalcon

Does canillita start a lot of workers under some supervisor? That's my first guess, because one of the plugins timed out while querying tracer process for children of one of the supervisors.

arkgil avatar Apr 03 '17 14:04 arkgil

Or is that machine under heavy load, by not responding within the default of 5000ms ? From the crash it's doesn't look insanely busy. mmm

ruanpienaar avatar Apr 03 '17 14:04 ruanpienaar

Neither of those.

([email protected])4> [begin timer:sleep(5000), length(processes()) end || _ <- lists:seq(1, 10)].
[132,132,132,132,132,132,132,132,132,132]

(no processes created for about a minute)

And the machine is my computer… super quiet and peaceful, I promise :)

elbrujohalcon avatar Apr 03 '17 14:04 elbrujohalcon

you can certainly try it in your computers as well… canillita is open-source

elbrujohalcon avatar Apr 03 '17 14:04 elbrujohalcon

Well, i see there's missing tests! i might volunteer later ;)

ruanpienaar avatar Apr 03 '17 14:04 ruanpienaar

It looks like supervisor:which_children is a gen_server:call with timeout set to infinity under the hood. The only question is: why doesn't it return?

arkgil avatar Apr 03 '17 15:04 arkgil

I suspect this line might be a problem. Canillita doesn't start its top level supervisor, it returns its own PID in start/2 application callback, so call to application_master:get_child/1 returns application master itself. Application master expects only certain messages (https://github.com/erlang/otp/blob/1526eaead833b3bdcd3555a12e2af62c359e7868/lib/kernel/src/application_master.erl#L347) and discards any message it doesn't understand. Since we're waiting infinitely, there is a timeout on the side of gen_server:call.

In this case I'd recommend adding dumb top supervisor to canillita, if that wouldn't interfere with its design.

You can build erlangpl without failing plugin though. Clone the repo and follow these steps:

  • make ui
  • make rebar
  • make
  • rm -rf deps/epl_st
  • uncomment this line
  • ./bootstrap

This should build erlangpl without plugin which causes the crash. Let me know if you encounter any more problems 🙂

arkgil avatar Apr 03 '17 15:04 arkgil

We're working on the plugin system which would allow builidng erlangpl with only selected plugins a breeze, so such cases will be much easier to troubleshoot.

arkgil avatar Apr 03 '17 15:04 arkgil

Well, @arkgil … almost any app built with sumo_rest (like Canillita) will not have a supervision tree. They just don't need one.

I understand that returning self() might be incorrect. What would be the idiomatic OTP-ish way to implement an app that doesn't require a supervision tree, at all?

elbrujohalcon avatar Apr 03 '17 15:04 elbrujohalcon

I wouldn't call myself an authority when it comes to OTP, but in such cases I always start dumb dangling supervisor. Of course we could check if returned PID is not the same as application master's PID, but even Erlang Manual mentions that start/2 should return top supervisor's PID. Also all standard OTP applications follow this behaviour.

I'm trying to run observer with canillita to se how it handles such cases.

arkgil avatar Apr 03 '17 16:04 arkgil

Good points, @arkgil. I'm looking for the right way to solve this issue with my colleagues. I agree that returning {ok, self()} is not the way to go.

elbrujohalcon avatar Apr 03 '17 16:04 elbrujohalcon

Setting up dummy supervisor isn't the most OTPish way too. In the end, why would we need one if there is nothing to supervise?

arkgil avatar Apr 03 '17 16:04 arkgil

Would it help in anyway to be a bit more rebust? i've seen people do the {ok,self()} thing more than once.

ruanpienaar avatar Apr 04 '17 07:04 ruanpienaar

We might check if pid returned by application_master:get_child/1 is not the same as application master's pid. But it's only one of the cases. As a placeholder somebody could start a process which loops infinitely and discards all messages, which will result in the same crash as reported in this issue.

I'd be very happy to fix this, so if you have any idea, please let us know here 🙂

arkgil avatar Apr 04 '17 07:04 arkgil

Sure, any suggestions?

On the other hand I'm glad we dscovered this issue in Canillita. It manifests in obscure way, but indicates lack of OTP compliance.

michalslaski avatar Apr 04 '17 07:04 michalslaski