
Debugging in the cloud


The Java SE docs list a number of troubleshooting tools: the debugger, JFR, various profiling tools. We might bring any number of these to bear. I think there are a few axes we can look at: local versus deployed in the cloud; live / snapshot / post-hoc; a "plain" Java FDK invocation versus a Flow invocation; and a cooperative container versus an unadorned one (together with some kind of sidecar).

There are plenty of reasons why we might want to extend debugging to a cloud-deployed function: it may be plumbed in against different persistence services than we use locally; we might have connectivity trouble; we might see strange behaviour with a particular client that doesn't show up in local testing; and so on.

Live debugging, locally / in the cloud

The scenario here may be complicated because a developer's machine is sitting behind a firewall/proxy. Additionally, we don't necessarily want any Tom, Dick or Harry to be able to connect a debugger to our running images (the images might be proprietary), so we'll need this to be authenticated somehow.

On-demand debugging with a cooperative instance, single function call

This is relatively doable - modulo the restriction on instance lifetime, it can basically be put together without much help from the functions platform - assuming that functions have reasonable outgoing network connectivity.

  • I deploy my app configured (amongst other things) with a key that'll let the instance unlock the debug connection details.

  • I launch my debugger. For a Java debugger, the JVM can be told to make an outbound connection to a listening debugger, which simplifies the setup considerably (see the entrypoint sketch after this list).

  • If I'm on the internet, I'm basically done. If I'm behind a proxy server, however, I need to make a tunnel connection to some bouncer that'll forward a connection to my debugger locally. The tunnel server sends me details of how the JVM should connect to it (endpoint address, credentials) and waits.

  • I make a customised call to the functions server to invoke my function. I pass additional information - the endpoint connection details, encoded so that only my app can decode them with its configured debug key.

  • On launch, the presence of the debug header causes the container's entrypoint to extract the connection details and launch the JVM with the debug agent loaded. It connects via the supplied address to the bouncer which forwards the connection to my debugger. Bob's your uncle.
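As a concrete illustration of the last two steps, here's a minimal sketch of such a cooperative entrypoint. The FN_DEBUG_CONFIG variable, the decryptWithAppKey helper and the jar path are hypothetical; only the -agentlib:jdwp syntax (with server=n for an outbound connection) is standard JVM behaviour.

```java
// Minimal sketch of a cooperative entrypoint. FN_DEBUG_CONFIG, decryptWithAppKey and the
// jar path are hypothetical; only the -agentlib:jdwp flag itself is standard.
import java.util.ArrayList;
import java.util.List;

public class DebuggableEntrypoint {

    public static void main(String[] args) throws Exception {
        List<String> cmd = new ArrayList<>(List.of("java"));

        // The invocation passes the (encrypted) debug endpoint through as config/env.
        String debugConfig = System.getenv("FN_DEBUG_CONFIG");
        if (debugConfig != null) {
            // Stand-in for "decode with the debug key this app was deployed with".
            String endpoint = decryptWithAppKey(debugConfig); // e.g. "bouncer.example.com:7777"
            // server=n: the JVM dials OUT to the listening debugger/bouncer, which suits
            // containers that only have outgoing connectivity.
            cmd.add("-agentlib:jdwp=transport=dt_socket,server=n,suspend=y,address=" + endpoint);
        }

        cmd.addAll(List.of("-jar", "/function/app.jar"));

        // Launch the real function JVM and propagate its exit status.
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.exit(exit);
    }

    private static String decryptWithAppKey(String payload) {
        // Placeholder: a real implementation would decrypt using the deploy-time debug key.
        return payload;
    }
}
```

If the debug header/config is absent, the entrypoint just launches the function as normal, so the same image can serve both debug and regular invocations.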

(With a little more sleight-of-hand, debuggers that require incoming connections to the debugged process could be managed also; it's a matter of configuring the tunnel/bouncer service correctly.)

For a hot function, we'd need to ensure that the debugger was correctly turned off / disabled (or the function instance exited after debug) so that other requests that are routed to it don't hit breakpoints that've been left behind.

How this might be improved with assistance from the functions platform

  • being able to selectively turn off process timeouts

  • we can potentially attach a process debugger to a local PID; this has a bunch of downsides (it's read-only and freezes the target process whilst it's connected). A sidecar debugger that can relay connections from a desktop tool might be able to do this with access to the process space.

What this feels like

This is basically a traditional debugging scenario.

Flow

There's a major issue with this: re-invocations of the function are effectively forks; they are new processes. Whilst we might collaborate with the flow-service in order to deliver the right headers, there are two main problems:

  • firstly, we'd ideally like breakpoints set in the first invocation of a function to persist as far as our debugger view is concerned. We might be able to get away with another "suspend on launch" or similar.

  • secondly - this may well be more of a barrier to usability - each Flow execution is its own process. Do we launch multiple user debugger instances? Multiple processes may be running at once. (How do IDE debuggers cope with forking processes, if at all?)

    Possibly one short-term approach here would be to fire up half a dozen (or so) debuggers, each awaiting an incoming connection and each listening on a different port. The bouncer/tunnel would need to target a new debugger instance for each incoming connection (a toy relay along these lines is sketched after this list). This might prove unusable from a user perspective; we'd need to experiment.

  • One option here is having a user-side tool which can collaborate and knows about the Cloud Threads architecture: on a new invocation, any breakpoints etc. are stored and that configuration is retained. When additional debug connections come in from new cloud futures, the stored debug configuration is restored to the new future before it is run.

    The bouncer/tunnel could potentially assist with this by intercepting debugger traffic and inspecting it, keeping some picture of the desired state and relaying it to new instances.
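To make the fan-out idea above concrete, here's a toy sketch of a bouncer that accepts each outbound JDWP connection from a newly forked Flow process and relays it, round-robin, to one of several debugger instances already listening locally. The ports are arbitrary and there's no authentication or TLS; this is a sketch of the shape, not a design.

```java
// Toy sketch of "one debugger instance per incoming Flow process". Ports are arbitrary;
// a real bouncer would also need authentication and encryption.
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class DebugBouncer {
    // Local debuggers pre-started in "listen for remote JVM" mode on these ports.
    private static final int[] DEBUGGER_PORTS = {6001, 6002, 6003, 6004, 6005, 6006};
    private static final AtomicInteger NEXT = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        try (ServerSocket incoming = new ServerSocket(7777)) { // the address the function JVMs dial out to
            while (true) {
                Socket fromFunction = incoming.accept();
                int port = DEBUGGER_PORTS[NEXT.getAndIncrement() % DEBUGGER_PORTS.length];
                Socket toDebugger = new Socket("localhost", port);
                pump(fromFunction, toDebugger);
                pump(toDebugger, fromFunction);
            }
        }
    }

    // Copy bytes one way on a background thread; two pumps make the relay bidirectional.
    private static void pump(Socket from, Socket to) {
        new Thread(() -> {
            try (InputStream in = from.getInputStream(); OutputStream out = to.getOutputStream()) {
                in.transferTo(out);
            } catch (Exception e) {
                // one side closed; the paired pump will fail and tear down the other side
            }
        }).start();
    }
}
```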

Snapshots

The idea here is that the function is pre-deployed and presumably under some kind of user-driven load.

  • The user asks for a snapshot. This might be keyed to a number of criteria (breakpoints, other conditions).

  • Additional configuration is supplied / available to an instance. On launch (or possibly later, for hot functions, which would definitely require more assistance from the functions platform), a function can be configured to start up a debugging agent; note that this can potentially be done with the stock agent. A nearby service (potentially in-container, with a cooperative image) connects to this agent and supplies the appropriate breakpoints / watch conditions (see the sketch after this list).

  • This setup continues as long as the snapshot request is live. Once a hit is made, the debugger needs to extract salient information and send it via a side-channel to its target repository.

  • For one-shot snapshots, the condition is then marked as no longer live. If we get more than one hit, one snapshot wins (or we collect a bunch of them), but once the trigger is marked as done, future function invocations will not set up the breakpoint condition.
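To make the "nearby service" step concrete, here's a hedged sketch of how it might drive the stock agent over JDWP using JDI: attach, arm one breakpoint, capture the visible locals on the first hit, then retire the trigger. The class name, line number, port and shipSnapshot() are placeholders; the JDI calls themselves are standard.

```java
// Hedged sketch of a snapshot service attaching to a function JVM started with, e.g.,
// -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005
// Class name, line, port and shipSnapshot() are placeholders.
import com.sun.jdi.*;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import com.sun.jdi.event.*;
import com.sun.jdi.request.BreakpointRequest;
import com.sun.jdi.request.EventRequest;

import java.util.Map;

public class SnapshotAgent {

    public static void main(String[] args) throws Exception {
        AttachingConnector connector = Bootstrap.virtualMachineManager().attachingConnectors().stream()
                .filter(c -> c.name().equals("com.sun.jdi.SocketAttach"))
                .findFirst().orElseThrow();
        Map<String, Connector.Argument> connArgs = connector.defaultArguments();
        connArgs.get("hostname").setValue("localhost");
        connArgs.get("port").setValue("5005");
        VirtualMachine vm = connector.attach(connArgs);

        // Arm the breakpoint the snapshot request asked for. (The class must already be
        // loaded; a fuller version would also watch ClassPrepareEvents.)
        ReferenceType type = vm.classesByName("com.example.fn.HelloFunction").get(0);
        Location location = type.locationsOfLine(42).get(0);
        BreakpointRequest bp = vm.eventRequestManager().createBreakpointRequest(location);
        bp.setSuspendPolicy(EventRequest.SUSPEND_EVENT_THREAD);
        bp.enable();

        // Wait for the first hit, grab the visible locals, then disable: a one-shot snapshot.
        while (true) {
            EventSet events = vm.eventQueue().remove();
            for (Event event : events) {
                if (event instanceof BreakpointEvent hit) {
                    StackFrame frame = hit.thread().frame(0);
                    Map<LocalVariable, Value> locals = frame.getValues(frame.visibleVariables());
                    shipSnapshot(locals); // placeholder: push to the side-channel / snapshot bucket
                    bp.disable();         // one-shot: don't trip again
                    events.resume();
                    vm.dispose();
                    return;
                }
            }
            events.resume();
        }
    }

    private static void shipSnapshot(Map<LocalVariable, Value> locals) {
        locals.forEach((variable, value) -> System.out.println(variable.name() + " = " + value));
    }
}
```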

This approach feels quite "cloudy"; it's really appealing. It needs a nearby, fast source of data that the container (resp. the functions platform) itself can configure. It needs a way to rapidly shuttle a snapshot result off the container and into a bucket for later perusal. For an in-container (cooperative) situation, we'd need to ensure that the data is delivered before the container shuts down.

If functions are primarily running "hot" (ie, in relatively long-lived containers) then we may need a way to know where those containers are and to configure them after they have been initially set up. The functions platform would need to cooperate there ("signalling" debug hooks that new configuration is available). Alternatively, every hot container (or a fraction of them) could collaborate with a fast message queue if we were rolling this as something that didn't rely on the functions platform for support.

Pushing config: with some kind of nearby sidecar that's able to attach to the debug agent, we'd still want to know which functions are deployed to our local host. We'd need to subscribe to a topic that supplies debug information for them. It'd be helpful if functions had some kind of locality, to avoid having to know about every single snapshot request. This might be a more efficient architecture but seems potentially a great deal more complicated.

Given a message bus which we can rapid-fire messages into, similar uses of the same facility (eg, the "ad hoc logging" facility) are variations on this theme.

The second major consideration is what the API for this looks like. We need to be able to get a bunch of requirements from the user (file / line number and condition, at the least) into the platform.
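As a starting point for that API discussion, here's one possible shape for the user-facing request. None of this is an existing Fn / fdk-java API; it just enumerates the minimum fields the platform would need in order to arm a snapshot.

```java
// Hypothetical snapshot-request payload; not an existing Fn or fdk-java API.
public record SnapshotRequest(
        String functionId,      // which deployed function to watch
        String className,       // e.g. "com.example.fn.HelloFunction"
        int line,               // source line to break on
        String condition,       // optional expression, e.g. "order.getTotal() > 1000"
        boolean oneShot,        // retire the trigger after the first hit?
        long expiresAtEpochMs   // stop arming new invocations after this time
) {}
```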

Flow

Ironically, there's practically no additional difficulty in the situation where we're doing live snapshots of a cloud-thread invocation: every function call is treated the same. The main technical barrier here is identifying the right breakpoint spots - this might be complicated by the use of lambdas. Fiddly technical details abound.

zootalures · Nov 03 '17 13:11