
Log provider API

Open • jcrist opened this issue 5 years ago • 12 comments

Sometimes a user may want to look at logs of completed workers/clusters. Right now all log handling is backend-specific: users need to be familiar with the cluster backend and the particularities of how logs are handled for it. For example, YARN logs are stored to HDFS and can be accessed with the yarn CLI tool.

It may be useful for dask-gateway to provide a LogProvider class that different log backends could implement. This might look like:

from traitlets.config import LoggingConfigurable


class LogProvider(LoggingConfigurable):
    def get_logs_for_cluster(self, cluster_name, cluster_state):
        """Get the logs for a completed cluster.

        Parameters
        ----------
        cluster_name : str
            The cluster name.
        cluster_state : dict
            Any backend-specific information (e.g. application id, pod name, ...)

        Returns
        -------
        logs : dict[str, str]
            A mapping from job id to logs for that job.
        """
        raise NotImplementedError

    def get_logs_for_worker(self, cluster_name, cluster_state, worker_name, worker_state):
        """Get the logs for a completed worker."""
        raise NotImplementedError

I'd prefer dask-gateway doesn't manage the storage of these logs (although we could if needed); rather, it should be an abstraction around accessing the logs wherever they're being held by some other service/convention.

Possible implementations for our cluster backends:

  • YARN: this is hard as YARN has no Java API for this, but we could hack something up
  • Jobqueue: filesystem backed - logs could be stored in ~/dask-gateway-logs per user, or in some directory managed directly by dask-gateway? (see the sketch after this list)
  • Kubernetes: I'm not sure? There are lots of possible services people might use for logs here. Stackdriver perhaps?
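
For the jobqueue case, a minimal filesystem-backed sketch (building on the LogProvider class above; the ~/dask-gateway-logs layout, the file naming, and the log_dir traitlet are all assumptions rather than an existing convention) might look like:

import os

from traitlets import Unicode


class FilesystemLogProvider(LogProvider):
    """Hypothetical provider that reads logs written under a per-cluster directory."""

    log_dir = Unicode(
        os.path.expanduser("~/dask-gateway-logs"),
        help="Directory under which per-cluster log directories are written",
        config=True,
    )

    def get_logs_for_cluster(self, cluster_name, cluster_state):
        cluster_dir = os.path.join(self.log_dir, cluster_name)
        logs = {}
        if os.path.isdir(cluster_dir):
            for fname in os.listdir(cluster_dir):
                with open(os.path.join(cluster_dir, fname)) as f:
                    logs[os.path.splitext(fname)[0]] = f.read()
        return logs

    def get_logs_for_worker(self, cluster_name, cluster_state, worker_name, worker_state):
        path = os.path.join(self.log_dir, cluster_name, worker_name + ".log")
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
        return ""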

jcrist avatar Aug 23 '19 21:08 jcrist

cc @quasiben, @jacobtomlinson - any input on common logging backends for kubernetes clusters? We don't need to pick a specific one for dask-gateway to use, I just want to make sure our abstraction is general enough to support whatever people may want to use.

jcrist avatar Aug 23 '19 21:08 jcrist

Also cc @yuvipanda for the above kubernetes question.

jcrist avatar Aug 23 '19 23:08 jcrist

I have used stackdriver but only from the UI. There are several k8s logging backends as you've noted but I haven't played much with them so I don't have much input. We could try this out first with what pangeo folks need/want?

quasiben avatar Aug 26 '19 15:08 quasiben

That was my plan. Do you know if they're using anything for logging yet? cc also @jhamman.

jcrist avatar Aug 26 '19 15:08 jcrist

There is a logs() method on Cluster implementations which returns a dictionary of logs. The default behavior is for the Cluster to query the scheduler and workers via the dask comms who can then pass along their logs.
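
For reference, a minimal usage sketch of that method against a LocalCluster (current distributed releases expose it as get_logs()):

from distributed import LocalCluster

cluster = LocalCluster(n_workers=2)
logs = cluster.get_logs()   # mapping of component name -> log text
for name, text in logs.items():
    print(name, text[:200])
cluster.close()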

The intention with SpecCluster implementations is that they override this with a more native implementation, because failure states like tracebacks can't be captured by querying the services directly.

In dask-kubernetes this works by querying the k8s API for the log records (these can be shipped off to stackdriver or other services but are always available from the API server too).

In ECSCluster in dask-cloudprovider it is done by querying AWS CloudWatch.

I had imagined that dask-gateway would query this method on the Cluster objects and pass the results along. And therefore the challenges with yarn would be contained in dask-yarn rather than in the gateway.

jacobtomlinson avatar Aug 27 '19 09:08 jacobtomlinson

Dask-gateway can't use any of the existing dask infrastructure due to differing requirements. I'm familiar with the logs method as described above.

To clarify - the question here for kubernetes is how to get logs from deleted pods, where the logs must then be hosted on a separate service since they'll no longer be available via the kubernetes api (AFAIK?). I'm wondering what services people use for this - stackdriver being one option.

jcrist avatar Aug 27 '19 11:08 jcrist

Dask-gateway can't use any of the existing dask infrastructure due to differing requirements. I'm familiar with the logs method as described above.

This seems like a slight concern here, as surely it will result in duplicated effort. It was my understanding that dask-gateway uses the various cluster implementations under the hood; have I misunderstood?

since they'll no longer be available via the kubernetes api (AFAIK?)

Once a pod has been terminated there is a grace period before it is garbage collected. If you list pods in the API you will see terminated pods. AFAIK you can still request their logs until they get cleaned up. I am trying (and failing) to find docs on the default garbage collection interval.
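
To make that concrete, a minimal sketch with the official kubernetes Python client that lists pods (terminated ones included) and pulls their logs while they are still around; the namespace and label selector here are made up:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Terminated pods still show up in a listing until they are garbage collected
pods = v1.list_namespaced_pod(
    namespace="dask-gateway",  # assumed namespace
    label_selector="app.kubernetes.io/name=dask-gateway",  # assumed labels
)
for pod in pods.items:
    if pod.status.phase in ("Succeeded", "Failed"):
        # Logs remain requestable from the API server until the pod object is deleted
        text = v1.read_namespaced_pod_log(pod.metadata.name, pod.metadata.namespace)
        print(pod.metadata.name, text[:200])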

I would worry about trying to integrate with third-party log collection services, as there are a large number of them.

jacobtomlinson avatar Aug 27 '19 12:08 jacobtomlinson

It's an unfortunate necessity right now. Dask-gateway needs to do many things that the cluster managers don't, and needs to do so in a way that's efficient and secure for multiple users. For example:

  • Start clusters securely for a different user than the launching user
  • Track cluster and worker failures and timeouts through their whole lifecycle, and do so in a way that's efficient for many active clusters
  • Allow clusters/workers to persist through gateway process restarts
  • Serialize any relevant state for each process to a database
  • Support configuration of all of the above through traitlets

The infrastructure here looks a lot more like jupyterhub than the existing dask cluster managers. Perhaps sometime in the future we can find ways to share code, but for now reimplementing was way more efficient (and really didn't take that much time).
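
On the traitlets point, a log provider would presumably be wired up the same way as the other configurables. A hypothetical dask_gateway_config.py fragment (no such traitlets exist today; the names are made up):

# dask_gateway_config.py -- hypothetical log provider configuration
c.DaskGateway.log_provider_class = "mymodule.KubernetesLogProvider"
c.KubernetesLogProvider.namespace = "dask-gateway"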

I would worry about trying to integrate with third parts log collection services as there are a large number of them.

Yes. The main goal here is to pick a good first one and use it as a case study when designing the LogProvider class abstraction. I don't expect dask-gateway to natively support all logging backends, I just want to make sure the plugin point we provide can support them if someone wants to add support for one.

Do you have a suggestion for a common logging provider (or a few) used by kubernetes users?

jcrist avatar Aug 27 '19 12:08 jcrist

Perhaps sometime in the future we can find ways to share code

I think putting effort in to achieve this would be valuable in the long run.

Do you have a suggestion for a common logging provider (or a few) used by kubernetes users?

Each public cloud provider has their own; this may be a reason to avoid them for now. I would imagine the ELK stack is a popular place to send logs from in-house clusters, or perhaps Graylog or Splunk.

jacobtomlinson avatar Aug 27 '19 12:08 jacobtomlinson

An in-cluster ELK stack could handle this. Dask Gateway could then query ES and/or provide a link to the relevant query in Kibana. Tons of added complexity though. It might be best to disable it by default but make it available for users willing to incur the operational overhead.
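
As a rough illustration of what "query ES" could look like, assuming logs are shipped (e.g. via Filebeat/Fluentd) into an index like dask-gateway-logs with a cluster_name field on every record - all of these names are made up:

import requests

# Hypothetical index/field names; adjust to however the ELK stack is configured
resp = requests.post(
    "http://elasticsearch:9200/dask-gateway-logs/_search",
    json={
        "query": {"term": {"cluster_name": "my-cluster"}},
        "sort": [{"@timestamp": "asc"}],
        "size": 1000,
    },
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("message", ""))

The same query could back a deep link into Kibana instead of returning the raw text.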

droctothorpe avatar Nov 30 '20 15:11 droctothorpe

After a year of letting this linger, I'm leaning towards the following:

  • Writing up a generic LogProvider class allowing for alternative implementations and customization by users as needed
  • Writing a KubernetesLogProvider class that uses the k8s logging api to pull logs from containers
  • Modifying dask-gateway to not delete stopped worker pods immediately, but after some configurable period (not sure what the default should be, perhaps only delete stopped pods on cluster deletion?)
  • Modifying dask-gateway to not delete stopped cluster pods immediately, but after some configurable period (not sure what the default should be)

This lets us hit the usual requirements of debugging "why did my worker/cluster die" without the additional complexity of custom logging backends (a rough sketch of such a provider follows).
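
A rough sketch of what that KubernetesLogProvider might look like, building on the LogProvider interface above; the pod_name keys in cluster_state/worker_state and the namespace traitlet are assumptions:

from kubernetes import client
from kubernetes.client.rest import ApiException
from traitlets import Unicode


class KubernetesLogProvider(LogProvider):
    """Hypothetical provider that pulls logs from stopped-but-not-deleted pods."""

    namespace = Unicode("dask-gateway", help="Namespace the pods run in", config=True)

    def _pod_logs(self, pod_name):
        # Assumes the kubernetes client config was already loaded
        # (kubernetes.config.load_incluster_config() when running inside the cluster).
        api = client.CoreV1Api()
        try:
            return api.read_namespaced_pod_log(pod_name, self.namespace)
        except ApiException:
            return ""  # pod was already garbage collected

    def get_logs_for_cluster(self, cluster_name, cluster_state):
        # Assumes the scheduler pod name was recorded in cluster_state at submit time
        return {cluster_name: self._pod_logs(cluster_state["pod_name"])}

    def get_logs_for_worker(self, cluster_name, cluster_state, worker_name, worker_state):
        return self._pod_logs(worker_state["pod_name"])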

jcrist avatar Nov 30 '20 15:11 jcrist

Hi. This has been here for a while, so if there has not been progress on it, I wonder if a general strategy to interrogate worker pod logs has emerged as a stop-gap?

jpolchlo avatar Jan 05 '23 17:01 jpolchlo