Explore stateless Funnel/TES
The GA4GH Task Execution Service (TES) is a service with a few endpoints that can provide a thin wrapper over one or more backend job schedulers.
Funnel looks to be that, plus a UI, plus multiple backend implementations and connectors for existing job schedulers.
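For concreteness, a thin wrapper of that kind might look roughly like the sketch below (illustrative only; the method names are not taken from the GA4GH schema):

package tes

// Backend is one underlying job scheduler (local execution, SGE, a cloud
// batch API, ...). A stateless TES server in this spirit would translate
// each API call directly into a call on the configured Backend.
type Backend interface {
    // RunTask submits a task to the scheduler and returns its job ID.
    RunTask(task Task) (string, error)
    // GetJob asks the scheduler for the current state of one job.
    GetJob(id string) (Job, error)
    // ListJobs returns whatever listing the scheduler can provide
    // (possibly nothing, e.g. if the scheduler keeps no job history).
    ListJobs() ([]Job, error)
    // CancelJob forwards a cancellation to the scheduler.
    CancelJob(id string) error
}

// Task and Job are trimmed stand-ins for the GA4GH messages.
type Task struct {
    Name    string
    Command []string
}

type Job struct {
    ID    string
    State string
}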
What's the roadmap? How should this fit with ga4gh and TES?
Possibilities:
- TES is a toy reference implementation that isn't meant to be used for real work; for real work, look to other implementations OR
- TES in the ga4gh repo becomes a useable implementation with plugins for multiple backends, including a local runner, major cloud providers, and common job schedulers.
For UI and Funnel:
- ga4gh later adds a UI that calls TES plus related services; ie, maybe Funnel turns into this OR
- Funnel is a separate project with a UI; it relies only on TES for backends OR
- Funnel is a separate implementation of TES + a UI that's useable. The ga4gh TES remains a reference implementation that isn't meant to be useable for any real work.
TES is a toy reference implementation that isn't meant to be used for real work; for real work, look to other implementations
Correct.
TES in the ga4gh repo becomes a useable implementation with plugins for multiple backends, including a local runner, major cloud providers, and common job schedulers.
This is Funnel.
Funnel is a separate implementation of TES + a UI that's useable. The ga4gh TES remains a reference implementation that isn't meant to be useable for any real work.
This one.
At least, that's my understanding. @kellrott can confirm.
In this case, I think of Funnel as 3 distinct components that could be refactored into 3 separate github repos:
- a robust stateless TES implementation with support for multiple backends; this could eventually become the new ga4gh reference implementation with the community contributing more and better backends
- a standalone stateful implementation for GCP only, which duplicates the queueing and task history aspects of the Pipelines API and adds more features like folder copying and multiple Docker containers in sequence
- a stateful UI for workflows that can use any TES implementation
@kellrott, what do you think?
With this kind of separation, I think we might be able to get some folks from Google/Verily to contribute to one or more of the components. And component 1 would be something that we might get other groups to contribute code to as well.
For additional context, this discussion started here: https://github.com/ohsu-comp-bio/funnel/issues/40#issuecomment-287411950
I think it's a cool idea, and exploring different approaches to TES is healthy (and fun!)
Some questions, so that I better understand the idea:
- a robust stateless TES implementation with support for multiple backends; this could eventually become the new ga4gh reference implementation with the community contributing more and better backends
Could you clarify the term backend? Does a backend conform to an interface which schedulers/UI can use? Or are scheduling/UI backend-specific?
How would the ListJobs endpoint work in a system built on Google Storage and Google pub/sub?
I'm a little fuzzy on how a UI works when the task state is spread all over object storage. Is there something collecting information into a dashboard database?
Getting into some Funnel specifics here, the TES API is implemented here: https://github.com/ohsu-comp-bio/funnel/blob/master/src/tes/server/task_boltdb.go
The rest of the Funnel server code is really about scheduling and worker management. This file implements the internal scheduler API and database: https://github.com/ohsu-comp-bio/funnel/blob/master/src/tes/server/scheduler_service.go
I'm still working on cleaning up the interface between those two components, but it's basically "get some queued jobs" and "update a running job". Roughly:
type TaskDatabase interface {
    // GetQueuedJobs returns up to "n" jobs from the queue
    GetQueuedJobs(n int) []ga4gh.Job
    // UpdateJobState updates the state of a job
    // This is called during a worker state sync
    UpdateJobState(to ga4gh.State)
    // UpdateJobLogs merges the given log with the existing job logs
    // This is called directly by the worker
    UpdateJobLogs(log ga4gh.JobLog)
}
Point being, with some careful thought about those update methods, your idea might fit into the funnel code without much trouble.
One idea would be to get the relatively heavyweight stdout/err logs out of the database. Perhaps if the TES API allowed stdout/err logs to be a URL instead of a string, an implementation could point directly to a "gs://" URL. This might also relax the size limits around these logs and resolve the debates we've been having on this topic (should we stream logs? are they too heavy?).
Also, without logs in the database, it's easier to run a Funnel server in App Engine + Datastore for cheap.
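To make the logs-as-URL idea concrete, the log message could carry either inline text or a pointer into object storage. A minimal sketch, with hypothetical field names rather than the current schema:

package tes

// JobLog sketches the logs-as-URL idea: exactly one of Stdout/StdoutURL
// would be set. Pointing at a gs:// (or other) object keeps heavyweight
// log text out of the task database and relaxes inline size limits.
type JobLog struct {
    Stdout    string // inline log text, for small logs
    StdoutURL string // e.g. a gs:// object holding the full log
    Stderr    string
    StderrURL string
    ExitCode  int32
}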
Answering a question earlier in the thread: Task Execution Schema ( https://github.com/ga4gh/task-execution-schemas ) is the protocol, and Task Execution Server ( https://github.com/ga4gh/task-execution-server ) is the reference implementation. Both have the acronym TES, which is confusing. The confusion, and our desire to start adding more features, led us to 'fork and rebrand' to Funnel. The GA4GH organization likes to have simple reference implementations to support the schemas; the Funnel/reference relationship is similar to that of Dockstore ( https://github.com/ga4gh/dockstore ) and the Tool registry reference implementation ( https://github.com/ga4gh/tool-registry-reference-implementation ). One is the 'simple, easy to understand, but doesn't have many features' version and the other is the 'more complicated and branded' version.
@buchanae -- by backends, I meant job schedulers like local execution, SGE, Slurm, Pipelines API, AWS Batch, Azure Batch. A lightweight stateless TES implementation would be a straight pass-through to the job scheduler, and would only be obligated to support whatever the underlying backend scheduler supports. Eg, SGE provides no historical job database.
This could be simple enough to warrant inclusion in the ga4gh reference implementation. If the backend wrapper introduces a little bit of extra functionality, that would be ok as long as it remains sufficiently lightweight.
For a heavier, stateful implementation specific to a backend (eg, Funnel open-source alternative to Pipelines API), it's fine to have a database, queue, pub/sub, or whatever's needed. A backend-specific heavyweight implementation wouldn't go in the ga4gh repo though. The authors would follow the Funnel model and create a separate repo.
For UI, it ought to be possible to create a standalone application that can be configured to point to any TES implementation -- the reference implementation, which might support multiple backends if it followed the proposal above, or a specific implementation, like Funnel.
UI is tough though, since there's a tendency to customize it to a particular group's needs. That said, tools like Airflow and Spark are open-source, cross-platform, and come with monitoring UIs. Most people don't bother to implement a custom UI on top of them. Maybe a good enough TES UI could serve the same purpose.
@jbingham Gotcha. Thanks for the clarification.
I'm interested to see how a lightweight implementation for SGE, Pipelines API, etc. turns out. How will you download input files to the worker and upload the outputs? Is that maybe a bash wrapper script with scp/gcloud? Or a fuse mount?
Are there any parts of the funnel code which will be helpful? Possibly the worker code, depending on how you do input/output download/uploading. Or possibly we could extract some task message validation. Otherwise, most of Funnel is about scheduling.
The simplest version would be to not do anything about moving files. Just leave it up to each backend what kind of file URLs it accepts. For SGE, that means only local file paths. For Pipelines API, it means only gs:// paths. Funnel could be another backend, and Funnel can support gs:// and possibly others. If people need to move files first, it's not a big deal to do it before calling TES. For a reference implementation, simple seems good. Wdyt?
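As a rough illustration of that pass-through approach (backend names and checks here are hypothetical):

package tes

import "strings"

// AcceptsURL sketches the pass-through idea: each backend simply declares
// which file URL schemes it can use, and a task whose URLs don't match is
// rejected up front rather than staged.
func AcceptsURL(backend, url string) bool {
    switch backend {
    case "sge":
        // SGE jobs only see the (shared) local filesystem.
        return strings.HasPrefix(url, "/") || strings.HasPrefix(url, "file://")
    case "pipelines":
        // The Google Pipelines API works with Google Storage objects.
        return strings.HasPrefix(url, "gs://")
    case "funnel":
        // Funnel supports gs:// and possibly other storage systems.
        return strings.HasPrefix(url, "gs://") || strings.HasPrefix(url, "file://")
    default:
        return false
    }
}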
Q: Does Funnel's scheduling include queueing for quota availability?
The simplest version would be to not do anything about moving files. Just leave it up to each backend what kind of file URLs it accepts. For SGE, that means only local file paths. For Pipelines API, it means only gs:// paths. Funnel could be another backend, and Funnel can support gs:// and possibly others. If people need to move files first, it's not a big deal to do it before calling TES. For a reference implementation, simple seems good. Wdyt?
Sounds good.
Q: Does Funnel's scheduling include queueing for quota availability?
Could you clarify "quota availability"?
Here's a brief overview of scheduling in Funnel:
RunTask adds the task to a queue in the database. Every N seconds, M tasks are pulled off the queue by the scheduler and passed to the scheduler backend.
There are a couple types of backends: heavy vs light
The GCE backend is heavy. It tracks workers, matches resources, picks the best fit, etc. It also has code to scale up the workers (i.e. create new instances).
The HTCondor backend is light. It doesn't track workers or match resources. It calls condor_submit to create a new worker for every job. Resource requirements are passed to condor. The worker shuts down after the job has finished.
I'm on the fence about whether lighter is always better for Funnel scheduler backends. We've discussed trying light backends for Kubernetes and Swarm. If you wanted one VM per job, you could even have a light backend for GCE. SGE and Slurm would probably be similar.
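Roughly, in sketch form (names and details are approximate, not the actual Funnel code):

package scheduler

import (
    "fmt"
    "os"
    "os/exec"
    "time"
)

// Job is a trimmed stand-in for a queued task.
type Job struct {
    ID   string
    CPUs int
}

// Queue mirrors the "get some queued jobs" half of the TaskDatabase
// interface quoted earlier in the thread.
type Queue interface {
    GetQueuedJobs(n int) []Job
}

// Backend receives queued jobs from the scheduler loop.
type Backend interface {
    Submit(Job) error
}

// Loop approximates the description above: every tick, pull up to
// chunk jobs off the queue and hand them to the backend.
func Loop(q Queue, b Backend, tick time.Duration, chunk int) {
    for range time.Tick(tick) {
        for _, job := range q.GetQueuedJobs(chunk) {
            if err := b.Submit(job); err != nil {
                fmt.Println("submit failed:", err)
            }
        }
    }
}

// condorBackend sketches the "light" style: one condor_submit per job,
// resource requirements passed through, no worker tracking.
type condorBackend struct{}

func (condorBackend) Submit(j Job) error {
    // Write a minimal submit description; contents are illustrative.
    desc := fmt.Sprintf(
        "executable = funnel-worker\narguments = %s\nrequest_cpus = %d\nqueue\n",
        j.ID, j.CPUs)
    f, err := os.CreateTemp("", "*.submit")
    if err != nil {
        return err
    }
    defer os.Remove(f.Name())
    if _, err := f.WriteString(desc); err != nil {
        return err
    }
    f.Close()
    // The worker shuts down on its own after the job finishes.
    return exec.Command("condor_submit", f.Name()).Run()
}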
I see. Does Funnel maintain its own queue in order to reuse VMs? For task dependency, so you can wait until earlier tasks complete? Or because there might not be enough VM quota available in the Google cloud project, and you get a quota exceeded error? The last is what I meant about quotas.
Yes, it maintains a queue. There is no task dependency. If a job can't be scheduled to a worker (maybe because of a full quota), the task remains in the queue.
Got it.
@jbingham Should we change the title here to "Allow stateless server backend" ?
I've been thinking more about stateless Funnel/TES. I think it would be really cool if a user could download a Funnel/TES client, have zero server-side setup, and manage their tasks on a wide variety of clusters.
Cool. Are you imagining the TES/Funnel client being a service, library, or CLI?
I think I'm imagining a client, mostly because that simplifies the installation, but I haven't thought through the details. What do you think?
Actually, if it's a client, it could easily be a library as well.
Great! I'm imagining exactly the same.
There could be a library, with all of the backends included. Then the client could call any backend. There could be a lightweight TES server that can be configured to use any of the backends using the library.
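A rough sketch of that layering (names here are invented for illustration): the library exposes all backends behind one interface, and the client or a thin TES server just picks one by name from its configuration.

package teslib

import "fmt"

// Backend is the shared interface every scheduler connector in the
// library would implement.
type Backend interface {
    RunTask(task Task) (string, error)
}

// Task is a trimmed stand-in for the TES task message.
type Task struct {
    Name    string
    Command []string
}

// backends is the library's registry; a local runner, SGE, cloud batch
// connectors, etc. would register themselves here.
var backends = map[string]func() Backend{}

// Register is called by each backend package (e.g. from an init func).
func Register(name string, factory func() Backend) {
    backends[name] = factory
}

// New returns a backend by name; a CLI client and a lightweight TES
// server would both call this with a value from their configuration.
func New(name string) (Backend, error) {
    factory, ok := backends[name]
    if !ok {
        return nil, fmt.Errorf("unknown backend %q", name)
    }
    return factory(), nil
}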
Wanted to note that while progress is slow, this is a topic that we seem to come back to frequently, and we're slowly making progress towards it.
#144 separates task state from worker state, which should allow task runners to exist on their own without being coupled to a worker, database, etc. A file-based TaskService implementation would allow a task runner/worker to read/write task data from a file instead of a gRPC connection.
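For illustration, a file-based implementation could be as small as the sketch below (the interface and names are invented here, not the actual TaskService from #144):

package taskfile

import (
    "encoding/json"
    "os"
)

// Task is a trimmed stand-in for the task message.
type Task struct {
    ID    string `json:"id"`
    State string `json:"state"`
}

// FileTaskService lets a standalone runner/worker read and write task
// data through a local JSON file instead of a gRPC connection.
type FileTaskService struct {
    Path string
}

func (s FileTaskService) GetTask() (Task, error) {
    var t Task
    b, err := os.ReadFile(s.Path)
    if err != nil {
        return t, err
    }
    err = json.Unmarshal(b, &t)
    return t, err
}

func (s FileTaskService) SetState(state string) error {
    t, err := s.GetTask()
    if err != nil {
        return err
    }
    t.State = state
    b, err := json.MarshalIndent(t, "", "  ")
    if err != nil {
        return err
    }
    return os.WriteFile(s.Path, b, 0o644)
}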