
Replace SDB with S3 in AWS job store

Open hannes-ucsc opened this issue 8 years ago • 19 comments

Quoting an AWS support professional in case 1767267511:

I would recommend seeing if you would consider DynamoDB to replace your SimpleDB solution. DynamoDB is essentially the successor of SimpleDB, which is slowly being pulled out from active development. In fact, we're no longer offering that to new customers at this point.

If Toil should run on newly opened AWS accounts, we need to phase out SimpleDB.

I propose that we create a new, second implementation of the AWS job store that uses DynamoDB. The new implementation should be accessible under the aws job store locator, while the old one becomes aws_old.

The reason I didn't use DynamoDB in the first place was the payment model, which is based on a flat rate as a function of a configurable ("provisioned" in Amazon lingo) request volume. Toil would have to set that request volume to a user-specified value (with a sensible default) before a workflow starts, and make sure that it resets it to the lowest possible value on exit.
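A minimal sketch of that set-then-reset pattern, assuming a boto3-style DynamoDB client whose `update_table` call accepts `TableName` and `ProvisionedThroughput`. The function name, the `idle_units` parameter, and its default of 1 are made up for illustration and are not part of Toil.

```python
from contextlib import contextmanager


@contextmanager
def provisioned_throughput(client, table_name, read_units, write_units, idle_units=1):
    """Raise a table's provisioned capacity for the duration of a workflow.

    `client` is assumed to behave like boto3.client("dynamodb").
    `idle_units` is a hypothetical "lowest possible value" to fall back to.
    """

    def set_capacity(read, write):
        client.update_table(
            TableName=table_name,
            ProvisionedThroughput={
                "ReadCapacityUnits": read,
                "WriteCapacityUnits": write,
            },
        )

    set_capacity(read_units, write_units)
    try:
        yield
    finally:
        # Runs on normal exit and on failure, so the table is not left
        # billing at the workflow's request volume.
        set_capacity(idle_units, idle_units)
```

The workflow would run inside `with provisioned_throughput(client, "my-table", 50, 25): ...`. The `finally` clause is what covers the "on exit" requirement, though it cannot help if the leader process is killed outright.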

┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-350

hannes-ucsc avatar Jun 10 '16 17:06 hannes-ucsc

We should keep an eye on this. If they start deprecating SDB, we need to start on a DynamoDB job store replacement.

cket avatar Dec 01 '16 00:12 cket

Any chance that this can be revisited?

abatilo avatar Jan 31 '21 01:01 abatilo

@abatilo S3 is now strongly consistent, so this issue is now about replacing SDB with S3. This will probably be worked on relatively soon, actually (sometime in the next few months).

DailyDreaming avatar Jan 31 '21 02:01 DailyDreaming

Would it be a big lift? I would be curious to know if I could help.

abatilo avatar Jan 31 '21 02:01 abatilo

@abatilo Medium sized, I would guess? It still needs to be explored.

Most of the work would involve removing the current SDB functionality, identifying everything it shuttles back and forth (primarily items with job attributes, representing jobs to be processed), and then remapping those operations to fetch files from and put files into S3. Each job would map to a job file in S3: the presence of a file signifies a job yet to be run, and a job that has finished should no longer have a file. Most of the changes will be in the https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py file.
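As a rough sketch of that mapping (not the actual Toil code), the functions below take a boto3-style S3 client and use its `put_object` and `delete_object` calls. The `jobs/` key prefix, the JSON encoding, and the function names are assumptions chosen for illustration.

```python
import json

JOB_PREFIX = "jobs/"  # hypothetical key layout, not Toil's actual one


def job_key(job_id):
    """Map a job store ID to the S3 key of its job file."""
    return f"{JOB_PREFIX}{job_id}"


def save_job(client, bucket, job_id, attributes):
    """Write (or overwrite) the job file; its presence marks the job as pending."""
    client.put_object(
        Bucket=bucket,
        Key=job_key(job_id),
        Body=json.dumps(attributes).encode("utf-8"),
    )


def finish_job(client, bucket, job_id):
    """Delete the job file once the job has finished."""
    client.delete_object(Bucket=bucket, Key=job_key(job_id))
```

With a real client this would be called as `save_job(boto3.client("s3"), bucket_name, job_id, attrs)`, where the bucket name is the AWS job store name.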

Some examples:

Loading a job currently uses a job store ID to look up the job's attributes in SDB: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py

This would need to be changed to use the job store ID to fetch an object by bucket name (the AWS job store name) and key.

Same with deleting a job: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py or listing jobs: https://github.com/DataBiosphere/toil/blob/master/src/toil/jobStores/aws/jobStore.py#L322
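For illustration only, loading and listing could look roughly like this against a boto3-style S3 client, using `get_object` and `list_objects_v2`. The `jobs/` prefix, the JSON encoding, and the local `NoSuchJob` exception are stand-ins, not what the job store actually uses.

```python
import json

JOB_PREFIX = "jobs/"  # hypothetical key layout


class NoSuchJob(Exception):
    """Stand-in for the job store's own 'job not found' error."""


def load_job(client, bucket, job_id):
    """Fetch and decode the job file for job_id."""
    try:
        response = client.get_object(Bucket=bucket, Key=JOB_PREFIX + job_id)
    except client.exceptions.NoSuchKey:
        # A missing object means the job finished or never existed.
        raise NoSuchJob(job_id) from None
    return json.loads(response["Body"].read())


def list_job_ids(client, bucket):
    """Yield the ID of every job that still has a job file."""
    kwargs = {"Bucket": bucket, "Prefix": JOB_PREFIX}
    while True:
        page = client.list_objects_v2(**kwargs)
        for entry in page.get("Contents", []):
            yield entry["Key"][len(JOB_PREFIX):]
        if not page.get("IsTruncated"):
            return
        # Listings are paginated; ask for the next page.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Translating the S3 "no such key" error into a job-level exception in one place would keep callers from needing storage-specific handling.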

There are also some odd spots where a missing job has to be handled in an SDB-specific way, for example: https://github.com/DataBiosphere/toil/blob/98dbf33147ed029800dd43a73ee5c64e83feda7b/src/toil/leader.py#L973

If you want to tackle this, or a portion of it, we'd be happy to have the help and I'd be glad to review code progress on this as well.

DailyDreaming avatar Jan 31 '21 04:01 DailyDreaming

@abatilo We have sprint planning tomorrow and I'm going to propose putting this into the upcoming sprint.

DailyDreaming avatar Feb 10 '21 06:02 DailyDreaming

That's awesome. Thank you

abatilo avatar Feb 10 '21 14:02 abatilo

@DailyDreaming Could we still consider DynamoDB? S3 has throughput limits which might become problematic.

abatilo avatar Feb 17 '21 18:02 abatilo

@abatilo Yes, that's certainly still a possibility. What kind of limits concern you? A quick search indicates S3 limits of 3,500 PUT requests per second and 5,500 GET requests per second (per prefix). I'm not sure we're going to hit those limits, though it does look like DynamoDB has higher limits.

DailyDreaming avatar Feb 25 '21 18:02 DailyDreaming

Members of my informatics team have told me that, with our current S3 usage, we've had pipelines fail because they hit S3 limits. I haven't had time to dig in yet, but that's why I wanted to bring it up here.

abatilo avatar Feb 27 '21 15:02 abatilo

I see. The database is there mostly to enforce strong consistency, so I'd have to investigate how much the request rate would increase (I suspect the increase would mostly come from issuing HEAD requests to check whether a file exists, rather than checking the database).
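The existence check in question would look something like the sketch below, assuming a boto3-style S3 client. Each call costs one HEAD request, which is where the extra request volume would come from. The function name is made up for illustration.

```python
def object_exists(client, bucket, key):
    """Return True if an object exists at `key`, using a single HEAD request.

    `client` is assumed to behave like boto3.client("s3").
    """
    try:
        client.head_object(Bucket=bucket, Key=key)
    except client.exceptions.ClientError as err:
        # HEAD responses carry no body, so a missing key shows up as a
        # generic ClientError with code "404".
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise  # permission errors, throttling, etc. are not "missing"
    return True
```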

DailyDreaming avatar Mar 03 '21 17:03 DailyDreaming

➤ Adam Novak commented:

Since S3 is strongly consistent now, we’re planning to just use that and not DynamoDB.

unito-bot avatar Jan 13 '22 17:01 unito-bot

Will it be possible to use other S3 backends than AWS?

stain avatar Jan 19 '22 10:01 stain

Hello,

It would be interesting to drop the Amazon dependency so that Toil can run on on-premises Kubernetes platforms.

That would mean replacing SDB with something other than another Amazon service such as DynamoDB.

Would it be possible to consider solutions like Redis?

Regards

Guigzai avatar Feb 01 '24 18:02 Guigzai

➤ Adam Novak commented:

Lon is making a cool control flow diagram for this.

unito-bot avatar Feb 13 '24 18:02 unito-bot

We've been following this issue for a long time, hoping that the strongly consistent S3 backend mentioned by @unito-bot would be adopted.

Specifically, we'd like to use Ceph's S3-compatible object storage, which guarantees strong consistency. Ceph is a common cluster storage solution for on-premises Kubernetes, since the Rook operator does the heavy lifting of deploying it.
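On the client side, pointing at an S3-compatible service is mostly a matter of overriding the endpoint. A minimal sketch, assuming boto3 is the client library: `endpoint_url` is a real `boto3.client` parameter, while the helper names and the example URL are hypothetical.

```python
def s3_client_kwargs(endpoint_url=None, region_name=None):
    """Build keyword arguments for boto3.client("s3", ...).

    With no endpoint_url, boto3 talks to AWS; with one, it talks to
    whatever S3-compatible service is listening there.
    """
    kwargs = {}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    if region_name:
        kwargs["region_name"] = region_name
    return kwargs


def make_s3_client(endpoint_url=None, region_name=None):
    """Create an S3 client for AWS or for an S3-compatible endpoint."""
    import boto3  # imported lazily so the helper above has no dependencies

    return boto3.client("s3", **s3_client_kwargs(endpoint_url, region_name))
```

For example, `make_s3_client(endpoint_url="http://ceph-rgw.example:7480")` would target a Ceph gateway at that (made-up) address. Whether the rest of the job store works unchanged against such a backend is a separate question.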

davidjsherman avatar Feb 15 '24 08:02 davidjsherman

We have Ceph now at UCSC, and using Ceph directly (instead of through the shared filesystem) might be interesting.

adamnovak avatar Apr 30 '24 17:04 adamnovak

What could we (at Inria) do to contribute?

davidjsherman avatar May 01 '24 14:05 davidjsherman

Lon will probably be the one who would work on this, though it will be a while before this is added to the sprint. We don't have many internal people using the AWS implementation so we haven't had much spare development time for this.

We have a vague idea of implementing job store plugins, similar to the batch system plugins, so any ideas or recommendations there would be helpful.
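To make the idea concrete, one possible shape is a registry keyed on the locator's scheme (the part before the first colon, as in `aws:...` or `file:...`). Everything below is a hypothetical sketch for discussion, not an existing Toil API.

```python
_JOB_STORE_FACTORIES = {}  # scheme -> callable taking the rest of the locator


def register_job_store(scheme, factory):
    """Register a factory for locators of the form '<scheme>:<rest>'."""
    _JOB_STORE_FACTORIES[scheme] = factory


def open_job_store(locator):
    """Look up the factory for the locator's scheme and call it."""
    scheme, sep, rest = locator.partition(":")
    if not sep or scheme not in _JOB_STORE_FACTORIES:
        raise ValueError(f"No job store registered for locator {locator!r}")
    return _JOB_STORE_FACTORIES[scheme](rest)
```

A plugin package would call something like `register_job_store("ceph", CephJobStore)` at import time, where `CephJobStore` is a made-up class name, and `open_job_store("ceph:my-bucket")` would then construct it.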

Community contributions are of course always welcome. Unfortunately I'm unsure where those contributions could go, as this is Lon's task and I'm unsure of its current progress. If you want, you could ping him and ask where contributions for this could go.

stxue1 avatar Jun 19 '24 02:06 stxue1