Allow copying/renaming a pipeline/repo without re-executing

Open gabrielgrant opened this issue 7 years ago • 8 comments

What is the goal / desired outcome?

As a user, I want to be able to copy or rename a pipeline or repo. I expect this to be a (very) light-weight operation, and for repo history to be preserved.

What is your proposal for a feature to solve this?

Add a copy-repo command that creates a clone of a repo having only a different name (ie with all history/provenance intact)

A move-repo command would effectively do the same as a copy, but also deletes the old repo and updates direct-downstream pipelines to take the new repo as input. Ideally this wouldn't require running a new, skip-all-datums job on those direct-downstream pipelines at all (this should be the case for pipeline updates too: #3092). Though the effect is the same as a copy-update-delete, I'd probably expect a move to be lighter-weight in practice, if it can just re-write some etcd keys.
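To illustrate why a move could be lighter-weight than copy-update-delete, here is a minimal sketch of a rename as a key rewrite. It uses a plain dict as a stand-in for a key-value store; the `repos/<name>/...` key layout, the `move_repo` function, and the pipeline spec shape are all hypothetical and do not reflect Pachyderm's actual etcd schema.

```python
def move_repo(store, pipelines, old, new):
    """Rename a repo in a toy key-value store and repoint direct-downstream pipelines.

    `store` maps string keys to metadata; `pipelines` maps a pipeline name to a
    spec dict with an "inputs" list. Both layouts are assumptions for illustration.
    """
    old_prefix = f"repos/{old}/"
    new_prefix = f"repos/{new}/"
    if any(key.startswith(new_prefix) for key in store):
        raise ValueError(f"repo {new!r} already exists")

    # Re-key every entry under the old prefix; the values (commits, branches,
    # provenance) are moved untouched, so no data is copied or recomputed.
    for key in [k for k in store if k.startswith(old_prefix)]:
        store[new_prefix + key[len(old_prefix):]] = store.pop(key)

    # Repoint pipelines that take the old repo as a direct input.
    for spec in pipelines.values():
        spec["inputs"] = [new if repo == old else repo for repo in spec["inputs"]]


store = {
    "repos/images/commits/c1": "commit metadata",
    "repos/images/branches/master": "c1",
    "repos/other/branches/master": "c9",
}
pipelines = {
    "edges": {"inputs": ["images"]},
    "montage": {"inputs": ["edges", "images"]},
}
move_repo(store, pipelines, "images", "photos")
```

In this toy model the work is proportional to the number of metadata keys rather than the amount of data in the repo, which is the property the proposal is hoping for.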

Copy and move pipeline commands work similarly: copy/move the output repo, and update the pipeline spec to reflect the name change. As with downstream repos, this ideally shouldn't result in any new job being created.

I don't have a strong opinion on what should happen with the spec repo's branch for that pipeline, but, in the case of a "move" operation, I'd be inclined to just rename it to the pipeline's current name, and have the rename operation be captured in the historical commits. In the case of a copy, I'd assume a new branch would be created for the new pipeline, which would point at the last version of the copied-from pipeline as its latest commit's parent (i.e. pre-copy historical commits would be shared).

If there is a way to accomplish this today via workaround, what does that require?

The current state of a repo can, in theory, be copied by just copying the files (using pachctl copy-file), though this is challenging due to the lack of a --recursive flag to copy-file (see #3093). This loses history and provenance, though. In theory, historical commits could also be copied, but I'm not sure if it's possible to manually set provenance when creating a new commit for a copy operation. There may also be some way to do this by mucking with the output of pachctl extract.

AFAIU right now a pipeline's name is included in its datums' hashes, so there is effectively no way to copy/rename a pipeline without recomputing everything from scratch. If that check were removed, then extracting a pipeline and creating it again with a different name (but the same salt) should allow it to skip all datums, at least.
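A toy illustration of the point above, assuming a datum is identified by hashing the pipeline name, the salt, and the hashes of its input files. The `datum_hash` function and its inputs are hypothetical; this is not Pachyderm's actual hashing code, only a sketch of why a name in the hash defeats datum skipping after a rename.

```python
import hashlib


def datum_hash(pipeline_name, salt, file_hashes, include_name=True):
    """Toy datum identifier (an assumption, not Pachyderm's real scheme)."""
    h = hashlib.sha256()
    if include_name:
        # With the name mixed in, a renamed pipeline can never match old datums.
        h.update(pipeline_name.encode())
        h.update(b"\0")
    h.update(salt.encode())
    h.update(b"\0")
    for file_hash in sorted(file_hashes):
        h.update(file_hash.encode())
        h.update(b"\0")
    return h.hexdigest()


files = ["a1b2", "c3d4"]  # made-up input file hashes

# Same salt and inputs, different pipeline name.
old_with_name = datum_hash("edges", "salt123", files)
new_with_name = datum_hash("edges-v2", "salt123", files)
old_without_name = datum_hash("edges", "salt123", files, include_name=False)
new_without_name = datum_hash("edges-v2", "salt123", files, include_name=False)

print(old_with_name == new_with_name)        # False: every datum looks new
print(old_without_name == new_without_name)  # True: every datum can be skipped
```

Under these assumptions, dropping the name from the hash is what lets a recreated pipeline with the same salt recognize its previously processed datums.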

gabrielgrant avatar Jul 28 '18 02:07 gabrielgrant

Getting the name of the pipeline out of the computed hash seems like an obviously good idea. How easy would it be to accomplish this in practice?

ajbouh avatar Oct 13 '18 01:10 ajbouh

This would be fairly easy to do; it just comes down to removing a line of code. There will also be some migration pain associated with it, but it's manageable. As we've discussed offline, though, this won't mean that completely disparate pipelines will be able to reuse each other's results. Doing that would be a much larger design decision that I'm not yet convinced would make sense for our system. The ability to rename pipelines would definitely be very nice though, so we'll look to allow that.

jdoliner avatar Oct 19 '18 01:10 jdoliner

We've gotten two more user requests for this in the last week or so. Is this something small enough that we can slot in for 1.9 (as it requires migration, so it won't be part of a 1.8.x)?

JoeyZwicker avatar Apr 03 '19 23:04 JoeyZwicker

This is pretty small implementation-wise. The main thing is the associated pain of migration. A naive implementation would mean that everything has to be recomputed when users migrate to 1.9, which probably isn't tenable for a lot of people, so we'll need to have some backward compatibility built in.

jdoliner avatar Apr 03 '19 23:04 jdoliner

Are we not already going to have to do that for other 1.9 breaking changes? I don't have a good sense of which things will/won't need to be migrated.

JoeyZwicker avatar Apr 03 '19 23:04 JoeyZwicker

Please add me to the list of users requesting this :)

itssimon avatar Nov 13 '19 22:11 itssimon

I hope you guys consider this one again. Many users have yet to migrate to 2.x, which makes this a good feature to consider at this time.

RaananHadar avatar Jan 21 '22 18:01 RaananHadar

I think with 2.0 landing this one is now in reach. The major barrier in 1.x was that the pipeline's name went into the hash we used to identify datums. This meant that renaming a pipeline would make its datums look different; that behavior has been removed. I haven't tested this, but I think it's actually possible to kind of achieve this manually by creating the new output repo, copying over the data from the head of the old output repo, and then extracting and recreating the pipeline with a different name. This can be made more user friendly, but assuming that works, creating a rename-pipeline command that implements that behavior is all that's required to get this one working. While we're at it, we should probably create a rename-repo command and use that in the implementation of rename-pipeline.

jdoliner avatar Jan 31 '22 22:01 jdoliner