JARs should be self-contained and not rely on external virtualenvs [or use Storm hooks and get rid of SSH]

Open dan-blanchard opened this issue 10 years ago • 14 comments

I've been looking over the Pyleus code a little, and one thing they do that makes deployment simpler is that they create the entire virtualenv inside the JAR instead of having it reside on the servers. They don't require SSH at all for anything, because they require people to have Storm installed somewhere on their path, and then they just use the storm command directly and pass it the nimbus host and port. I end up installing storm with streamparse anyway so that I can run storm ui, and I don't think I'm alone there.

If we switched to putting everything in the JAR, then we wouldn't have to worry about anything with SSH anymore and could hopefully get rid of our dependency on fabric, since that's not Python 3 compatible (as I keep mentioning :smile:).

dan-blanchard avatar Jan 22 '15 19:01 dan-blanchard

If we switch to putting everything in the jar, wouldn't that imply compiling the venv on the user's machine, which may be different from the deployment target?

codywilbourn avatar Jan 22 '15 21:01 codywilbourn

Yes, but isn't it probably a good idea for people to be developing in a VM that's the same as the deployment target anyway?

dan-blanchard avatar Jan 22 '15 21:01 dan-blanchard

This seems like a great idea to me. As long as it's configurable for anyone that ends up with compilation issues, this would be a nice win for simplicity IMO.

coffenbacher avatar Mar 31 '15 00:03 coffenbacher

I actually just thought of an interesting way we might be able to make this work. Apparently Storm supports hooks that get triggered on certain events. As part of that, each hook can have a prepare method that is called when the TopologyContext is put together, which happens before the actual ShellBolt and ShellSpout prepare methods are called. It would take a tiny bit of JVM code, but we could implement a hook whose only purpose is to build the virtualenv from a requirements file we put in the JAR. That way you're always building on the same architecture, and we wouldn't be bloating the JAR size.

From what I can tell, we would just need to make sure that the virtualenv only gets built once, because TopologyContext.addHook would get called for every component in the topology if we use the topology.auto.task.hooks config setting, which is the simplest way to add hooks.
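
For concreteness, here's a rough sketch of what such a hook might look like. None of this is actual streamparse code: the class name, the /tmp venv location, and the requirements.txt at the JAR root are all assumptions for illustration, and the imports assume the Storm 1.x API (pre-1.0 releases used the backtype.storm packages). It would be registered via topology.auto.task.hooks.

```java
import org.apache.storm.hooks.BaseTaskHook;
import org.apache.storm.task.TopologyContext;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

public class VirtualenvSetupHook extends BaseTaskHook {

    // Serializes venv builds within one worker JVM; the file lock below handles
    // other worker JVMs on the same machine (the "flock stuff" mentioned above).
    private static final Object LOCAL_LOCK = new Object();

    @Override
    public void prepare(Map conf, TopologyContext context) {
        File venvDir = new File("/tmp/streamparse_venvs/" + context.getStormId());
        File lockFile = new File(venvDir.getParent(), context.getStormId() + ".lock");
        lockFile.getParentFile().mkdirs();
        synchronized (LOCAL_LOCK) {
            try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
                 FileLock lock = raf.getChannel().lock()) {
                // Only the first task to get here builds the venv; the rest see it exists.
                if (!venvDir.exists()) {
                    run("virtualenv", venvDir.getAbsolutePath());
                    run(venvDir + "/bin/pip", "install", "-r", extractRequirements());
                }
            } catch (Exception e) {
                throw new RuntimeException("Failed to build virtualenv", e);
            }
        }
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        if (exit != 0) {
            throw new IOException("Command failed: " + String.join(" ", cmd));
        }
    }

    // Copies requirements.txt out of the JAR so pip can read it from disk.
    private static String extractRequirements() throws IOException {
        Path tmp = Files.createTempFile("requirements", ".txt");
        try (InputStream in =
                 VirtualenvSetupHook.class.getResourceAsStream("/requirements.txt")) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp.toString();
    }
}
```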

@amontalenti @msukmanowsky Any thoughts on this?

dan-blanchard avatar Apr 17 '15 15:04 dan-blanchard

The hooks thing seems interesting, but I think I'd like to try pinging the Apache team about supporting a topology-level hook, as opposed to us doing some flock stuff to avoid race conditions from multiple components all executing the same code.

No gotchas with this approach and other package managers like Conda, right? I haven't used Conda yet.

msukmanowsky avatar Apr 20 '15 17:04 msukmanowsky

It would have to be more than a topology-level thing, since it would have to run on every machine the topology is running on. One way to avoid locks and race conditions would be to make each shell component run in its own independent environment.

If we were using conda, we wouldn't even need to store multiple copies of everything that way, because conda hardlinks package files across environments (and it has its own locking mechanism to make sure two conda commands aren't messing with the package index at the same time).

dan-blanchard avatar Apr 20 '15 17:04 dan-blanchard

This is important for me as well; we're using streamparse to deploy machine learning models. Compiling numerical and machine learning libraries during deployment is very painful and takes a lot of time :)

sixers avatar Apr 25 '15 18:04 sixers

+1 for this based on the mailing list request I put in and feedback from @rduplain

westover avatar Aug 27 '15 14:08 westover

This depends on #84 as a prerequisite.

rduplain avatar Oct 16 '15 15:10 rduplain

Where does the status on this sit?

westover avatar Jul 10 '17 14:07 westover

This is mostly waiting on pantsbuild/pex#316 being merged. Once that's in, we'll transition from primarily using virtualenvs to using PEX (see #212). With a PEX, we can ship everything we need inside the JAR. There will need to be a little bit of work done to get around the fact that executable permissions are lost when you create a JAR, but the main hold-up is PEX not supporting manylinux wheels. Without those, you can't really deploy to a Linux machine from OS X or Windows if one of your project's dependencies needs to be compiled.
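
As a rough illustration of the permission issue (not actual streamparse code): the PEX could be shipped as an ordinary JAR resource, copied out at runtime, and marked executable again before it is run. The resource name and paths below are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class PexExtractor {
    // Copies the bundled PEX (hypothetically stored as /topology.pex inside the JAR)
    // to a temp file and restores the executable bit that JAR packaging drops.
    public static File extractPex() throws IOException {
        File target = File.createTempFile("topology", ".pex");
        try (InputStream in = PexExtractor.class.getResourceAsStream("/topology.pex")) {
            Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        target.setExecutable(true, false);
        return target;
    }
}
```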

dan-blanchard avatar Jul 10 '17 15:07 dan-blanchard

@dan-blanchard can I clarify that the current jar sparse command does not bundle the venv like Pyleus did?

westover avatar Sep 15 '17 02:09 westover

Correct. We currently update the venvs on the workers directly via fabric. The hope is that someday we will be able to switch to using PEX instead, but we're waiting on a wheel support PR being merged there.

dan-blanchard avatar Sep 15 '17 18:09 dan-blanchard

If you're interested, my workaround for this is using docker. It's a bit ugly, but bear with me.

see here https://github.com/Richard-Mathie/storm_swarm

Basically the idea is to deploy the storm workers as a docker service using docker swarm, mounting a volume for the virtual environments to exist in. You can then have another service which builds the virtual environment for those storm workers to run from.

If you deploy the services as global, any nodes you add to the swarm automatically get added to the storm cluster, and your venvs get built.

Building is done using pip and a requirements.txt file, which is distributed to the nodes using a docker secret (though they have configs now as well). Change requirements? Then update the docker secret, which triggers a restart, and hence a restart of the storm_venv service. Finally, I have to disable ssh in streamparse and put dummy entries into the nodes list so that it can set the number of workers to deploy to.

I am looking forward to the day though when I just have to submit a JAR to the nimbus.

Richard-Mathie avatar Nov 22 '17 15:11 Richard-Mathie