SISRS
Port to Python
@reedacartwright and I wanted to start a conversation about possibly porting the bash portions of SISRS to Python. We both feel this would make it easier to maintain in the long run, but the work to do it may certainly be nontrivial. This is something I could possibly do myself or at least come up with a process whereby we all do it incrementally. Initial thoughts @rachelss?
Advantages
- Easier for new people to work on. One common language.
- Easier to test (there are lots of great testing tools for Python)
- Could create a plugin system for adding more assemblers and aligners
- Easier argument parsing
- Easier integration with other tools/the web
Hurdles
- Piping isn't as easy as in bash
- Verifying nothing is broken (should probably do #37 first)
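On the piping hurdle: bash pipes can be reproduced with `subprocess` by connecting one process's stdout to the next one's stdin. A minimal sketch (the `printf`/`sort` pair is just a stand-in for real tools):

```python
import subprocess

def run_pipeline(cmd1, cmd2):
    """Replicate `cmd1 | cmd2` using two subprocesses joined by a pipe."""
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(cmd2, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # allow p1 to receive SIGPIPE if p2 exits early
    out, _ = p2.communicate()
    return out

# Equivalent of: printf 'b\na\n' | sort
print(run_pipeline(["printf", "b\na\n"], ["sort"]))
```

It's more verbose than `|`, but the extra lines buy proper error handling per stage, which the bash version mostly lacks.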
We could use Cython to create Python modules for the C++ libraries that don't have Python modules already available. The process for creating a Python module from a C++ library with Cython is pretty straightforward:
- Set up a project in Python that uses the distutils and cython modules, and specify the C++ files.
- Create a Python file which basically defines the implementation of the C++ class being used. The external C++ class exposes whatever functions it needs to perform the computations in Python. For instance, it might have a function called getSingleEndMapping which would take in file arguments and return a data buffer.
- Create a Python file that handles all the modules and would most likely parse the command line options, I think.
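To make the steps concrete, here is a sketch of what the wrapper scaffolding might look like. The header name, class name, and method signature are all assumptions for illustration; NGM's actual C++ API would need to be checked first.

```cython
# ngm_wrapper.pyx -- hypothetical wrapper; NGMMapper and its header
# are placeholder names, not NGM's real API.
# distutils: language = c++
from libcpp.string cimport string

cdef extern from "NGMMapper.h":
    cdef cppclass NGMMapper:
        NGMMapper() except +
        string getSingleEndMapping(string reads_path, string ref_path)

def get_single_end_mapping(reads_path, ref_path):
    cdef NGMMapper mapper  # stack-allocated C++ object
    return mapper.getSingleEndMapping(reads_path.encode(), ref_path.encode())
```

The matching build script would be short:

```python
# setup.py
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("ngm_wrapper.pyx"))
```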
There are two issues that stand out to me though:
- Passing data between CPP and Python can be tricky and error-prone
- Not sure if some of the libraries (NextGenMap) build any libraries to link against, so we might need to compile them ourselves.
I think it'd be interesting to try, as it would help organize all the different dependencies and allow for a more OOP design, which would help with #36.
I am skeptical that cython would help us out here. The best candidate to be ported to Python is the Bash-based front end, and we don't need cython for that. Unless I missed something.
This is what I had in mind. It's still a work in progress (still trying to figure out the NGM api), but it would allow us to use NGM as a Python module, and that's what I think might help in the process of porting the bash code to Python. As far as I know there isn't an NGM Python module (which is also the case for Bowtie and some of the assemblers, I think).
@rachelss, what is your opinion about migrating the main bash script to python?
With Whitezed mostly wrapping up, I'm in a good place to start working on this.
Looking long term I like @zmertens's idea of potentially creating Python wrappers for some of the C++ assemblers and other projects we're using. The advantage is that it lets users automatically install these dependencies. The problem is we would then have to maintain packages (probably conda packages) for these tools. It's something to look into though.
For now it's fine to just invoke them through subprocesses.
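A small sketch of that subprocess approach, with a thin wrapper that surfaces failures instead of silently continuing (the `bowtie2-build` call shown in the comment is illustrative only):

```python
import subprocess

def run_tool(args):
    """Invoke an external tool and return its stdout, raising on nonzero exit."""
    result = subprocess.run(  # subprocess.run requires Python 3.5+
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout

# A real SISRS call might look like (hypothetical arguments):
#   run_tool(["bowtie2-build", "contigs.fa", "contigs"])
print(run_tool(["echo", "hello"]))
```

Centralizing invocation in one helper like this would also make it easy to log every external command SISRS runs, which helps debugging pipeline failures.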
Anyone ever used Luigi? Looks interesting. I'm checking it out now.
Converting dependent programs to Cython is not going to provide enough of a performance boost to merit the effort involved. The majority of time is spent inside the external programs, and reducing process invocations is not going to yield any noticeable speedup.
@reedacartwright, @zmertens, and I discussed wrapping offline and decided that the programs in question are common enough in the community to require separate installation, rather than trying to maintain python wrappers ourselves.
@rachelss would it be ok to target Python 3 with the port?
Yes we need a complete upgrade to python3. Python 2 is now officially on its way out.
Excellent. Do you happen to know which version of Python 3 it's safe to require of the intended users?
I'm not sure. Can we use 3.6.3? What can we do to ensure backwards compatibility? Or to throw errors if someone is running an old version and there's a compatibility issue? Can we use Docker containers to minimize issues?
I agree that the programs we utilize are common and likely to be installed already by most users of SISRS.
3.6 would be fantastic. That includes every feature I might want to use. It really depends on what you want to support. It's a balancing act between using modern language features and supporting users with old systems. We can certainly aim for 3.6 and issue warnings for anyone using an older python. However, the only real solution for those users would be to install something like Miniconda. In my opinion everyone should be using it anyway, but your users might not agree.
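One way to issue those warnings (or hard errors) for old interpreters is a version guard that runs before anything else imports 3.6-only syntax. A minimal sketch, with the required version as an assumption:

```python
import sys

# Fail fast with a clear message instead of a confusing SyntaxError later on.
REQUIRED = (3, 6)

def check_python_version(required=REQUIRED, current=None):
    """Raise RuntimeError if the running interpreter is too old."""
    if current is None:
        current = sys.version_info[:2]
    if current < required:
        raise RuntimeError(
            "SISRS requires Python %d.%d+ but found %d.%d"
            % (required + tuple(current)))

check_python_version()  # no-op on a modern interpreter
```

The catch Reed mentions still applies: this guard must live in a file that itself parses under Python 2, or users on old systems will see a syntax error before the message.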
@rachelss @BobLiterman just wanted to give an update on this. I've been steadily working on the port, and so far I've basically implemented alignContigs, identifyFixedSites, and outputAlignment. Should be on track to pretty much wrap up sites functionality fairly soon. I'm also adding high-level tests as I go to make sure each command is basically working the same as bin/sisrs (based on the excellent small dataset @BobLiterman provided). I decided to stick with Python 2 for now to avoid changing the current scripts too much. Also the library I'm using for argument parsing (Click) doesn't seem to like Python 3 yet. I'm actually thinking about moving away from Click though. Let me know if you have any questions or requests.
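Those high-level tests boil down to comparing the python port's output files against reference output from bin/sisrs. A sketch of a comparison helper such tests could share (the helper name is mine, not from the repo):

```python
import hashlib

def file_digest(path):
    """MD5 of a file, read in chunks so large alignments don't fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def outputs_match(produced, reference):
    """True if the produced file is byte-identical to the reference file."""
    return file_digest(produced) == file_digest(reference)
```

A test would then run, say, alignContigs on the small dataset and assert `outputs_match` against the checked-in reference output. Byte-identical comparison is strict; if either version emits nondeterministic output (timestamps, unordered lines), a normalizing step would be needed first.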
Oh I've also created a Dockerfile for development and running the tests. This solves the CI issues I was working with @BobLiterman on a while back (see #37). It also should make it much easier for users to get SISRS up and running with all the dependencies. All they need is docker installed. You can see all my changes on this branch: https://github.com/anderspitman/SISRS/tree/python-port
I have been tinkering with some memory reduction adjustments (mainly to getPrunedDict and outputAlignment) and the Python port will really help streamline things.
Don't worry about incorporating these into port v1, as they are easy to plug in down the road, I just wanted to show you what was new on our side. https://github.com/BobLiterman/SISRS/tree/MemorySaver?files=1
@rachelss @BobLiterman I've completed porting of alignContigs, identifyFixedSites, outputAlignment, and changeMissing. The code isn't super clean (and still shells out to unix tools a lot more than it needs to), but I do have integration-level tests in place, so refactoring shouldn't be too dangerous.
At this point I think I'm about ready to officially merge in the python port in some capacity, and it's probably time to start discussing what route forward we want to take. A few thoughts I've had:
- We could simply merge in my branch and have the bash/python versions live side-by-side (currently I have it set up so pip installs both sisrs and sisrs-python scripts). My concern with this though is that if a user runs into a missing feature in the python version, they'll simply switch to the bash version. Or just never use the python version in the first place.
- One alternative would be to put the python code in its own repo. We could call this py-sisrs or something like that. Bumping it to SISRS2 might be a better option, and would probably help with user adoption.
- In order to really get this port off the ground, we're going to need at least one new killer feature. No one (including you guys, I would guess) is going to use this thing if it doesn't bring anything new to the table, because right now the only real advantages have to do with an improved development experience and more maintainable code. Plus there are still going to be a lot of missing features and other warts in the short term. So can you guys think of anything I could implement that would convince you to start using the python port today, in spite of the drawbacks? Maybe running on multiple nodes?
My goal at this point is that any new features get added to the python port, rather than the bash script. So really my question is what do I need to do to get the port to a point where that's feasible in your eyes?
What about something like a 'scheduler mode' where if the user was running SISRS on a multi-node cluster with a scheduler (Torque, SLURM, etc), the program could auto-generate scripts to be submitted (based on a template script supplied)? Then, steps like Bowtie could be run in parallel through automated script submission? Just spitballing things I've thought of.
Hey @BobLiterman, I've been reading up on distributed computing, since that does seem like an obvious big feature to add. For your proposed scheduler mode, would we need to implement compatibility with multiple schedulers, or would SLURM be sufficient? Also, SISRS uses pretty large data files. How is the data normally distributed for computations like this?
Theoretically, one could provide a dummy script for whatever scheduler they have, and the SISRS script could use that to generate multiple HPC scripts, independent of the scheduler. The user supplies the dummy script and the submit command [sbatch, qsub, etc.] as arguments.
In terms of data, we may be able to generate some guidelines based on the data to estimate final data sizes?
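The dummy-script idea maps naturally onto `string.Template`: SISRS fills a single placeholder in the user's scheduler-specific template, once per parallelizable step. A sketch under that assumption (the SLURM header and bowtie2 commands are illustrative only):

```python
from string import Template

# Hypothetical user-supplied dummy script; ${command} is the only
# placeholder SISRS would fill. Everything else is scheduler-specific
# and opaque to SISRS, which is what makes this scheduler-independent.
DUMMY = """\
#!/bin/bash
#SBATCH --ntasks=1
${command}
"""

def render_job_scripts(template_text, commands):
    """Produce one job script per SISRS command from the user's template."""
    tmpl = Template(template_text)
    return [tmpl.substitute(command=cmd) for cmd in commands]

scripts = render_job_scripts(
    DUMMY, ["bowtie2 ... sample1", "bowtie2 ... sample2"])
```

SISRS would then write each rendered script to disk and submit it with the user-supplied command (sbatch, qsub, ...), never needing to understand the scheduler itself.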
That scheduler script conversion would be slick, but likely a lot of work. You're essentially transpiling between multiple languages. Getting something working would probably be pretty straightforward, but you could get killed by the corner cases. Not saying it isn't worth doing.
For data, my question is more general. If I do a run of SISRS on my local machine, all the input data is there on my hard drive. If I'm trying to distribute that work across nodes, some or all of that data is going to have to be made available to those nodes. How does that usually work? I have very little cluster programming exposure.
A) Scheduler scripts just make system calls, so it really wouldn't be hard to implement. I actually do it myself manually now. I have a script generator I can share to give you a sense of it.
B) Data on a cluster is also stored such that it's accessible by all nodes typically. No worries there.
A) ok maybe I'm misunderstanding the nature of the problem. I'd like to take a look at that generator for sure
B) excellent
Emailed. Couldn't attach here from phone
Thanks!
@BobLiterman interacting with job schedulers seems doable, but I'm concerned it might not be the biggest bang for the buck. The risk we run is the entire python port never being used. If SISRS is going to go away in the not-too-distant future, then this isn't too big of a deal. But I think we have bigger aspirations for the project, and I really think SISRS will be much more maintainable and extensible if it's written in Python. What I'm trying to do right now is get the port over that last hump into viability. But since I don't use SISRS on a daily basis I can't say what the best way to do that would be. Do you have any other ideas? @rachelss any input?