Python package installer (pip) via Conda - Reprise - syntax sugar for inline process directive
As discussed in #4664, Conda already supports installing pip packages using a YAML file; corresponding documentation for was added in #4677 .
However, as @pditommaso mentioned, it would be good to also provide syntax sugar to achieve the same with simple inline specification through process directive.
I am going to outline some options for User Interface below, with some pros/cons, for discussion.
Here are my first thinkings:
- The syntax should explicitly reference its usage of
conda, as there are other ways of installing Pip packages, in particularpipitself and newer solutions such aspixi; - The syntax should be clear and not confusing, ie correctly suggests what it does under the hood: uses conda to install pip packages;
- The syntax should be concise enough, striking a good balance with the clarity requirement above;
- The feature is just a syntax sugar, therefore the implementation should be proportionately simple.
So here are the options that came to mind:
-
the accurate and potentially wordy:
conda = 'pip:numpy pip:pandas samtools fastqc pip:matplotlib'- each Pip package is prepended with a
pip:prefix; - super clear that some packages come from Pip and not from Conda packages;
- if list of packages long, gets wordy, due to repeated prefix (but if list long, would probably use a YAML anyway?);
- bonus: flexibility to add Pip and non Pip in the same directive specification.
- each Pip package is prepended with a
-
the concise, but slightly confusing:
conda 'pip: numpy pandas matplotlib'- single instance of a
pip:prefix in the directive, meaning that all packages come from Pip; - concise, with only one extra word regardless of list length;
- potentially confusing to some users? that spurious
pip:prefix has no matches in other Nextflow directives - restriction (though it was not a requirement): cannot mix Pip packages with Conda ones in the same directive specification.
- single instance of a
-
the concise, less confusing, but with implementation overhead:
condaPip 'numpy pandas matplotlib'- directive is called
condaPip, specifies use for Pip packages in its name; - concise as number 2;
- less confusing than number 2, as it avoids exotic spurious prefix in the directive content;
- restriction (though it was not a requirement): cannot mix Pip packages with Conda ones in the same directive specification;
- requires implementing a new process directive.
- directive is called
-
the concise, less confusing, but still unconvincing to me:
conda 'numpy pandas matplotlib', source: 'pip'- directive has an extra
sourceattribute, that can have valuesconda(default) orpip - concise as number 2;
- less confusing than number 2, as it uses a known syntax pattern; however, the mixed use of
condaandpipstill seems somehow confusing to me; - restriction (though it was not a requirement): cannot mix Pip packages with Conda ones in the same directive specification;
- lighter implementation than 3.
- directive has an extra
@pditommaso @bentsherman @ewels @drpatelh I invite your thoughts to flow in - Thanks!
(1) is my favourite 😊
- imo is too verbose ❌
- could would and it would even allow mixing conda with pip e.g.
samtools pip: pandas numpy - does not allow mixing conda & pip ❌
- does not allow mixing conda & pip ❌
conda 'samtools', pip: 'pandas numpy', not super straightforward to be implementedconda samtools pip[pandas numpy], clashes with the use of[]made by pip ❌
Tentative implementation of case 5 here
(5) is my favourite! 🎉
I would like to call for some extra comments around 1. and 5., considering the +1s they received.
-
package-level prefix
conda 'pip:numpy pip:pandas samtools' -
extra attribute for the
condadirectiveconda 'samtools', pip: 'pandas numpy'
On 1., though it may get verbose with many packages, I would like to point out that in the end, it is the same degree of verbosity as the way we support channels:
conda 'package1 bioconda::multiqc bioconda::samtools'
If we accept this verbosity for channels, why not for Pip?
Plus, I would argue this syntax style is familiar to users of channels.
NOTE: we might even think of having a pip:: syntax sugar, i.e. like a channel in disguise.
On 5., what would be the syntax if ONLY pip packages were to be requested? The following one? More user friendly ones?
conda '' , pip: 'pandas numpy'
This case looks a bit awkward to me.
@pditommaso @drpatelh I ask you one more round of feedback based on this additional comments.
@ewels feel free to add any further comments as well if you have any.
Thank you!
I prefer option (1). It's more verbose but also easier to understand, and keep in mind that a process rarely has more than a few conda packages, so it's not like you have to type pip: a hundred times.
I'm near to kill this feature request, if you are smart enough to know you need to mix Conda and Pip for a package dependency. You can also type Conda environment file for it.
This feature is not needed
Yeah I'm inclined to agree. For me the addition of the example to the docs solved 80% of the need here already.
Thanks for the extra thoughts folks.
I do think it is worth adding the sugar syntax, the effort is small.
I did want to get extra feedback on the design before working on it, because there is a user experience component that I believe was worth discussing from multiple perspectives.
Based on the conversation, option 1. seems to me the best way forward.
I have drafted a branch that implements it: https://github.com/nextflow-io/nextflow/tree/add/conda_pip_inline . There is no PR yet because one method is not implemented yet.
Features:
- Uses the
pip:prefix to tellcondato install that package viapip - Works for both Nextflow and Wave (small implementation cost for both; asymmetric experience would be confusing to users)
- At the time I write this, I have implemented 2 key methods in a new Class file that is NOT in its final location in the codebase tree; it is there to demonstrate the implementation.
@pditommaso as we were talking about supporting the functionality in Wave CLI. From wave --help:
--conda-package='' One or more Conda packages used to build the container e.g. bioconda::samtools=1.17
The goal of this issue is then to also support the following for Wave CLI:
--conda-package=pip:numpy
I think that will NOT complicate the implementation. I will post updates here as soon as I open the PR.
After quick chat on this with Paolo, and also quick testing, we should NOT go ahead with this in-line syntax sugar:
- some specific implementations introduce artificial syntaxes that may break with future evolution of the conda syntax
- crucially, from my tests, I have found out that the YAML specification with the extra
piplist of packages is fragile versus the order of chosen channels. In fact, to my surprise, the environment will install a package such asnumpyfromconda-forgeinstead ofpip install, ifconda-forgeis higher in priority thandefaults; in the same env,scipywill instead be installed frompip install(checking on the packages build ids). This makes it unsuitable for hard coding in Nextflow without unnecessarily constraining the user.
TLDR : the pip-conda syntax integration seems fragile and may even change in the future, we should not hard-code it in a syntax sugar, but instead promote usage of a user-created YAML for the purpose (already added in the docs).
Closing.