nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Python package installer (pip) via Conda - Reprise - syntax sugar for inline process directive

Open marcodelapierre opened this issue 1 year ago • 12 comments

As discussed in #4664, Conda already supports installing pip packages using a YAML file; corresponding documentation for was added in #4677 .

However, as @pditommaso mentioned, it would be good to also provide syntax sugar to achieve the same with simple inline specification through process directive.

I am going to outline some options for User Interface below, with some pros/cons, for discussion.

marcodelapierre avatar Jan 25 '24 08:01 marcodelapierre

Here are my first thinkings:

  • The syntax should explicitly reference its usage of conda, as there are other ways of installing Pip packages, in particular pip itself and newer solutions such as pixi;
  • The syntax should be clear and not confusing, ie correctly suggests what it does under the hood: uses conda to install pip packages;
  • The syntax should be concise enough, striking a good balance with the clarity requirement above;
  • The feature is just a syntax sugar, therefore the implementation should be proportionately simple.

So here are the options that came to mind:

  1. the accurate and potentially wordy:

    conda = 'pip:numpy pip:pandas samtools fastqc pip:matplotlib'
    
    • each Pip package is prepended with a pip: prefix;
    • super clear that some packages come from Pip and not from Conda packages;
    • if list of packages long, gets wordy, due to repeated prefix (but if list long, would probably use a YAML anyway?);
    • bonus: flexibility to add Pip and non Pip in the same directive specification.
  2. the concise, but slightly confusing:

    conda 'pip: numpy pandas matplotlib'
    
    • single instance of a pip: prefix in the directive, meaning that all packages come from Pip;
    • concise, with only one extra word regardless of list length;
    • potentially confusing to some users? that spurious pip: prefix has no matches in other Nextflow directives
    • restriction (though it was not a requirement): cannot mix Pip packages with Conda ones in the same directive specification.
  3. the concise, less confusing, but with implementation overhead:

    condaPip 'numpy pandas matplotlib'
    
    • directive is called condaPip, specifies use for Pip packages in its name;
    • concise as number 2;
    • less confusing than number 2, as it avoids exotic spurious prefix in the directive content;
    • restriction (though it was not a requirement): cannot mix Pip packages with Conda ones in the same directive specification;
    • requires implementing a new process directive.
  4. the concise, less confusing, but still unconvincing to me:

    conda 'numpy pandas matplotlib', source: 'pip'
    
    • directive has an extra source attribute, that can have values conda (default) or pip
    • concise as number 2;
    • less confusing than number 2, as it uses a known syntax pattern; however, the mixed use of conda and pip still seems somehow confusing to me;
    • restriction (though it was not a requirement): cannot mix Pip packages with Conda ones in the same directive specification;
    • lighter implementation than 3.

@pditommaso @bentsherman @ewels @drpatelh I invite your thoughts to flow in - Thanks!

marcodelapierre avatar Jan 25 '24 08:01 marcodelapierre

(1) is my favourite 😊

ewels avatar Jan 27 '24 10:01 ewels

  1. imo is too verbose ❌
  2. could would and it would even allow mixing conda with pip e.g. samtools pip: pandas numpy
  3. does not allow mixing conda & pip ❌
  4. does not allow mixing conda & pip ❌
  5. conda 'samtools', pip: 'pandas numpy', not super straightforward to be implemented
  6. conda samtools pip[pandas numpy], clashes with the use of [] made by pip ❌

pditommaso avatar Jan 29 '24 10:01 pditommaso

Tentative implementation of case 5 here

pditommaso avatar Jan 29 '24 14:01 pditommaso

(5) is my favourite! 🎉

drpatelh avatar Jan 29 '24 14:01 drpatelh

pditommaso avatar Jan 29 '24 14:01 pditommaso

I would like to call for some extra comments around 1. and 5., considering the +1s they received.

  1. package-level prefix

    conda  'pip:numpy pip:pandas samtools'
    
  2. extra attribute for the conda directive

    conda  'samtools',  pip: 'pandas numpy'
    

On 1., though it may get verbose with many packages, I would like to point out that in the end, it is the same degree of verbosity as the way we support channels:

conda 'package1  bioconda::multiqc  bioconda::samtools'

If we accept this verbosity for channels, why not for Pip?
Plus, I would argue this syntax style is familiar to users of channels.
NOTE: we might even think of having a pip:: syntax sugar, i.e. like a channel in disguise.

On 5., what would be the syntax if ONLY pip packages were to be requested? The following one? More user friendly ones?

conda '' , pip: 'pandas numpy'

This case looks a bit awkward to me.

@pditommaso @drpatelh I ask you one more round of feedback based on this additional comments.
@ewels feel free to add any further comments as well if you have any.

Thank you!

marcodelapierre avatar Jan 31 '24 12:01 marcodelapierre

I prefer option (1). It's more verbose but also easier to understand, and keep in mind that a process rarely has more than a few conda packages, so it's not like you have to type pip: a hundred times.

bentsherman avatar Jan 31 '24 14:01 bentsherman

I'm near to kill this feature request, if you are smart enough to know you need to mix Conda and Pip for a package dependency. You can also type Conda environment file for it.

This feature is not needed

pditommaso avatar Jan 31 '24 16:01 pditommaso

Yeah I'm inclined to agree. For me the addition of the example to the docs solved 80% of the need here already.

ewels avatar Jan 31 '24 22:01 ewels

Thanks for the extra thoughts folks.

I do think it is worth adding the sugar syntax, the effort is small.

I did want to get extra feedback on the design before working on it, because there is a user experience component that I believe was worth discussing from multiple perspectives.
Based on the conversation, option 1. seems to me the best way forward.

I have drafted a branch that implements it: https://github.com/nextflow-io/nextflow/tree/add/conda_pip_inline . There is no PR yet because one method is not implemented yet.

Features:

  1. Uses the pip: prefix to tell conda to install that package via pip
  2. Works for both Nextflow and Wave (small implementation cost for both; asymmetric experience would be confusing to users)
  3. At the time I write this, I have implemented 2 key methods in a new Class file that is NOT in its final location in the codebase tree; it is there to demonstrate the implementation.

marcodelapierre avatar Feb 01 '24 09:02 marcodelapierre

@pditommaso as we were talking about supporting the functionality in Wave CLI. From wave --help:

--conda-package=''    One or more Conda packages used to build the container e.g. bioconda::samtools=1.17

The goal of this issue is then to also support the following for Wave CLI:

--conda-package=pip:numpy

I think that will NOT complicate the implementation. I will post updates here as soon as I open the PR.

marcodelapierre avatar Feb 01 '24 10:02 marcodelapierre

After quick chat on this with Paolo, and also quick testing, we should NOT go ahead with this in-line syntax sugar:

  • some specific implementations introduce artificial syntaxes that may break with future evolution of the conda syntax
  • crucially, from my tests, I have found out that the YAML specification with the extra pip list of packages is fragile versus the order of chosen channels. In fact, to my surprise, the environment will install a package such as numpy from conda-forge instead of pip install, if conda-forge is higher in priority than defaults; in the same env, scipy will instead be installed from pip install (checking on the packages build ids). This makes it unsuitable for hard coding in Nextflow without unnecessarily constraining the user.

TLDR : the pip-conda syntax integration seems fragile and may even change in the future, we should not hard-code it in a syntax sugar, but instead promote usage of a user-created YAML for the purpose (already added in the docs).

Closing.

marcodelapierre avatar Feb 23 '24 07:02 marcodelapierre