python-novice-gapminder

Added episode on Conda

Open davidrpugh opened this issue 6 years ago • 13 comments

For the version of Plotting and Programming in Python that I teach to the KAUST community, I need to introduce Conda as a tool for managing Python environments. We are encouraging our users to use Conda to install project-specific software stacks in order to...

  • increase the reproducibility of their research projects
  • increase the portability (i.e., ability to move from laptop/workstation to clusters) of their research projects

...adding an episode on Conda does increase the cognitive load for learners. I think the benefits for my target audience outweigh the additional burden but I am aware that this may not be true for a more general target audience.

I am opening this PR in order to get feedback and in case the wider community wishes to incorporate the episode into the lesson.

davidrpugh avatar Jun 23 '19 07:06 davidrpugh

@alee @vahtras @ntmoore @souravsingh I just fixed the merge conflicts on this PR. When you get a chance some feedback on this PR would be appreciated.

davidrpugh avatar Jun 24 '19 11:06 davidrpugh

Hi @davidrpugh, I think this is interesting and useful material - particularly in the context of the git/version-control/software-should-be-audit-able lesson. I don't know if I can imagine covering this in the Python section of a standard two-day SWC workshop though. Where do you discuss this in the KAUST workshop? For someone who's never written Python before this doesn't seem like it should be lesson 00. Cool material though!

ntmoore avatar Jun 24 '19 16:06 ntmoore

@ntmoore thanks for the feedback! First thing I should mention is that the Conda episode is a condensed version of a half-day Introduction to Conda for (Data) Scientists lesson that I have developed.

When I teach my KAUST Introduction to Python lesson, managing environments using Conda is the first thing that I teach. My experience teaching the Introduction to Python lesson using Anaconda at KAUST last year was that learners have a difficult time with the following:

  1. shifting to project specific software installs and workflows as Anaconda encourages users to install everything into a single "global" environment;
  2. shifting their Python workflows to our university cluster resources (even though Anaconda is installed on the clusters).

Typically the difficulties with moving projects to remote clusters stem from the fact that learners have additional packages that are not included in the default Anaconda install that they need for a project (and have installed on their machine using Conda or maybe pip). At this point learners need detailed Conda training and I developed my half-day Conda course to meet this need.

Once I developed the half-day course, I felt that I could condense the really critical commands and workflows into a single episode, which I offer at the outset of my Introduction to Python lesson in order to make sure that learners get started with best practices for their Python workflows. The extra cognitive load of learning a little bit of Conda is offset by the benefit of learners leaving with improved Python workflows that better enable reproducibility and portability of their work.

Another special feature of my audience at KAUST is that I have repeated interaction with learners over a period of at least one to two years so my spending time to teach best practices up front reduces future work for me. I recognize that this is not the case in general for instructors of these lessons.

davidrpugh avatar Jun 25 '19 06:06 davidrpugh

@alee do you have any thoughts to share? If you concur with @ntmoore then I will merge these changes into my local fork and close this PR.

davidrpugh avatar Jul 01 '19 07:07 davidrpugh

Sorry for the delay - I'm a little slow on the uptake due to other projects at the moment and have been punting on this deliberately as I wanted to give it a careful look. I think we'd want to have consensus from the other maintainers, instructors, & community members as this is a fairly major change though I'm generally in favor of conda. I'll try to get back to you ASAP!

alee avatar Jul 01 '19 07:07 alee

I understand and no worries: it is a substantial change!

davidrpugh avatar Jul 01 '19 08:07 davidrpugh

I have not been active in the discussion so far. I was skeptical at first, but now I find it to be a valuable addition! /Olav

vahtras avatar Jul 01 '19 08:07 vahtras

@vahtras good to know that you found it a valuable addition. Environment and package management using Conda is well suited as an introductory topic when the audience is primarily researchers looking to use Python in their research going forward. I don't know if I would cover Conda if I was teaching a generic introduction to Python course.

davidrpugh avatar Jul 01 '19 10:07 davidrpugh

Thanks @davidrpugh for the very thoughtful contribution! Here are a couple thoughts:

  • In the "Package management" section, for the transition from OS package manager to project-specific package managers, it doesn't seem accurate to claim that OS package managers solve a more general problem. As you point out while introducing Conda, Conda is capable of managing packages beyond just Python (unlike pip), and in addition manages environments. So maybe emphasizing the environment isolation capabilities of these environment+package managers would be better.
  • Maybe emphasize conda's ability to manage non-python packages as well? As an astrophysicist, there are lots of times when I run into packages (and their python wrappers) that require non-python packages like openmpi, LAPACK, openssl etc.
  • It seems like a reasonable thing for the learner to ask "so how do I know what I can install with conda?" I'm not entirely sure if this question should be addressed or how it should be addressed, though.
  • It is perhaps useful to point out that pipenv and (previously) pip+virtualenv serve similar purposes to conda so that the learners would be able to connect this lesson to the computing platform they are on in case the platform uses one of the other solutions to python package and environment management?

yupinghuang avatar Jul 05 '19 22:07 yupinghuang

@yupinghuang Thanks for your substantive feedback! This episode on Conda is a reduction of a longer half-day lesson on Conda that I have developed. Most of your comments are applicable to that lesson as well.

> In the "Package management" section, for the transition from OS package manager to project-specific package managers, it doesn't seem accurate to claim that OS package managers solve a more **general problem**. As you point out while introducing Conda, Conda is capable of managing packages beyond just Python (unlike pip), and in addition manages environments. So maybe emphasizing the environment isolation capabilities of these environment+package managers would be better.

I will work to clean up my wording in this section. OS package managers are not really solving a more general problem than Conda; if anything, it could be argued the other way around! Conda manages environments and packages (just not OS-specific packages).

> Maybe emphasize conda's ability to manage non-python packages as well? As an astrophysicist, there are lots of times when I run into packages (and their python wrappers) that require non-python packages like openmpi, LAPACK, openssl etc.

Yes! This is a major benefit of using Conda over pip et al.

> It seems like a reasonable thing for the learner to ask "so how do I know what I can install with conda?" I'm not entirely sure if this question should be addressed or how it should be addressed, though.

A partial answer might be to use the conda search command to look for existing Conda packages. A full answer would be "pretty much anything if you are willing to learn how to build packages yourself", followed by an episode on using Conda-Build. At some point I will add an episode on Conda-Build to my Conda lesson, but anything beyond conda search is out of scope for this episode.
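As a rough sketch of that partial answer, the snippet below builds and (only if conda is installed) runs a `conda search` call from Python. The helper names `conda_search_command` and `search` are made up for illustration and are not part of the lesson; the `--channel` option is the standard way to point the search at a specific channel such as conda-forge.

```python
import shutil
import subprocess


def conda_search_command(package, channel=None):
    """Build the argument list for `conda search <package>`."""
    command = ["conda", "search", package]
    if channel is not None:
        # Also look in the named channel, e.g. conda-forge.
        command += ["--channel", channel]
    return command


def search(package, channel=None):
    """Run the search if conda is on PATH; return its text output, else None."""
    if shutil.which("conda") is None:
        return None  # conda is not installed on this machine
    result = subprocess.run(
        conda_search_command(package, channel),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For example, `search("openmpi", channel="conda-forge")` would list the available builds of a non-Python package, which is one way for a learner to check whether something can be installed with conda.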

> It is perhaps useful to point out that pipenv and (previously) pip+virtualenv serve similar purposes to conda so that the learners would be able to connect this lesson to the computing platform they are on in case the platform uses one of the other solutions to python package and environment management?

This point I don't quite understand. Conda is cross-platform and should work on any OS to which the learners might have access. Did you mean something else?

davidrpugh avatar Jul 06 '19 04:07 davidrpugh

@davidrpugh Let me elaborate on my last point. I'd like to think that the goal of this lesson is to teach dependency and environment management via conda. Conda is one of the very popular tools, but the skillset can be tool-independent. I suggested pipenv and virtualenv because they are quite commonplace too, so that when the learner runs into those some time down their path, they'd be able to connect the dots.

The scenario I was describing is as follows. Say a scientist is doing their computation on an HPC cluster and the HPC admins recommend using the Python interpreter that comes pre-installed (and maybe optimized). If said scientist only needs Python packages in their work, then pipenv or virtualenv would be a better choice for them than conda, since it's easier to get those to work with an existing Python installation. It is not always trivial to install Python, as much as Miniconda tries to automate it.
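A minimal sketch of that scenario, using the standard library's `venv` module as a stand-in for virtualenv or pipenv (a substitution on my part, chosen because it needs no extra install). The function name `create_project_env` and the temporary-directory default are assumptions for illustration only.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(env_dir=None, with_pip=True):
    """Create a virtual environment on top of the running interpreter.

    Returns the path to the environment's pyvenv.cfg file, which records
    the location of the base Python installation the environment reuses.
    """
    if env_dir is None:
        # No directory given: use a fresh temporary one for demonstration.
        env_dir = tempfile.mkdtemp(prefix="project-env-")
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"
```

Run with the cluster's pre-installed interpreter, this leaves that Python in place and only adds an isolated directory for project-specific packages.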

yupinghuang avatar Jul 11 '19 18:07 yupinghuang

> @davidrpugh Let me elaborate on my last point. I'd like to think that the goal of this lesson is to teach dependency and environment management via conda. Conda is one of the very popular tools, but the skillset can be tool-independent. I suggested pipenv and virtualenv because they are quite commonplace too, so that when the learner runs into those some time down their path, they'd be able to connect the dots.

I will add a callout box mentioning the existence of pip and friends as alternatives for managing environments and packages for pure Python projects. Learners should at least be aware of the existence of these tools because they will encounter them on SO or Google when trouble-shooting.

By focusing entirely on Conda I am being intentionally opinionated: I do think that conda is the better environment and package management tool for our target audience of scientific researchers, data scientists, etc., and I don't want to add cognitive load by introducing another tool that is less well suited to learners' use cases.

> The scenario I was describing is as follows. Say a scientist is doing their computation on an HPC cluster and the HPC admins recommend using the Python interpreter that comes pre-installed (and maybe optimized). If said scientist only needs Python packages in their work, then pipenv or virtualenv would be a better choice for them than conda, since it's easier to get those to work with an existing Python installation. It is not always trivial to install Python, as much as Miniconda tries to automate it.

Installation of Python via Miniconda has never been an issue in any of the Carpentries workshops that I have taught. Sometimes learners show up without having bothered to install Miniconda, but that is a separate issue.

I am a staff scientist at KAUST, where we have several top HPC machines, and the scenario you describe was one of the motivating use cases for my Introduction to Conda for (Data) Scientists lesson.

Power users of our HPC clusters are traditional HPC users who work almost entirely on the cluster (and never on their local machines). These users make heavy use of the optimized Python installed and maintained by our staff. These users never really have an option to use pip because if they need a package for their research then they will request our staff to build it from source in order to get best performance. This creates a lot of work for our staff but fortunately(?) such users are few in number.

The typical user of cluster facilities at KAUST is not a traditional HPC user. The median user has a workflow that works fine on his/her laptop or workstation most of the time, but occasionally needs to be scaled up by porting it to the cluster. Portability of the workflow is the key concern for these users. Using conda, they can create an environment on their local machine that can be easily reproduced on the remote cluster. The performance of the numerical Python packages downloaded via conda, while not as good as hand-tuned packages, is typically superior to the same packages installed via pip. Since Conda doesn't require elevated privileges to install, users can install Miniconda and manage their entire Python app stack themselves (which users generally prefer, and which frees up staff time to work on other projects).
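To make the laptop-to-cluster workflow concrete, here is a small sketch that renders a Conda environment file and the command used to recreate the environment from it. The project name, channel, and package list are placeholders I made up, and the helper names `environment_yaml` and `create_command` are hypothetical; only the `environment.yml` layout and `conda env create --file` are standard Conda.

```python
def environment_yaml(name, channels, dependencies):
    """Render a minimal Conda environment file as text."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


def create_command(environment_file="environment.yml"):
    """Argument list for `conda env create --file <environment_file>`."""
    return ["conda", "env", "create", "--file", environment_file]


# Hypothetical project: the name, channel, and packages are placeholders.
EXAMPLE = environment_yaml(
    name="gapminder-analysis",
    channels=["conda-forge"],
    dependencies=["python=3.7", "pandas", "matplotlib"],
)
```

The idea is that the same file, kept alongside the project, is used to run the same create command on both the local machine and the cluster.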

We have started encouraging our users to use conda for their Python needs; only if performance becomes an issue does it make sense to consider moving the workflow to a hand-optimized Python install.

Thanks again for your feedback!

davidrpugh avatar Jul 12 '19 06:07 davidrpugh

When teaching this course, I've used JupyterHub, which provides Jupyter Notebooks and the data files. Everything is in a web interface. It is a great way to get started; however, learners will have to move to a different environment for more effective programming. I just mention the different options for running Python at the beginning and get right into programming with JupyterHub.

daviddelene avatar Sep 22 '20 12:09 daviddelene

Closing inactive PRs with unresolved conflicts before transitioning to Carpentries Workbench. You are welcome to submit an updated PR after the transition has been completed.

vahtras avatar Apr 14 '23 11:04 vahtras