Reviving the matplotlib lesson

Open mboisson opened this issue 6 years ago • 5 comments

Hi all, I have been teaching the Python ecology lesson for a few years now, and I have taught the Matplotlib lesson that whole time. It used to be https://github.com/datacarpentry/python-ecology-lesson/blob/gh-pages/06-plotting-with-matplotlib.md
but it was deprecated in https://github.com/datacarpentry/python-ecology-lesson/pull/96

I have looked at the plotnine lesson, and I don't see any value in teaching it; all it does is confuse novices (and experts alike) with a completely different and non-Pythonic syntax. To top it off, the following lesson comes back to Matplotlib, so I question the reason for moving away from Matplotlib in the first place.

Now, I don't mind that some people prefer teaching plotnine to Matplotlib, but the deprecation of the Matplotlib lesson makes it basically impossible to teach unless you happened to save a PDF print of the lesson 3 years ago (like I did).

I would request bringing the old lesson back as an alternative way to teach plotting.

mboisson avatar Jun 07 '19 19:06 mboisson

I'm going to level with you: I don't really care which we teach. I use Matplotlib, base pandas, plotnine, and a bit of Seaborn for various things. But the question isn't as simple as what library we're using. I'm going to suggest clarifying this into three pieces:

  1. Should we teach grammar of graphics?
  2. Is plotnine the best way to do that?
  3. How can we accommodate other plotting libraries?

History

A few years ago, it was noted that what we were doing in the Python lesson was really different than the R lesson. Namely, R was teaching grammar of graphics. We were not. Obviously, there are going to be some things in the Python and R lessons that don't make sense to be the same, because they just aren't the same. But as a general topic, using grammar of graphics vs. other plotting grammars was a choice for us to make. Plotting with GoG, generally, and ggplot, specifically, is pretty dominant in the ecology-evolution-behavior/population biology fields that make up our learners for this lesson. So we switched. That decision has been revisited, but we've largely been in agreement that the GoG framework is worth teaching to the learners we serve.

This is one of those places where the domain specificity in the DC lessons shines through pretty brightly. GoG plotting is really common in my field - there are three or four gg* packages that I use regularly (ex: ggTree for plotting phylogenetic trees). In browsing ecology journals, nearly every plot is some flavor of gg*. Or they made it in Excel, which is not my problem ;) On the other hand, an ecologist who is more on the geology side than the biology side might not use this as much (example in the Atmospheric science lesson here).

Where that leaves us

It's not impossible to teach with the Matplotlib lesson. It's in _extras, and can be substituted for the current lesson. But it hasn't seen a lot of love over the long term, so I'm not sure how up to date it is.

At this point, the dc-py-es lessons use this as a template. So I'm disinclined to make large changes (especially since we're doing a release this week) on the gh-pages branch. But I do think it would be fine if we had multiple versions of this lesson. If someone wanted to make changes and contribute them back in _extras, I think that would be fine, whether those lessons are Matplotlib, Seaborn, or whatever. I've floated the idea in the past of having multiple branches, each with a different lesson pre-rendered, so you could make your workshop website more easily. But somehow that never seems popular ...

tl;dr

This is a pretty weird lesson because we have a variety of pressures: keeping synced with other lessons, responding to how ecologists are working with data and what packages they're using, and accommodating a larger culture of contributors who may not fit that exact mold. So we need to decide what we're teaching, while bearing in mind who we're teaching it to.

wrightaprilm avatar Jun 07 '19 20:06 wrightaprilm

I can see where you are coming from, and it probably makes sense from an ecologist's point of view. However, looking at the current lesson, I don't see any explanation of what "Grammar of Graphics" is (in fact, I had to search online to see what this is about). It is mentioned, but never explained.

From an outsider (not an ecologist) with absolutely no a-priori knowledge of what "Grammar of Graphics" is, all I see is a weird syntax that does not fit with typical Python syntax.

The lesson and the approach could make sense to me, but it would have to actually explain what Grammar of Graphics is as part of the lesson.

I would say that one thing to keep in mind is that, because the dataset is so simple and so intuitive to understand (everybody knows the concepts of animal, species, weight, etc.), this lesson is very useful not only for ecologists, but for anyone who wants to learn about data analysis and visualization with Python.

This is the de-facto lesson that we teach for our researchers, which may come from any number of fields, not only ecology.

mboisson avatar Jun 07 '19 20:06 mboisson

I have been reading on the topic, pondering the ideas, and discussing with an ecologist colleague. I can now see why you chose to go with a "ggplot-style" library, because of Grammar of graphics.

While I can't wrap my head around the syntax that GG imposes, and I will probably stay with matplotlib, I can see value in the GG concept.

I would however urge the authors of the lesson to actually explain what GG is as part of the lesson. Giving examples of a weird syntax of a library that is very different from the rest of the python ecosystem without explaining the rationale behind it is very confusing.

mboisson avatar Jun 07 '19 21:06 mboisson

Ha, the weirdness of ggplot doesn't even ping for me any more. When I started my PhD in evolutionary biology, the ecosystem was split about evenly between R and Python. ggplot was pretty popular, but exploded in popularity after the release of 0.9.0. I've spent my whole career going back and forth between the two, so all the weirdness has diffused through my consciousness.

But you're not the first person who is more experienced in Python to note that the syntax of GoG is, in fact, really weird. I'll tag in an issue once the release is out to invite contributions to add an explanation of what GoG is. I really try to be more of a steward of community contributions than make them myself. This would make a nice first task for someone doing instructor training checkout.

This lesson is such a pedagogical hot potato that I actually led a maintainers' meeting using it as an example of how to get good feedback from the community and synthesize it into actionable plans. Ultimately, we do have to have something that is the "canonical" version that renders to the site. But having other lessons if folks want to use them is fine by me. If you do want to do any work on the Matplotlib lesson in _extras, I would be happy to see contributions go there. That's not an obligation, though. I personally maintain a fork of this lesson for working with taxonomic data for evolutionary analysis.

wrightaprilm avatar Jun 07 '19 21:06 wrightaprilm

Interesting discussion and valid remark about the lack of a proper GoG explanation.

I've been supporting the introduction of the GoG concept as it is indeed well known in the ecological community (due to the popularity of ggplot) and it is - also in general - a very powerful concept. It provides a more declarative approach to creating plots (e.g. facet_wrap versus for ax in axes: ax.plot()) based on a tidy (R community slang for long format or denormalized) data representation.
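To make the facet_wrap versus for-loop contrast concrete, here is a minimal sketch rather than anything taken from the lesson itself. The `surveys` values are made up for illustration, and the function names `plot_imperative` and `plot_declarative` are hypothetical. The column names are borrowed from the surveys dataset, and the plotnine half assumes plotnine is installed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tidy (long-format) data: one row per observation,
# one column per variable. The numbers are invented for illustration.
surveys = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "DO", "DO", "DO", "PB", "PB", "PB"],
    "weight": [40, 43, 45, 48, 50, 52, 28, 30, 33],
    "hindfoot_length": [35, 36, 37, 35, 36, 36, 25, 26, 27],
})


def plot_imperative(df):
    """Matplotlib: build the grid of panels, then loop and fill each one."""
    groups = list(df.groupby("species_id"))
    fig, axes = plt.subplots(
        1, len(groups), sharex=True, sharey=True, squeeze=False
    )
    for ax, (species, subset) in zip(axes.flat, groups):
        ax.scatter(subset["weight"], subset["hindfoot_length"])
        ax.set_title(species)
        ax.set_xlabel("weight")
    axes.flat[0].set_ylabel("hindfoot_length")
    return fig


def plot_declarative(df):
    """plotnine: state the mapping and the faceting; no explicit loop."""
    # Imported lazily so the Matplotlib half runs without plotnine installed.
    from plotnine import aes, facet_wrap, geom_point, ggplot

    return (
        ggplot(df, aes(x="weight", y="hindfoot_length"))
        + geom_point()
        + facet_wrap("species_id")
    )


fig = plot_imperative(surveys)
```

Both functions take the same tidy table. The Matplotlib version spells out how the panels are created and filled, while the plotnine version only declares which column defines the panels.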

In another course, we try to provide some guidance on when GoG provides added value:

  • When your data consists of only 1 categorical variable, LOW added value
  • When working with timeseries data from sensors or continuous logging, LOW added value
  • When working with different experiments, different conditions, (factorial) experimental designs, HIGH added value

As many (ecological) researchers are facing the latter kind of data set, it makes sense for them.

Still, I agree that the direct adoption of the R ggplot syntax in a Python context can feel odd. We picked plotnine as it still integrates with Matplotlib (it returns a Matplotlib Figure), hence it provides the convenience of GoG with the customization possibilities of Matplotlib. altair provides a more Python-oriented syntax for the GoG concept, but has no link with Matplotlib anymore.

So, I agree: providing the rationale would be useful. But maybe we should add the rationale to the instructor notes rather than to the participant-facing lesson? Understanding the background and the decision is important context for the instructors. But providing too much background on GoG to the participants could lead to an overload of information... For the participants, getting things done is what matters, and I would rather explain the link with the data representation (cf. the data representation carpentry lesson) and define when GoG provides added value. Would that make sense?

stijnvanhoey avatar Jun 08 '19 09:06 stijnvanhoey

Hi all. Given that there hasn't been any further discussion in this thread for some years, I am closing this issue.

btovar avatar May 19 '23 11:05 btovar