python-ecology-lesson
Avoid confusing, potentially ambiguous commands for slicing/indexing data frames
In episode 3 (https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset/index.html, actually listed as 4. in https://datacarpentry.org/python-ecology-lesson/ ), the lesson distinguishes .iloc, which accesses entries by position, from .loc, which accesses them by identifier. But it also shows a third possibility, surveys_df[0:3], which selects rows by position.
That command is redundant with surveys_df.iloc[0:3]. It also looks like column access, i.e. df["column_name"], and can be mistaken for selecting a column when the column labels are numbers. On top of that, combining row and column positions in the same way, as in df[0:2,1], will raise an error.
While the command can be useful, and good practices can keep row and column identifiers from being confused, the lesson could instead say that df["col_name"] or df[["list", "of", "col_names"]] will access columns, while df.loc["index"] will access rows. That will keep position-based and identifier-based selection as separate commands for beginners.
# example
import pandas
from numpy.random import randint
arr = randint(0,10, [3,3])
df = pandas.DataFrame(arr)
df[0] # selects the column labelled 0 (here, the first column)
df[0:1] # selects the first row, by position
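To make the proposed separation concrete, here is a minimal sketch. It assumes a hypothetical 3x3 frame with string labels (the labels "a", "b", "c" and "x", "y", "z" are illustrative only and do not come from the lesson), so that labels and positions cannot be confused. The exact exception raised by the mixed form may depend on the pandas and Python versions, so the sketch only checks that it fails.

```python
import pandas as pd

# Hypothetical frame with string labels (illustrative, not from the lesson)
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["a", "b", "c"],
    columns=["x", "y", "z"],
)

# Columns: plain brackets with a label or a list of labels
cols = df[["x", "z"]]

# Rows by identifier: .loc
rows_by_label = df.loc[["a", "b"]]

# Rows (and cells) by position: .iloc
rows_by_pos = df.iloc[0:2]
cell = df.iloc[0, 1]  # row 0, column 1

# The ambiguous form: a bare slice in plain brackets selects rows, not columns
bare_slice = df[0:2]

# Mixing a row slice and a column position in plain brackets fails
try:
    df[0:2, 1]
    mixed_ok = True
except Exception:  # exception type may vary between versions
    mixed_ok = False

print(cols)
print(rows_by_label)
print(rows_by_pos.equals(bare_slice))
print(cell, mixed_ok)
```

With this split, plain brackets are only ever used for columns, and every row selection names its indexer explicitly.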
Hi, @caesoma! Apologies for taking so long to respond.
Very good and valid point! I think the best solution would be to make learners aware of this in the form of an exercise or additional material. Would you be willing to contribute this to the episode?
Hi, sure, I can do that. Let me know what format this exercise should be in.
Could you please draft a PR that modifies the existing text and adds new text and/or an exercise? We could then discuss details such as the format. And please let me know if you need any help along the way.
Sorry for the long delay as well. Finally got around to making the proposed changes.
I'm closing this issue as we worked through and accepted the relevant PR back in April.