Resolve error(s) in looping exercise solution.
Hi, I think there is an error in the last exercise's solution in episode 14: Looping Datasets.
I don't think that this is associated with an open issue, sorry if I'm mistaken.
The last exercise "Comparing Data" has the following solution:
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
dataframe = pd.read_csv(filename)
# extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
# we will split the string using the split method and `_` as our separator,
# retrieve the last string in the list that split returns (`<region>.csv`),
# and then remove the `.csv` extension from that string.
region = filename.split('_')[-1][:-4]
dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()
However, this results in an error at in the call to dataframe.mean(), eg.:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 dataframe.mean()
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11666, in DataFrame.mean(self, axis, skipna, numeric_only, **kwargs)
11658 @doc(make_doc("mean", ndim=2))
11659 def mean(
11660 self,
(...)
11664 **kwargs,
11665 ):
> 11666 result = super().mean(axis, skipna, numeric_only, **kwargs)
11667 if isinstance(result, Series):
11668 result = result.__finalize__(self, method="mean")
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/generic.py:12413, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
12406 def mean(
12407 self,
12408 axis: Axis | None = 0,
(...)
12411 **kwargs,
12412 ) -> Series | float:
> 12413 return self._stat_function(
12414 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
12415 )
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/generic.py:12370, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
12366 nv.validate_func(name, (), kwargs)
12368 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12370 return self._reduce(
12371 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
12372 )
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11535, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
11531 df = df.T
11533 # After possibly _get_data and transposing, we are now in the
11534 # simple case where we can use BlockManager.reduce
> 11535 res = df._mgr.reduce(blk_func)
11536 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
11537 if out_dtype is not None and out.dtype != "boolean":
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/internals/managers.py:1501, in BlockManager.reduce(self, func)
1499 res_blocks: list[Block] = []
1500 for blk in self.blocks:
-> 1501 nbs = blk.reduce(func)
1502 res_blocks.extend(nbs)
1504 index = Index([None]) # placeholder
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/internals/blocks.py:404, in Block.reduce(self, func)
398 @final
399 def reduce(self, func) -> list[Block]:
400 # We will apply the function and reshape the result into a single-row
401 # Block with the same mgr_locs; squeezing will be done at a higher level
402 assert self.ndim == 2
--> 404 result = func(self.values)
406 if self.values.ndim == 1:
407 res_values = result
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11454, in DataFrame._reduce.<locals>.blk_func(values, axis)
11452 return np.array([result])
11453 else:
> 11454 return op(values, axis=axis, skipna=skipna, **kwds)
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
722 if axis is not None and getattr(the_sum, "ndim", False):
723 count = cast(np.ndarray, count)
File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:1686, in _ensure_numeric(x)
1683 inferred = lib.infer_dtype(x)
1684 if inferred in ["string", "mixed"]:
1685 # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1686 raise TypeError(f"Could not convert {x} to numeric")
1687 try:
1688 x = x.astype(np.complex128)
TypeError: Could not convert ['AustraliaNew Zealand'] to numeric
This should be corrected by using index_col="country" when reading the csv.
Secondly, there is another error caused due to the continent column being present in the gapminder_gdp_americas.csv (but not any of the others).
I appreciate that these may be deliberate mistakes, though I suspect not since this is within the solution.
:ok: Pre-flight checks passed :smiley:
This pull request has been checked and contains no modified workflow files, spoofing, or invalid commits.
It should be safe to Approve and Run the workflows that need maintainer approval.
Thank you! This used to be perfectly fine because pandas would silently ignore numeric operations, such as df.mean(), for columns where it did not make sense. Apparently, the approach is now to transform all columns to numeric format first, and then calculate the mean. I don't think this is a good idea, because when this fails, we get the traceback with TypeError we see.
Setting the index column to 'country' works for some files but one of the files (americas) also has the continent column (All values the same). This could be an oversight in the generation of the datafile, since the regional tables now do not follow the same pattern.
So what to do? Your suggestion together with cleaning up the data files would work. On the other hand life is full of unclean data so a more realistic approach could be to filter out the 'gdp' columns.
Something like
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
dataframe = pd.read_csv(filename)
region = filename.split('_')[-1][:-4]
gdpdata = [c for c in dataframe.columns if c.startswith('gdp')]
dataframe[gdpdata].mean().plot(ax=ax, label=region)
plt.legend()
which gives
Whoops, the x labels do not look good here either, they write into each other. We need to handle them somehow.
To summarise: great that you found this error! I myself haven't decided which solution is the best here and appreciate comments from other maintainers.
Olav
Thank you for catching this @davidwilby and taking the time to document and submit a PR for this! Ditto @vahtras for the additional context.
I like the filtering approach and think it's useful. To muddy the waters even more here's two other ways we could do this:
- use dataframe.filter e.g.,
dataframe.filter(like="gdp").mean().plot(...)
to only compute the mean on columns with gdp in the name.
- set the numeric_only kwarg to True when we compute the mean, e.g.,
dataframe.mean(numeric_only=True).plot(...)
to only compute the mean on columns with numeric data
Great options @alee! I did not know about filter...like. Both are fine with me, but I guess numeric_only is closest to the original solution, so I would prefer that in this case.
Glad to be of help!
Depending on how kind you want to be to learners, it may be worth including a note in the exercise description that there's a complication/gotcha/extra challenge in this one. Along with a comment in the solution pointing out the reason for this.
closing this in favor of #674 - thanks!