Census 2020 example
Created a new example using the 2020 US census dataset.
- [x] Minor updates on the 2010 census example (moved to #459)
I suspect it is due to https://github.com/holoviz-topics/examples/pull/429 but I'm not sure how to resolve it.
You need to re-create the conda environment locally following the contributing guide.
The test file added is a 0.1% sample of the full dataset, but it is still about 8MB in size. I don't know whether that is too large and should be reduced further.
It's still way too large. You should aim for the smallest dataset possible; it's fine if it's just a few KB, as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset so the notebook can run end to end.
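For reference, a hedged sketch of one way to build such a sample with pandas, assuming a categorical `race` column (all names and paths here are hypothetical, not the actual processing script):

```python
# A minimal sketch of building a small but representative sample, assuming
# a pandas DataFrame with a categorical "race" column; names and paths
# are hypothetical.
import pandas as pd

df = pd.read_parquet("census2020_full.parq")

# Sample a tiny fraction of each category so every value the notebook
# expects is present, even the rare ones.
sample = df.groupby("race", group_keys=False, observed=True).apply(
    lambda g: g.sample(frac=0.0001, random_state=42)
)
# Guarantee at least one row per category survives the sampling.
sample = pd.concat([sample, df.groupby("race", observed=True).head(1)])
sample = sample[~sample.index.duplicated()]

sample.to_parquet("data/census2020_sample.parq")
```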
Is there an absolute need to rename the original `census` project to `census_one`? Without doing anything else, this is going to break all the links to its web page and deployment. I would also not call the new one `census_two` but `census2020`.
I imagine renaming the original from `census` to something else makes sense, seeing as there is now more than one census notebook in the examples gallery (and possibly more in the future). However, I tried renaming them to `census2010` and `census2020`, but the `doit validate` step warns that only lowercase characters and underscores are allowed in project names. I wasn't sure ignoring that warning was a good idea, which is why I renamed both to the current names.
> However, I tried renaming them to `census2010` and `census2020`, but the `doit validate` step warns that only lowercase characters and underscores are allowed in project names.
Sounds like a bug in the validation code; something like `census2020` should be allowed.
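If someone wants to patch it, the fix is presumably a small change to the validation pattern. A hedged guess at what the check could look like (the actual validation code may differ):

```python
import re

# Hypothetical: the current pattern presumably only allows [a-z_];
# permitting digits after the first character would accept census2020.
NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")

assert NAME_RE.match("census2020")
assert NAME_RE.match("census_two")
assert not NAME_RE.match("Census2020")
```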
> You need to re-create the conda environment locally following the contributing guide.
Done. Thanks
> It's still way too large. You should aim for the smallest dataset possible; it's fine if it's just a few KB, as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset so the notebook can run end to end.
Reduced it to <1MB now.
Replying to your comment elsewhere:
> Thank you. I'm still in favor of renaming the first one to `census2010` though.
If you intend to rename it, then redirect links have to be set up (a sketch for the page redirects follows this list):
- Full link: https://examples.holoviz.org/gallery/census/census.html to https://examples.holoviz.org/gallery/census2010/census2010.html
- Shortcut link: https://examples.holoviz.org/census to https://examples.holoviz.org/census2010
- Unfortunately, it's not super easy to set a redirect link for the deployment itself (https://census-notebook.holoviz-demo.anaconda.com/notebooks/census.ipynb), so renaming would break it. We have recently broken them all (new subdomain) and no one complained as far as I know so it seems it wouldn't be too bad.
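For the first two, one low-tech option is to generate static redirect stubs at the old paths. A minimal sketch, assuming the built site is served as plain HTML (the output directory and paths are guesses):

```python
# Hypothetical helper: write meta-refresh stubs at the old URLs so existing
# links keep working after a rename. The "builtdocs" output directory and
# the assumption that the site is plain static HTML are both guesses.
from pathlib import Path

REDIRECTS = {
    "gallery/census/census.html":
        "https://examples.holoviz.org/gallery/census2010/census2010.html",
}

STUB = (
    '<!DOCTYPE html><meta charset="utf-8">\n'
    '<meta http-equiv="refresh" content="0; url={url}">\n'
    '<link rel="canonical" href="{url}">\n'
)

for old, url in REDIRECTS.items():
    out = Path("builtdocs") / old
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(STUB.format(url=url))
```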
Alternatively, we could just:
- Change the `title` property in the project YAML to `Census 2010`
- Change the notebook top-level heading to `Census 2010`
> Alternatively, we could just:
> - Change the `title` property in the project YAML to `Census 2010`
> - Change the notebook top-level heading to `Census 2010`
I already did these in this PR. Would that be enough to differentiate both examples eventually?
> Would that be enough to differentiate both examples eventually?
I think so?
> I think so?
OK, I will revert the other renaming then.
My suggestion was that you use the processing script to save it to disk as new data and use that data in the notebook.
> My suggestion was that you use the processing script to save it to disk as new data and use that data in the notebook.
Oh? Alright then. Will do...
@Azaya89 you will need to re-lock the project as the solve is failing:
```
Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

PackagesNotFoundError: The following packages are not available from current channels:

  - libcurl==8.11.0=hbbe4b11_0
```
Not your fault; sometimes conda-forge marks packages as broken (by adding the `broken` label), which means they are no longer available on the main `conda-forge` channel but only on `conda-forge/label/broken`.
https://github.com/conda-forge/admin-requests/pull/1147
It was hard to follow the discussion above, but it looks like the original one is still called `census` rather than `census2010`, and if so, I agree -- let's preserve those links. We'll put a link to `census2020` within `census` so that wherever someone lands they will find both.
Even apart from the file size, the test data seems more complex than necessary. I think you can provide an option to `write_parquet` to store the test data into a single flat .parq file rather than a directory full of separate part files. Looks like the old census didn't do that, but I don't think there was a good reason for that, as e.g. `opensky` uses a single parquet file.
> It was hard to follow the discussion above, but it looks like the original one is still called `census` rather than `census2010`, and if so, I agree -- let's preserve those links.
Correct.
> We'll put a link to `census2020` within `census` so that wherever someone lands they will find both.
OK. That will require a separate PR then.
> Even apart from the file size, the test data seems more complex than necessary. I think you can provide an option to `write_parquet` to store the test data into a single flat .parq file rather than a directory full of separate part files. Looks like the old census didn't do that, but I don't think there was a good reason for that, as e.g. `opensky` uses a single parquet file.
OK. I will do that.
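A minimal sketch of two ways to get a single flat file, assuming the sample is read with `dask.dataframe` and easily fits in memory (paths are hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("data/census2020_sample")  # directory of part files

# Option 1: stay in dask but force a single partition, hence one part file
ddf.repartition(npartitions=1).to_parquet("data/census2020_flat")

# Option 2: a small sample fits in memory, so write one flat file via pandas
ddf.compute().to_parquet("data/census2020_sample.parq")
```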
> OK. That will require a separate PR then.
It'd make sense doing it in this PR.
> OK. That will require a separate PR then.

> It'd make sense doing it in this PR.
This PR is about `census2020`, not the original `census` example. I think Jim is saying that we should put a link in the `census` PR that links to this one.
I don't see any PR open about the original `census` example, which means that the edit to `census` should be made here, so that when this PR is merged the original `census` example points back to this one. If there were a separate PR open for `census` already that could cause conflicts, then sure, you could make that change there, but otherwise here makes the most sense, because it's a change about `census2020` (even if it is a change to the original example).
A few notes for review:
- I'm unable to add the `label` tiles to the dashboard. It works fine on individual plots, but when added to the dashboard plot it causes the whole plot to go white. I have tried to debug the situation but can't figure it out. I tried plotting the labels alone in a separate cell and noticed that they appear blank by default and only show at a certain zoom level, per the attached video. I don't know if that is related to the problem.
https://github.com/user-attachments/assets/058c4403-db95-4c23-9f21-db8ada6c1380
I'll need some help with adding labels to the dashboard plot like the way it is in the standalone plots.
- When you zoom or pan the plot and then interact with the `map_tile` checkbox or adjust the `map_alpha` slider, the plot resets to its original zoom level, losing your current view. I suspect the root of the problem lies in how the plot is updated when you interact with those parameters (`_update_map` is called, which updates `self.base_map`, and this in turn calls `_update_plot`). This is not what I want, but I also don't know how to resolve it (a possible workaround is sketched after the video below).
https://github.com/user-attachments/assets/c47b9061-08da-493c-ae03-62e70068baa9
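For what it's worth, a common cause of this reset is that the update callback returns a brand-new HoloViews object, which forces the pane to re-render from scratch and lose the Bokeh view state. A hedged sketch of the usual workaround, linking the parameter to the option in place with `.apply.opts` so only the option value changes (the class and parameter names below are simplified stand-ins, not the dashboard's actual code):

```python
import holoviews as hv
import panel as pn
import param

hv.extension("bokeh")

class Dashboard(param.Parameterized):
    map_alpha = param.Number(0.7, bounds=(0.0, 1.0))

    def view(self):
        tiles = hv.element.tiles.CartoLight()
        # Linking the parameter via .apply.opts updates the alpha on the
        # existing plot instead of rebuilding it, preserving zoom/pan state.
        return tiles.apply.opts(alpha=self.param.map_alpha)

dash = Dashboard()
pn.Row(dash.param.map_alpha, dash.view()).servable()
```

The same idea should apply to the tile toggle: update options or swap elements via linked parameters/streams rather than returning a rebuilt overlay from the callback.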
@Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent.
> @Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent.
No problem. I can't say it's urgent :)
@jbednar a thought I had after your comment in https://github.com/holoviz-topics/examples/issues/477#issuecomment-2524312011: wouldn't it be more appropriate to extend the original census example instead of adding a new one? I haven't yet run Census 2020 (this PR), but from a quick look at the code it doesn't look so different, so I have the impression that maintaining the two notebooks with the same environment should be possible.
I think there are three things to consider when deciding whether examples should be grouped into a single directory:
- Are the examples thematically linked, such that you'd reasonably want to learn about one once you click on the other? (✔️ )
- Do they share essentially the same environment, such that it makes sense to set up a single environment for both? (✔️ )
- Can they use the same datasets? If different, does adding a new notebook add only minimal file or download size? (❌ ).
Here `census2010` and `census2020` clearly meet the first two criteria, but for the third, they have large and equal file sizes, and most people are only likely to want one or the other. So unless we have a way to download the right data per notebook rather than per example (which we might well end up with as we move away from anaconda_project), combining 2010 and 2020 would cause most users to require double the disk space, which seems problematic.
> they have large and equal file sizes
I launched these two ZIP downloads and was surprised by the difference in file sizes, with 1.3GB for census2010 and 4GB for census2020.
- [ ] We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?
@jbednar I'm asking this question as it is quite clear, to me at least, that examples are meant to be kept up to date, i.e. this site is not a blog. We often hear that we have too many APIs (most recently during the last steering committee meeting on Friday), and we should try to expose our users to best practices only. That means this growing number of examples needs to be maintained; this year we updated about half of them, and it wasn't trivial. So I'm going to challenge adding census 2020 if it's just a slight variation of census 2010 with a slightly different dataset. To illustrate: census 2010 (which was priority 4 on our NF SDG list; we almost got to it) has a bug in the legend and might be able to use hvPlot directly.
> - [ ] We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?
I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million, an additional ~28 million rows (about 9% more). I'm sure that alone is not the reason for the large difference in file size, but I think it plays a role.
> - [ ] We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

> I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million, an additional ~28 million rows (about 9% more). I'm sure that alone is not the reason for the large difference in file size, but I think it plays a role.
That's not enough to make it 3x bigger.
> - [ ] We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

> I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million, an additional ~28 million rows (about 9% more). I'm sure that alone is not the reason for the large difference in file size, but I think it plays a role.

> That's not enough to make it 3x bigger.
Sure!

Edit: I think I have found the issue: the 2010 dataset's 'x' and 'y' columns use the Float32 dtype while the 2020 dataset's use Float64. I will update the 2020 dataset to use Float32 and have it re-uploaded.
Right; we explicitly chose 32 bits (and would use 16 bits if that were an easy option) to reduce file size.
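A minimal sketch of that conversion, assuming the data is processed with `dask.dataframe` (file paths here are hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("census2020_raw.parq")

# Web Mercator coordinates don't need float64 precision at census-block
# resolution, and float32 halves the bytes per value.
ddf = ddf.astype({"x": "float32", "y": "float32"})

ddf.to_parquet("census2020.parq", compression="snappy")
```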
Census 2020 is a special case of people expecting the data to be more up to date. The code should not be different from 2010, so the extra maintenance should not be prohibitive.
The new dataset has been uploaded now. It's about 2.15 GB on disk.
Is the `race` field a categorical? Category should be much smaller than string
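For reference, a hedged sketch of what that conversion could look like with `dask.dataframe` (the paths are hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("census2020.parq")

# Categorical columns are dictionary-encoded in parquet: each distinct
# value is stored once, plus small integer codes, instead of one string
# per row.
ddf["race"] = ddf["race"].astype("category")

ddf.to_parquet("census2020_cat.parq")
```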
So I downloaded the two ZIP files (census 2010 and 2020) and compared them.
- File sizes (zipped): old 1.44GB vs new 1.62GB
- Parquet files in the folder: old 36 vs new 141
- Race values: (see attached output)
- Schemas: (see attached output)
- Compression: both use SNAPPY
So while the file sizes are now closer, there are still a few differences (a way to script this kind of comparison is sketched after this list):
- way more partitions for the new one; what is the impact on performance?
- I'm not sure why the x,y values aren't displayed as Float32 in the new one (read with DuckDB)
- ~~race is not stored as categorical in either of them, I'm not sure why?~~ see messages below
- race values are different: shortened to one letter in the old file while complete in the new one
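A minimal sketch of scripting this kind of comparison with pyarrow (the directory names are hypothetical):

```python
import glob
import pyarrow.parquet as pq

for path in ["census2010.parq", "census2020.parq"]:
    # Count the part files in each dataset directory
    parts = sorted(glob.glob(f"{path}/*.parq*"))
    print(path, "->", len(parts), "part files")

    pf = pq.ParquetFile(parts[0])
    print("  schema:", pf.schema_arrow)
    # Compression codec of the first column chunk of the first row group
    print("  compression:", pf.metadata.row_group(0).column(0).compression)
```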
I'd also note that this example installs fastparquet. I think pyarrow is what ends up being used as the engine, since it's installed (see the lockfile) and has become the default engine in Dask (https://docs.dask.org/en/stable/changelog.html#fastparquet-engine-deprecated). Worth checking, though.
I would still prefer if we had a single census example; the above shows that it's not hard for two similar examples to diverge enough to cause maintenance pain. How about updating the original census to use the 2020 dataset and dropping the 2010 one entirely? Pretty sure the text in the original example still applies to the new dataset.
> Is the `race` field a categorical? Category should be much smaller than string
Yeah (before I changed the 'x' and 'y' types)