
Census 2020 example

Open · Azaya89 opened this issue 1 year ago • 41 comments

Created a new example using the 2020 US census dataset.

  • [x] minor updates on the 2010 census example - Moved to #459

Azaya89 avatar Oct 17 '24 00:10 Azaya89

I suspect it is due to https://github.com/holoviz-topics/examples/pull/429 but I'm not sure how to resolve it.

You need to re-create the conda environment locally following the contributing guide.

The test file added is a 0.1% sample of the full dataset but it is still about 8MB in size. I don't know if that is too large and should be reduced further.

It's still way too large. You should aim for the minimum dataset size possible; it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset to let the notebook run entirely.

maximlt avatar Oct 17 '24 07:10 maximlt
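For concreteness, here is a minimal sketch of one way to build such a representative sample, assuming the full data loads as a pandas DataFrame with a categorical race column (the paths, column name, and sampling fraction are illustrative, not the PR's actual processing script):

```python
import pandas as pd

df = pd.read_parquet("census2020.parq")

# Sample within each race group so that every category the notebook code
# expects is present, even in a very small file.
sample = df.groupby("race", observed=True).sample(frac=0.0001, random_state=42)
sample.to_parquet("test_data/census2020_sample.parq")
```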

Is there an absolute need to rename the original census project census_one? Without doing anything else, this is going to break all the links to its web page and deployment.

I would also not call the new one census_two but census2020.

I imagine renaming the original from census to something else makes sense, seeing as there is now more than one census notebook in the examples gallery (and possibly more in the future). However, I tried renaming both to census2010 and census2020, but the doit validate step warns that only lowercase characters and underscores are allowed in the name. I wasn't sure ignoring that warning was ideal, which is why I renamed both to the current names.

Azaya89 avatar Oct 17 '24 09:10 Azaya89

However, I tried renaming both to census2010 and census2020, but the doit validate step warns that only lowercase characters and underscores are allowed in the name

Sounds like a bug in the validation code, something like census2020 should be allowed.

maximlt avatar Oct 17 '24 10:10 maximlt
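For illustration, a hypothetical fix, assuming the validation boils down to a name regex (the actual check lives in the repo's doit tasks and may look different):

```python
import re

# Allow digits in addition to lowercase letters and underscores.
NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def is_valid_project_name(name: str) -> bool:
    return bool(NAME_RE.match(name))

assert is_valid_project_name("census2020")
assert not is_valid_project_name("Census-2020")
```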

You need to re-create the conda environment locally following the contributing guide.

Done. Thanks

It's still way too large. You should aim for the minimum dataset size possible; it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset to let the notebook run entirely.

Reduced it to <1MB now.

Azaya89 avatar Oct 17 '24 11:10 Azaya89

Replying to your comment elsewhere:

Thank you. I'm still in favor of renaming the first one to census2010 though.

If you intend to rename it, then redirect links have to be set up:

  • Full link: https://examples.holoviz.org/gallery/census/census.html to https://examples.holoviz.org/gallery/census2010/census2010.html
  • Shortcut link: https://examples.holoviz.org/census to https://examples.holoviz.org/census2010
  • Unfortunately, it's not super easy to set a redirect link for the deployment itself (https://census-notebook.holoviz-demo.anaconda.com/notebooks/census.ipynb), so renaming would break it. We have recently broken them all (new subdomain) and no one complained as far as I know, so it seems it wouldn't be too bad.

Alternatively, we could just:

  • Change the title property in the project YAML to Census 2010
  • Change the notebook top-level heading to Census 2010

maximlt avatar Oct 21 '24 06:10 maximlt

Alternatively, we could just:

  • Change the title property in the project YAML to Census 2010
  • Change the notebook top-level heading to Census 2010

I already did these in this PR. Would that be enough to differentiate both examples eventually?

Azaya89 avatar Oct 21 '24 11:10 Azaya89

Would that be enough to differentiate both examples eventually?

I think so?

maximlt avatar Oct 21 '24 11:10 maximlt

I think so?

OK. I will revert the other renaming then.

Azaya89 avatar Oct 21 '24 14:10 Azaya89

My suggestion was that you use the processing script to save it to disk as new data and use that data in the notebook.

hoxbro avatar Nov 07 '24 07:11 hoxbro

My suggestion was that you use the processing script to save it to disk as new data and use that data in the notebook.

Oh? Alright then. Will do...

Azaya89 avatar Nov 07 '24 10:11 Azaya89

@Azaya89 you will need to re-lock the project as the solve is failing:

Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

PackagesNotFoundError: The following packages are not available from current channels:

  - libcurl==8.11.0=hbbe4b11_0

Not your fault; sometimes conda-forge marks packages as broken (adding the broken label on conda-forge), which means they are no longer available on the main conda-forge channel but only on conda-forge/label/broken.

https://github.com/conda-forge/admin-requests/pull/1147

maximlt avatar Nov 15 '24 05:11 maximlt

It was hard to follow the discussion above, but it looks like the original one is still called census rather than census2010, and if so, I agree -- let's preserve those links. We'll put a link to census2020 within census so that wherever someone lands they will find both.

Even apart from the file size, the test data seems more complex than necessary. I think you can provide an option to write_parquet to store the test data into a single flat .parq file rather than a directory full of separate part files. Looks like the old census didn't do that, but I don't think there was a good reason for that, as e.g. opensky uses a single parquet file.

jbednar avatar Dec 02 '24 17:12 jbednar
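One way to get a single flat file, assuming the reduced sample fits in memory (paths illustrative): read the partitioned data with Dask, then let pandas write the one file.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("census2020/")         # directory full of part files
ddf.compute().to_parquet("census2020.parq")  # pandas writes one flat file
```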

It was hard to follow the discussion above, but it looks like the original one is still called census rather than census2010, and if so, I agree -- let's preserve those links.

Correct.

We'll put a link to census2020 within census so that wherever someone lands they will find both.

OK. That will require a separate PR then.

Even apart from the file size, the test data seems more complex than necessary. I think you can provide an option to write_parquet to store the test data into a single flat .parq file rather than a directory full of separate part files. Looks like the old census didn't do that, but I don't think there was a good reason for that, as e.g. opensky uses a single parquet file.

OK. I will do that.

Azaya89 avatar Dec 02 '24 20:12 Azaya89

OK. That will require a separate PR then.

It'd make sense doing it in this PR.

maximlt avatar Dec 02 '24 20:12 maximlt

OK. That will require a separate PR then.

It'd make sense doing it in this PR.

This PR is about census2020, not the original census example. I think Jim is saying that we should put a link in the census example (via a separate PR) that links to this one.

Azaya89 avatar Dec 03 '24 09:12 Azaya89

I don't see any PR open about the original census example, which means that the edit to census should be made here, so that when this PR is merged the original census example points back to this one. If there were a separate PR open for census already that could cause conflicts, then sure, you could make that change there, but otherwise here makes the most sense because it's a change about census2020 (even if it is a change to the original example).

jbednar avatar Dec 03 '24 20:12 jbednar

A few notes for review:

  1. I'm unable to add the label tiles to the dashboard. They work fine on individual plots, but when added to the dashboard plot they cause the whole plot to go white. I have tried to debug this but haven't been able to figure it out. When I plot the labels alone in a separate cell, they appear blank by default and only show at a certain zoom level, per the attached video. I don't know if that is related to the problem.

https://github.com/user-attachments/assets/058c4403-db95-4c23-9f21-db8ada6c1380

I'll need some help adding labels to the dashboard plot the way they appear in the standalone plots.

  2. When you zoom or pan the plot and then interact with the map_tile checkbox or adjust the map_alpha slider, the plot resets to its original zoom level, losing your current view. I suspect the root of the problem lies in how the plot is updated when you interact with those parameters (_update_map is called, which updates self.base_map, and this in turn calls _update_plot). This is not what I want, but I also don't know how to resolve it (a possible workaround is sketched after this comment).

https://github.com/user-attachments/assets/c47b9061-08da-493c-ae03-62e70068baa9

Azaya89 avatar Dec 05 '24 19:12 Azaya89
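For the second issue, one possible direction, sketched under assumptions rather than taken from the dashboard's actual code (the class and parameter names only mirror those described above): instead of rebuilding the overlay in a callback, keep a single tiles element and link the parameter to its alpha option via HoloViews' .apply.opts, which updates the live plot in place and therefore preserves the current zoom and pan range.

```python
import holoviews as hv
import param
from holoviews.element.tiles import EsriImagery

hv.extension("bokeh")

class Dashboard(param.Parameterized):
    map_alpha = param.Number(0.5, bounds=(0.0, 1.0))

    def view(self):
        tiles = EsriImagery()
        points = hv.Points([(0, 0), (1e6, 1e6)])  # stand-in for the census layer
        # .apply.opts updates the alpha option on the rendered plot instead of
        # replacing the plot object, so the view range is kept.
        return tiles.apply.opts(alpha=self.param.map_alpha) * points
```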

@Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent.

maximlt avatar Dec 05 '24 22:12 maximlt

@Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent.

No problem. I can't say it's urgent :)

Azaya89 avatar Dec 05 '24 22:12 Azaya89

@jbednar a thought I had after your comment in https://github.com/holoviz-topics/examples/issues/477#issuecomment-2524312011: wouldn't it be more appropriate to extend the original census example instead of adding a new one? I haven't yet run Census 2020 (this PR), but from a quick look at the code it doesn't look so different, so I have the impression that maintaining the two notebooks with the same environment should be possible.

maximlt avatar Dec 08 '24 11:12 maximlt

I think there are three things to consider when deciding whether examples should be grouped into a single directory:

  1. Are the examples thematically linked, such that you'd reasonably want to learn about one once you click on the other? (✔️ )
  2. Do they share essentially the same environment, such that it makes sense to set up a single environment for both? (✔️ )
  3. Can they use the same datasets? If different, does adding a new notebook add only minimal file or download size? (❌ ).

Here census2010 and census2020 clearly meet the first two criteria, but for 3, they have large and equal file sizes, and most people are likely to want only one or the other. So unless we have a way to download the right data per notebook rather than per example (which we might well end up with as we move away from anaconda_project), combining 2010 and 2020 would cause most users to require double the disk space, which seems problematic.

jbednar avatar Dec 09 '24 00:12 jbednar

they have large and equal file sizes

I launched these two ZIP downloads and was surprised by the difference in file sizes, with 1.3GB for census2010 and 4GB for census2020.

  • [ ] We need to check why the files differ so much, are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

@jbednar I'm asking this question as it is quite clear, to me at least, that examples are meant to be kept up-to-date, i.e. this site is not a blog. We often hear that we have too many APIs (most recently during the steering committee meeting on Friday), so we should try to expose our users to best practices only. It means that this growing number of examples needs to be maintained; this year we have updated about half of them, and it wasn't trivial. So I'm going to challenge adding census 2020 if it's just a slight variation of census 2010 with a slightly different dataset. To illustrate this, census 2010 (which was priority 4 on our NF SDG list; we almost got to it) has a bug in the legend and might be able to use hvPlot directly.

[screenshot: legend bug in the census 2010 example]

maximlt avatar Dec 09 '24 08:12 maximlt

  • [ ] We need to check why the files differ so much, are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million. That's an additional ~28 million rows. I'm sure that alone is not the reason for the large difference in file size, but I think it also plays a role?

Azaya89 avatar Dec 09 '24 15:12 Azaya89

  • [ ] We need to check why the files differ so much, are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million. That's an additional ~28 million rows. I'm sure that alone is not the reason for the large difference in file size, but I think it also plays a role?

That's not enough to make it 3x bigger.

maximlt avatar Dec 09 '24 15:12 maximlt
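To make that concrete, a quick back-of-envelope with the figures quoted in this thread:

```python
rows_2010, rows_2020 = 306e6, 334e6
print(rows_2020 / rows_2010)  # ~1.09: row count alone explains ~9% growth
print(4.0 / 1.3)              # ~3.1: observed ratio of the zipped downloads
# The rest of the difference has to come from how the data is stored
# (dtypes, string encoding, partitioning).
```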

  • [ ] We need to check why the files differ so much, are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million. That's an additional ~28 million rows. I'm sure that alone is not the reason for the large difference in file size, but I think it also plays a role?

That's not enough to make it 3x bigger.

Sure!

Edit: I think I have found the issue: the 2010 dataset's 'x' and 'y' columns use the Float32 dtype while the 2020 dataset's use Float64. I will update the 2020 dataset to use Float32 and have it re-uploaded.

Azaya89 avatar Dec 09 '24 16:12 Azaya89
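A minimal sketch of that dtype fix, assuming the 2020 data is reprocessed with dask.dataframe (paths and column names illustrative):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("census2020_raw/")
ddf = ddf.astype({"x": "float32", "y": "float32"})  # halves the x/y storage
ddf.to_parquet("census2020/", compression="snappy")
```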

Right; we explicitly chose 32 bits (and would use 16 bits if that were an easy option) to reduce file size.

Census 2020 is a special case, in that people expect the data to be up to date. The code should not be different from 2010, so the extra maintenance should not be prohibitive.

jbednar avatar Dec 09 '24 18:12 jbednar

The new dataset has been uploaded now. It's about 2.15 GB on disk.

Azaya89 avatar Dec 10 '24 10:12 Azaya89

Is the race field a categorical? Category should be much smaller than string.

jbednar avatar Dec 10 '24 12:12 jbednar

So I downloaded the two ZIP files (census 2010 and 2020) and compared them.

  • File sizes (zipped): old 1.44GB vs new 1.62GB
  • Parquet files in the folder: old 36 vs new 141
  • Race values: [screenshot]
  • Schemas: [screenshot]
  • Compression: both use SNAPPY

So while the file sizes are now closer, there are still a few differences:

  • way more partitions for the new one; what is the impact on performance?
  • I'm not sure why the x,y values aren't displayed as Float32 in the new one (read with DuckDB)
  • ~~race is not stored as categorical in either of them, I'm not sure why?~~ see messages below
  • race values are different: shortened to one letter in the old file, complete in the new one

I'd also note that this example installs fastparquet. I think that in the end it's pyarrow that is used as the engine, since it's installed (see the lockfile) and has become the default engine in Dask (https://docs.dask.org/en/stable/changelog.html#fastparquet-engine-deprecated). Worth checking though.


I would still prefer if we had a single census example; the above shows that it's not hard for two similar examples to diverge enough to cause maintenance pain. How about updating the original census to use the 2020 dataset and dropping the 2010 one entirely? Pretty sure the text in the original example still applies to the new dataset.

maximlt avatar Dec 10 '24 14:12 maximlt
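For reference, checks like the ones above can be reproduced with pyarrow (the file name is illustrative):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("census2020/part.0.parquet")
print(pf.schema_arrow)                                 # column names and dtypes
print(pf.metadata.num_row_groups)                      # row groups in this part file
print(pf.metadata.row_group(0).column(0).compression)  # e.g. SNAPPY
```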

Is the race field a categorical? Category should be much smaller than string

Yeah (before I changed the 'x' and 'y' types):

[screenshot: schema showing race stored as categorical]

Azaya89 avatar Dec 10 '24 15:12 Azaya89