justice40-tool
As a data developer, I want to understand how much tile space is saved by short names.
Description
Run an experiment with and without short names on the tiles to determine whether this step is really saving us much space. If it isn't, writing to short names and then decoding them on the front end is an unnecessary step.
Generate the tiles and check their file size.
@widal001 tagging myself on this so I remember to follow up on it
@lucasmbrown-usds and @esfoobar-usds Thanks for your patience as I carved out some time to dig into this. Before running some tests I wanted to make sure I understood the goal and the portion of the codebase related to this issue. I've summarized my current best understanding of the objective and of how to validate this question, but would love to confirm with you all that my interpretation of how to conduct this test and evaluate the results is correct:
Summary of Current Process
- In an attempt to reduce the file size of the tiles generated by etl_score_geo.py, we are mapping the default "long" name for each column to a corresponding "short" name that will be stored in the tile data.
- The mapping of long name to short name is captured by the TILE_SCORE_COLUMNS constant in etl/score/constants.py
- And the actual renaming of the tiles from long to short happens in this step of GeoScoreETL.write_esri_shapefile()
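To make the round trip concrete, here is a minimal sketch of the encode/decode idea using plain dictionaries. The column names and the `rename_keys` helper are hypothetical and are not taken from TILE_SCORE_COLUMNS; the real ETL renames DataFrame columns rather than dict keys.

```python
import json

# Hypothetical long -> short mapping; the real one lives in TILE_SCORE_COLUMNS.
LONG_TO_SHORT = {
    "Total population": "TP",
    "Score percentile": "SP",
}
# Inverse mapping the front end would need in order to decode tile properties.
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def rename_keys(properties: dict, mapping: dict) -> dict:
    """Return a copy of properties with keys renamed; unmapped keys pass through."""
    return {mapping.get(key, key): value for key, value in properties.items()}


feature = {"Total population": 1200, "Score percentile": 0.87}
encoded = rename_keys(feature, LONG_TO_SHORT)   # what would be stored in the tile
decoded = rename_keys(encoded, SHORT_TO_LONG)   # what the front end would rebuild

# Serialized size of one feature's properties under each naming scheme.
long_bytes = len(json.dumps(feature).encode("utf-8"))
short_bytes = len(json.dumps(encoded).encode("utf-8"))
```

The JSON byte counts only show the per-feature overhead in an uncompressed text format. The tile format may store property keys differently, so this is not a prediction of the experiment's outcome.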
Question to Test
Would preserving the "long" name of each column when we generate the tiles meaningfully increase the size of the output files, or prevent the tiles from building at all?
Testing Methodology
- Add a feature flag shorten_column_names that when set to false will skip the .rename(columns=renaming_map) when writing self.geojson_score_usa_high to self.SCORE_SHP_FILE
- Run two tests:
  - Control: One with the shorten_column_names flag set to true to get a baseline file size with short names
  - Experimental: A second with the shorten_column_names flag set to false to establish deviation from the baseline
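One possible shape for the flag is sketched below. It assumes the flag is read from an environment variable, which is only one option, and both `read_flag` and `effective_renaming_map` are hypothetical helpers rather than existing functions in the repo. The idea is to pass an empty mapping to the rename call when the flag is off, which leaves the long names untouched.

```python
import os


def read_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean feature flag from the environment (hypothetical mechanism)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


def effective_renaming_map(renaming_map: dict, shorten_column_names: bool) -> dict:
    """Return the mapping to pass to .rename(columns=...).

    An empty mapping renames nothing, so the long column names survive.
    """
    return dict(renaming_map) if shorten_column_names else {}


# Hypothetical column names, for illustration only.
renaming_map = {"Total population": "TP", "Score percentile": "SP"}

control = effective_renaming_map(renaming_map, shorten_column_names=True)
experimental = effective_renaming_map(renaming_map, shorten_column_names=False)

# Inside the ETL this would be used roughly as:
#   frame.rename(columns=effective_renaming_map(renaming_map, flag))
```

Keeping the rename call in place and varying only the mapping means the control and experimental runs go through the same code path, so any difference in output comes from the names alone.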
Definition of Done
- The two tests have been run and we've analyzed the differences in:
- Output file size
- Build time
- Anything else?
- An ADR has been drafted which outlines the findings from these tests and a recommendation for either continuing to use the short names or returning to the use of the full column names.
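For the size and build-time comparison, a small harness along these lines could be used. It is a sketch only: `measure_build`, `directory_size_bytes`, and `percent_change` are hypothetical helpers, and the `build` callable stands in for whatever command actually generates the tiles.

```python
import time
from pathlib import Path
from typing import Callable


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def measure_build(build: Callable[[], None], output_dir: Path) -> dict:
    """Run one build and report wall-clock seconds and total output bytes."""
    start = time.perf_counter()
    build()
    elapsed = time.perf_counter() - start
    return {"seconds": elapsed, "bytes": directory_size_bytes(output_dir)}


def percent_change(baseline: int, experimental: int) -> float:
    """Relative size change of the experimental run versus the baseline."""
    return 100.0 * (experimental - baseline) / baseline
```

Running `measure_build` once per flag setting, each into a clean output directory, and feeding the two byte counts to `percent_change` would give the numbers the ADR needs.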
Outstanding Questions
- If I was reading this line in the GeoScoreETL.extract() method correctly, it looks like some of the columns in self.TILE_SCORE_CSV already have the "short" name set before it's read into this ETL class, but I couldn't find where that CSV file was generated. Which ETL process generates the tile score CSV, and would we also want to replace the short names in that file with their corresponding "long" names?
- Is there an existing set of unit tests and/or mock score files I can use to set up the control and experimental conditions? Or do I need to run the full ETL pipeline to generate the scores and then convert them to tiles?
- What is the threshold for a reduction in file size being "worth" the conversion from long to short names on tile generation and vice versa in the frontend?