As a data developer, I want to understand how much tile space is saved by short names.

Open lucasmbrown-usds opened this issue 3 years ago • 2 comments

Description

Run an experiment with and without short names on the tiles to determine whether this step is really saving us much size on the tiles. If it's not saving any space, writing to short names and then decoding them on the front end is an unnecessary step.

Generate the tiles and check their file size.

lucasmbrown-usds avatar Feb 02 '22 00:02 lucasmbrown-usds

@widal001 tagging myself on this so I remember to follow up on it

widal001 avatar Jun 19 '22 20:06 widal001

@lucasmbrown-usds and @esfoobar-usds Thanks for your patience as I carved out some time to dig into this. Before running any tests, I wanted to make sure I understood the goal and the portion of the code related to this issue. I've summarized my current understanding of the objective and how to validate this question, and would love to confirm with you all that my interpretation of how to conduct this test and evaluate the results is correct:

Summary of Current Process

  • In an attempt to reduce the file size of the tiles generated by etl_score_geo.py we are mapping the default "long" name for each column to a corresponding "short" name that will be stored in the tile data.
  • The mapping of long name to short name is captured by the TILE_SCORE_COLUMNS constant in etl/score/constants.py
  • And the actual renaming of the columns from long to short happens in this step of GeoScoreETL.write_esri_shapefile()
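
To make the round trip concrete, here is a minimal sketch of the long-to-short mapping and the inverse lookup the front end would need. The column names, short codes, and values are hypothetical placeholders, not the real entries in TILE_SCORE_COLUMNS, and the helper functions are made up for illustration. The main point is that decoding only works if every short name is unique.

```python
# Hypothetical long -> short mapping; the real one is TILE_SCORE_COLUMNS.
EXAMPLE_COLUMNS = {
    "Example tract identifier": "EX_ID",
    "Example total population": "EX_TP",
    "Example percentile score": "EX_PS",
}


def invert_mapping(long_to_short):
    """Build the short -> long map a front end would need for decoding."""
    short_to_long = {}
    for long_name, short_name in long_to_short.items():
        if short_name in short_to_long:
            # Two long names sharing a short name could not be decoded.
            raise ValueError(f"duplicate short name: {short_name!r}")
        short_to_long[short_name] = long_name
    return short_to_long


def rename_keys(record, mapping):
    """Rename the keys of one record, leaving unmapped keys unchanged."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Made-up sample record, encoded for the tiles and decoded again.
record = {"Example tract identifier": "000", "Example total population": 10}
encoded = rename_keys(record, EXAMPLE_COLUMNS)
decoded = rename_keys(encoded, invert_mapping(EXAMPLE_COLUMNS))
```

If the long names were kept in the tiles, both the rename and the inverse lookup would go away, which is the simplification this issue is weighing against file size.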

Question to Test

Would preserving the "long" name of each column when we generate the tiles meaningfully increase the size of the output files, or prevent the tiles from building at all?

Testing Methodology

  • Add a feature flag shorten_column_names that when set to false will skip the .rename(columns=renaming_map) when writing self.geojson_score_usa_high to self.SCORE_SHP_FILE
  • Run two tests:
    • Control: One with the shorten_column_names flag set to true to get a baseline file size with short names
    • Experimental: A second with the shorten_column_names flag set to false to establish deviation from the baseline
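
The flag logic could look roughly like the sketch below. It works on a plain list of column names rather than the real GeoDataFrame, and `output_column_names` is a hypothetical helper, so treat it as an illustration of the control and experimental branches rather than the actual change to GeoScoreETL.write_esri_shapefile(). In the ETL itself, the equivalent would be conditionally skipping the `.rename(columns=renaming_map)` call.

```python
def output_column_names(columns, renaming_map, shorten_column_names=True):
    """Return the column names that would be written to the output file.

    When shorten_column_names is False the rename step is skipped and the
    long names pass through unchanged.
    """
    if not shorten_column_names:
        return list(columns)
    # Columns without an entry in the map (e.g. geometry) keep their name.
    return [renaming_map.get(name, name) for name in columns]


# Hypothetical mapping and columns, for illustration only.
renaming_map = {
    "Example long column name": "EX1",
    "Another long column name": "EX2",
}
columns = ["Example long column name", "Another long column name", "geometry"]

control = output_column_names(columns, renaming_map, shorten_column_names=True)
experimental = output_column_names(columns, renaming_map, shorten_column_names=False)
```

One thing that may be worth checking in the experimental run: the dBASE table inside an ESRI shapefile limits field names to 10 characters, so long names could be truncated when written to a shapefile, independent of any effect on tile size.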

Definition of Done

  • The two tests have been run and we've analyzed the differences in:
    • Output file size
    • Build time
    • Anything else?
  • An ADR has been drafted which outlines the findings from these tests and a recommendation for either continuing to use the short names or returning to the use of the full column names.
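
For the comparison itself, a small stdlib-only helper along these lines might be enough. The function names are hypothetical, and the sketch assumes each test run writes its tiles into its own output directory.

```python
import time
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def percent_change(baseline, experimental):
    """Relative change of experimental versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (experimental - baseline) / baseline * 100.0


def timed(build):
    """Run a zero-argument build callable; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = build()
    return result, time.perf_counter() - start
```

Calling `directory_size_bytes` on the control and experimental output directories and passing both numbers to `percent_change` gives a single figure to report in the ADR. Wrapping each tile build in `timed` does the same for build time.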

Outstanding Questions

  1. If I was reading this line in the GeoScoreETL.extract() method correctly, it looks like some of the columns in self.TILE_SCORE_CSV already have the "short" name set before it's read into this ETL class, but I couldn't find where that CSV file was generated. Which ETL process generates the tile score CSV, and would we also want to replace the short names in that file with their corresponding "long" names?
  2. Is there an existing set of unit tests and/or mock score files I can use to set up the control and experimental conditions? Or do I need to run the full ETL pipeline to generate the scores and then convert them to tiles?
  3. What is the threshold for a reduction in file size being "worth" the conversion from long to short names on tile generation and vice versa in the frontend?

widal001 avatar Jul 20 '22 03:07 widal001