metacatui
Design the MVP Plot Viewer for the PDG portal
This issue has two parts:
- [ ] Determine the essential features, types of plots, and data formats the plot viewer will support in its initial release.
- [ ] Create mockups of the MVP version of the Plot Viewer, including how the view will be integrated into the PDG portal.
Ongoing dialogue on the Plot Viewer: A Summary
We have discussed various aspects of the plot viewer over a long period of time on the PDG team. I've attempted to compile points from the discussions below, but some may be missing. ⭐➡️ Feedback is welcome! ⬅️⭐
Existing mockups:
- 2020 - slide 52 in the proposal draft of the visualization tools in the PDG
- Apr 2023 - 2D plots of existing data by Juliet
Examples highlighted by PDG team:
- Gapminder (Anna: this is what I imagine is our main guide in developing the PDG PlotViewer)
- USGS Viz Lab (Anna: This is mostly to see what others are doing; I do not see anything here as useful as the Gapminder tool)
- Dynamic World (Anna: This is a mapping tool, just like the ImageryViewer is a mapping tool. The PlotViewer is not a mapping tool, it is a graphing tool)
- GEE product
- Documentation for plot viewer
- Observable - streaming shapefiles
Importance of plot viewer:
- Combining datasets (2+) for lake change & IWP, or climate, or elevation, or ice content, or watersheds, etc.
- Inspire ideas for in-depth studies for why certain changes are observed in a small time period
- Showcase the most important data combinations to guide the user and limit the amount of work on the back-end
Handling large datasets:
- We can't send millions of points to the browser to be rendered in real-time, so we could:
- Divide datasets into regions that are small enough to render behind the scenes (watershed, or high ice vs med & low ice)
- May need to prepare and stage the possible plots on the backend first (preprocessing like we do for mapping) so there is no calculating in browser
- May need to aggregate data into bins/larger areas on the backend. This would reduce the number of points of data sent to the browser.
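The backend aggregation idea can be sketched in a few lines of stdlib Python; the region names and record fields here are hypothetical stand-ins for whatever grouping layer (watershed, ice content class, etc.) we end up choosing:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lake-change records: (region, year, lake_area_km2).
# In practice these would come from the preprocessed PDG datasets.
records = [
    ("high_ice", 2019, 1.2), ("high_ice", 2019, 0.8),
    ("high_ice", 2020, 1.1), ("med_low_ice", 2019, 2.4),
    ("med_low_ice", 2020, 2.1), ("med_low_ice", 2020, 2.3),
]

# Aggregate on the backend: one mean value per (region, year) bin,
# so the browser receives a handful of points instead of millions.
bins = defaultdict(list)
for region, year, area in records:
    bins[(region, year)].append(area)

aggregated = {key: round(mean(vals), 3) for key, vals in bins.items()}
```

The same grouping step could run during preprocessing (like the mapping pipeline), so the browser only ever requests the pre-binned series.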
Potential datasets & variables for MVP:
- Need to focus on datasets containing information about change (change in lake area, change in RTS, change in fire scars, weather and climate etc).
- IWP - this dataset does not contain information about change; it is a supporting dataset for understanding why change has happened, just like a DEM is a supporting dataset
- The Lake change analyses (by Ingmar) contain info about change
Discussion around plotting the Lake Change layer:
- Could aggregate the 30 million lakes into regions, by geology, permafrost type, ecoregion or any such layer
- Compare lake area, or change in lake area, and compare different regions against each other. For North Slope of AK to start with, we have different sediments and different ice contents.
- Example 1: For multi temporal data, x-axis can be time. One or two y-Variables (e.g. air temp). Can separate lake into regions.
- Example 2: lake change with climate (ERA5 precipitation and air temperature) - summer warmth vs. lake size change over a large area of the Arctic - a few hundred thousand points (too many?)
- Lake extent over time (new lakes forming, lakes growing, and lakes shrinking or even disappearing)
- Lake dynamics of the North Slope - many different landforms that we can use as supporting data (mainly two geologic regions)
- lake area over time? For 2 different regions
- Mean growth change rate of lake area? With error (stddev)?
- Pick two regions to compare? As an arbitrary footprint? Or existing boundaries like watersheds? Or by permafrost classification? Or geology classification?
- What are our response variables to plot? Carbon content of soil surrounding lakes
Features identified as important in the PDG mini-workshop survey (Apr 2023):
- Ability to plot climate variables from publicly available ERA5 reanalysis data
- Ability to plot supporting, already published geo-spatial data layers, e.g. soil, geology, vegetation, DEM, extreme weather events. Potentially use features in these datasets to subset our PDG data.
- Example: plot of permafrost thaw features (and their respective attributes such as size) versus extreme weather events or other permafrost (thaw) features, grouped by soil/veg type, elevation, or watershed
- Add legends / publication graphics export
- May want to give a DOI to plots that are generated. We may want to combine the static and dynamic plots in the archiving: generate a static version of the plot along with a link to the dynamic app that generates it.
- See responses (restricted access)
Other topics discussed that we may need to consider:
- What will be the resolution/scale of the data?
- Winsorization? Other simple transformations, like a log transform?
- What to do with regions with no data
- Units: can be extracted from the metadata for labelling. Do we need to convert units to be same for x / y axes?
- Types of plots for MVP? Scatter (2D, 3D, and 4D), Histogram, Box plot
We have discussed setting up a structured data access service on the backend of the ADC as an initial step towards integrating the plot viewer into portals. We outlined the structured data access service as the following in the Google.org proposal for the Discovery & analysis tools user interface working group:
Structured Data Access Service: Currently, model output data are in geospatial data formats, including GeoTIFF and GeoPackage, but are not optimized for delivery to analytical clients that want to access specific parameters over different spatial and temporal areas of interest. This task will create a data access service to restructure these datasets (e.g., using HDF5 or Zarr) to enable subsetting to arbitrary spatio-temporal windows, and aggregation across different resolutions, for delivery to the visualization plot viewer. (Team: Backend Engineer Fellow and Robyn, with input from Matt, Juliet) Integrating a Content-delivery Network into the structured data access service would be useful to speed up delivery of web-based geospatial data for visualization tools and scientific access. This may involve hosting data products on Google-supported cloud storage that has fast external access and Content-delivery Network functionality.
- Deliverable 1: Documentation and requirements for a series of five use cases for data access for specific plot visualizations that should be supported by a data access service
- Deliverable 2: Using a sample of several PDG datasets, design a data storage structure that allows database querying to produce the data needed for those five use cases, specifying the inputs and outputs for each
- Deliverable 3: Prototype the Structured Data Access Service against entire PDG datasets
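As a rough sketch of the two operations the service description names (windowed subsetting and aggregation across resolutions), here is a numpy stand-in; a real implementation would read chunked Zarr/HDF5 stores rather than an in-memory array, and the shapes and values here are made up for illustration:

```python
import numpy as np

# Toy stand-in for a restructured gridded dataset: a (time, y, x) cube.
rng = np.random.default_rng(0)
cube = rng.random((4, 512, 512))

# Subset to an arbitrary spatio-temporal window
# (2 time steps, a 100x100 block of cells)
window = cube[1:3, 100:200, 150:250]

# Aggregate to a coarser resolution (2x2 block means), as the service
# might do before sending data to the plot viewer
coarse = cube.reshape(4, 256, 2, 256, 2).mean(axis=(2, 4))
```

With a chunked store, the subsetting step would only touch the chunks overlapping the requested window, which is what makes delivery to the browser tractable.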
Recently, Doug from Google.org requested a sample of several different tilesets that overlap the same small region on the PDG. He plans to use these samples for testing new palette features, but importantly this incentivized me to find the best way to do this before the bounding box and plot viewer tools are built. Essentially, my task is to find a region on the portal where several tilesets overlap, then retrieve the same handful of tiles (same z, x, and y filepaths) from each of the tilesets. Doug will likely find the rasters more useful than the vectors since he is working with palettes. For the region, I chose the north slope of AK around Utqiagvik where we have overlapping layers for infrastructure, permafrost extent, ice-wedge polygons, a local news story, and others. I tried 2 approaches to retrieve the filepaths of all the tiles that fall within the region:
- Devtools and the `wget` command

Use devtools while zoomed into the region of interest on the map, with the layers of interest toggled on. Navigating to the Network tab and moving around the portal view a little refreshes the list of tiles that are retrieved from the backend. Clicking on one of the tiles in the left pane opens up details, where you can preview the tile and see the full path to the tileset from which it is retrieved.

Instead of navigating to the individual tile URLs and downloading them one by one, Ian suggested copying and pasting the filepaths into a text document and using `wget -i tileset.txt` to download them all.

- `morecantile` and `mercantile` and the `wget` command

Use these libraries in conjunction to initially find the filepaths that fall within a bbox around the ROI, then paste those filepaths into a text doc and use the `wget` command. I initially picked a lat and long pair that falls in the ROI, then input that into:
```python
mercantile.tile(lng, lat, zoom)
```
to retrieve the tile that contains that coordinate pair, then insert that output tile into:
```python
tms = morecantile.tms.get("WGS1984Quad")
tms.bounds(morecantile.Tile(x, y, z))
```
to get the bounds of that tile, then insert the output bounds and a z-level (9) into:
```python
tiles = mercantile.tiles(-156.822070, 71.266541, -156.533335, 71.344341, 9)
tiles_list = list(tiles)
```
And the output shows several tiles like:
```python
[Tile(x=32, y=108, z=9),
 Tile(x=32, y=109, z=9),
 Tile(x=33, y=108, z=9),
 Tile(x=33, y=109, z=9)]
```
which could be saved to a text file and input into the `wget` command. But retrieving the bounds of the tile that contains the coordinate pair was zooming to a different region in Cesium, so I had more accuracy manually drawing a bbox on bbox finder and copying the bounds from there. There was still a problem, though: the output tiles from `mercantile.tiles` did not exist in the tilesets that cover this region when I looked within the infrastructure layer dirs, IWP layer dirs, etc. My first thought was that no polygons overlapped the specific tiles that the mercantile package pulled for the bbox. But the package should have pulled ALL tiles that fell within the bbox for that z-level, so I should have gotten a mix of tiles that did exist in our tilesets and tiles that did not (because some had no overlapping polygons). So something else is off.
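One thing worth double-checking in the mismatch above: `mercantile` implements the Web Mercator XYZ scheme only, while the tilesets use the WGS1984Quad TMS, which has two columns at zoom 0, so indices from `mercantile.tiles` will generally not line up with WGS1984Quad directory paths (using `morecantile` with the WGS1984Quad TMS for the bbox-to-tiles step as well may resolve it). A minimal pure-Python sketch of WGS1984Quad indexing, my own computation to cross-check against actual tile paths before trusting it:

```python
import math

def wgs1984quad_tile(lon, lat, zoom):
    """Tile indices for the WGS1984Quad TMS (2 x 1 tiles at zoom 0,
    plate carree). Note this differs from mercantile, which implements
    only the Web Mercator XYZ scheme (1 x 1 tile at zoom 0)."""
    n_cols = 2 ** (zoom + 1)   # WGS1984Quad is two tiles wide at z=0
    n_rows = 2 ** zoom
    x = min(int((lon + 180.0) / 360.0 * n_cols), n_cols - 1)
    y = min(int((90.0 - lat) / 180.0 * n_rows), n_rows - 1)
    return x, y, zoom

# A point near Utqiagvik at z=9 (inside the bbox used above)
print(wgs1984quad_tile(-156.79, 71.29, 9))
```

At z=9 this gives x around 66, roughly double the x=32-33 that mercantile returned for the same longitude, which is consistent with the two schemes disagreeing.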
Another note on wget commands:
Justin suggested that the best command to download all tiles from a specified subdir is:
```
wget -r -np -nH --cut-dirs=3 -R '\?C=' -R robots.txt https://arcticdata.io/data/10.18739/{DOI}/
```
with whatever subdir you want tacked onto the end.
Hopefully this is helpful moving forward as we design the bbox drawing tool and plot viewer.
To follow up on my quest to query several PDG tilesets for both GeoTIFF and PNG files for a small region, I used an approach similar to the first approach outlined in the previous comment: a combination of devtools (to identify which z, x, and y tiles are of interest) and a python script that downloads the files by iteratively executing wget commands. I uploaded these dataset samples to a Google Drive for Doug. The python scripts are below.
download_geotiff_tiles.py
```python
# Download a subset of tilesets (GeoTIFFs) for several PDG layers
# Steps:
#   1. On the PDG portal, toggle on a layer of interest and zoom in
#   2. Open devtools & the Network tab to display tiles in view
#      (may need to pan around the viewer to refresh the list)
#   3. Manually copy the end of the filepaths, just the z & x dirs
#      (see the Headers subtab) for ONLY the tiles from the data layer
#      (not the base layer tiles), and paste them into this script below
#   4. Prepend the z and x dirs with the URL to the GeoTIFF dir with {DOI}
#   5. Make a list of DOIs for all desired GeoTIFF tilesets in this script
#   6. Execute the script

from subprocess import Popen

# list DOIs for PDG layers:
#   IWP,
#   Lake Size Time Series (bands for both Seasonal Water and Permanent Water),
#   Infrastructure
DOIs = ["A2KW57K57/iwp_geotiff_high",
        "A28G8FK10/yr2021/geotiff",
        "A21J97929/geotiff"]

for DOI in DOIs:
    print(f"Downloading files for DOI: {DOI}")
    # list tile URLs for z and x dirs (copied from devtools)
    URLs = [f"https://arcticdata.io/data/10.18739/{DOI}/WGS1984Quad/12/528/",
            f"https://arcticdata.io/data/10.18739/{DOI}/WGS1984Quad/12/527/",
            f"https://arcticdata.io/data/10.18739/{DOI}/WGS1984Quad/10/131/",
            f"https://arcticdata.io/data/10.18739/{DOI}/WGS1984Quad/12/529/",
            f"https://arcticdata.io/data/10.18739/{DOI}/WGS1984Quad/11/263/"]
    for URL in URLs:
        print(f"Downloading files for URL: {URL}")
        # download all tiles in the URL dir to the current working dir
        cmd = ["wget", "-r", "-np", "-nH", "-A", "*.tif",
               "--cut-dirs=2", "-R", "wget-log.*", URL]
        process = Popen(cmd)
        process.wait()

print("Script complete.")
```
download_png_tiles.py
```python
# Download a subset of tilesets (PNGs) for several PDG layers
# Steps:
#   1. On the PDG portal, toggle on a layer of interest and zoom in
#   2. Open devtools & the Network tab to display tiles in view
#      (may need to pan around the viewer to refresh the list)
#   3. Manually copy the end of the filepaths, just the z & x dirs
#      (see the Headers subtab) for ONLY the tiles from the data layer
#      (not the base layer tiles), and paste them into this script below
#   4. Prepend the z and x dirs with the URL to the PNG dir with {DOI}
#   5. Make a list of DOIs for all desired PNG tilesets in this script
#   6. Execute the script

from subprocess import Popen

# list DOIs for PDG layers:
#   IWP,
#   Lake Size Time Series (both Seasonal Water and Permanent Water),
#   Infrastructure
DOIs = ["A2KW57K57",
        "A28G8FK10/yr2021/web_tiles/seasonal_water",
        "A28G8FK10/yr2021/web_tiles/permanent_water",
        "A21J97929/SACHI_v2/web_tiles/infrastructure_code"]

for DOI in DOIs:
    print(f"Downloading files for DOI: {DOI}")
    # list tile URLs for z and x dirs (copied from devtools)
    URLs = [f"https://arcticdata.io/data/tiles/10.18739/{DOI}/WGS1984Quad/12/528/",
            f"https://arcticdata.io/data/tiles/10.18739/{DOI}/WGS1984Quad/12/527/",
            f"https://arcticdata.io/data/tiles/10.18739/{DOI}/WGS1984Quad/10/131/",
            f"https://arcticdata.io/data/tiles/10.18739/{DOI}/WGS1984Quad/12/529/",
            f"https://arcticdata.io/data/tiles/10.18739/{DOI}/WGS1984Quad/11/263/"]
    for URL in URLs:
        print(f"Downloading files for URL: {URL}")
        # download all tiles in the URL dir to the current working dir
        cmd = ["wget", "-r", "-np", "-nH", "-A", "*.png",
               "--cut-dirs=3", "-R", "wget-log.*", URL]
        process = Popen(cmd)
        process.wait()

print("Script complete.")
```