Do a global run of embeddings

Open brunosan opened this issue 8 months ago • 10 comments

We've been using Clay v1 embeddings both directly and via the Build/Explore apps. We've also done several types of partial benchmarking, so we are starting to feel comfortable with the quality of the model. We should therefore think about running the model over large amounts of existing open data to create embeddings, both so that we keep learning about Clay and so that the community can use these open embeddings.

If we go ahead with large runs, we still need to decide a few things:

  • Instrument? Sentinel? NAIP? One instrument, or a couple?
  • Unit of schema? Should we do them at the file level, or by spatial reference?
  • Spatial resolution? We've seen that many applications need the highest possible spatial resolution. Hence, if Sentinel, a small tile size (but not so small that the embeddings lose quality). 128x128?
  • Locations, time? Large coverage seems most important, but many users also ask for temporal changes. So I suggest either wide spatial coverage only, or spending 80% on a large-coverage run and the remaining 20% on many snapshots over time.
  • What format? I propose we wait and follow guidance from @cholmes on https://github.com/cloudnativegeo/geo-embeddings-survey
  • Hosting? source.coop
  • License? Open. Is CC-BY best? OpenRAIL-M?
  • What is the cost of creation? It would be great to come up with a number.
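To get a first number for the cost question, a back-of-the-envelope estimate could look like the sketch below. It assumes non-overlapping square chips, Sentinel-2 at 10 m ground sampling distance, and the 128x128 chip size floated above. The function names are hypothetical, and the area, throughput, and hourly price in the usage lines are placeholders rather than measured values; they would need to be replaced with real benchmarks.

```python
import math


def chips_for_area(area_km2, chip_px=128, gsd_m=10.0):
    """Number of non-overlapping square chips needed to cover area_km2.

    chip_px and gsd_m are assumptions (128x128 chips at 10 m per pixel).
    """
    chip_area_m2 = (chip_px * gsd_m) ** 2
    return math.ceil(area_km2 * 1_000_000 / chip_area_m2)


def run_cost_usd(n_chips, chips_per_second, usd_per_hour):
    """Compute cost for n_chips given a throughput and an hourly price.

    Both chips_per_second and usd_per_hour are inputs to be measured,
    not known properties of the model or of any cloud provider.
    """
    hours = n_chips / chips_per_second / 3600.0
    return hours * usd_per_hour


def split_budget(n_chips, spatial_fraction=0.8):
    """Split a chip budget into (wide-coverage, temporal-snapshot) parts."""
    spatial = round(n_chips * spatial_fraction)
    return spatial, n_chips - spatial


if __name__ == "__main__":
    # Placeholder inputs: a 1,000,000 km2 region, 100 chips/s, 1 USD/hour.
    n = chips_for_area(1_000_000)
    print("chips:", n)
    print("80/20 split:", split_budget(n))
    print("cost (USD):", run_cost_usd(n, chips_per_second=100.0, usd_per_hour=1.0))
```

This only counts inference time for one pass. Data download, storage, and any re-runs for the temporal snapshots would come on top.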

Ideally, we wrap this code so that it is easy to re-run down the line, e.g. taking a STAC list and a spec file for chip_size, ... Note: let's not over-scope here, since we have the Build app.
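As a rough illustration of the "STAC list plus spec file" idea, here is a minimal sketch using only the standard library. The spec keys (`chip_size`, `stac_items`), the `RunSpec` class, and the helper names are all hypothetical, not an existing Clay interface. It only plans the work: it turns a spec into a list of (item, pixel window) pairs and leaves reading the imagery and running the model to whatever we already have.

```python
import json
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# A pixel window as (col_offset, row_offset, width, height).
Window = Tuple[int, int, int, int]


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical run spec: chip size in pixels and a list of STAC item URLs."""
    chip_size: int
    stac_items: List[str]


def load_spec(text: str) -> RunSpec:
    """Parse a JSON spec string into a RunSpec."""
    raw = json.loads(text)
    return RunSpec(chip_size=int(raw["chip_size"]), stac_items=list(raw["stac_items"]))


def chip_windows(width: int, height: int, chip_size: int) -> Iterator[Window]:
    """Yield non-overlapping full-size windows; partial edge chips are dropped."""
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            yield (col, row, chip_size, chip_size)


def plan_run(spec: RunSpec, width: int, height: int) -> List[Tuple[str, Window]]:
    """Pair every STAC item with every chip window.

    Assumes all items share the same pixel dimensions, which holds for
    same-resolution scenes on a fixed grid but is an assumption here.
    """
    windows = list(chip_windows(width, height, spec.chip_size))
    return [(item, window) for item in spec.stac_items for window in windows]


if __name__ == "__main__":
    # Placeholder item URLs; a Sentinel-2 tile at 10 m is 10980x10980 pixels.
    spec = load_spec('{"chip_size": 128, "stac_items": ["item-a", "item-b"]}')
    plan = plan_run(spec, width=10980, height=10980)
    print(len(plan), "chips planned")
```

Dropping partial edge chips keeps every embedding on a full 128x128 input. Padding or overlapping the last row and column would be the alternative if edge coverage matters.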

Probably out of scope, but the end-state at some point this year could be:

  • Sentinel-2 annual composites for EU
  • Sentinel-2 Level-2 files for a deforestation basin in the Amazon, with as many dates as possible.
  • Same as above but Sentinel-1 files, or Landsat composites.
  • NAIP for whole states once.
  • NAIP for one state as many years as available.

Filing this early to allow community requests, but we should aim to set a date for such a run, e.g. end of June.

brunosan, Jun 11 '24 21:06