
Proposal: Extensible Artifact Model


Challenge

Vizier's current data model is:

  1. Tightly coupled to Apache Spark: This brings in a 600MB dependency (technically 1.2GB, since pip ends up installing it a second time for Python compatibility).
  2. Very ad-hoc: Type translations are developed organically, on an as-needed basis.
  3. Reliant on 'canonical' types: Every data value has a canonical type. This often necessitates redundant or unnecessarily proactive translations, most commonly with the Dataset type. For example, instead of simply letting Pandas interpret a LoadDataset('csv') with pd.read_csv, we have to go through Spark.
  4. No notion of multiple-role objects: For example, a CSV file is a file, but it could also represent a dataframe defined over that file. Presently it's possible to have both, but a separate artifact is needed for each role.
  5. No support for transient artifacts --- artifacts created temporarily as cache.

Proposal Summary

  1. Provide Interfaces, Implementations, and Rust-style Into[]/From[] adaptors (a sketch follows this list), mainly with an eye towards decoupling how Vizier and language servers interact with artifacts (Interfaces/Mixins) from the underlying representation of the artifact.
  2. Introduce the notion of 'cache' artifacts
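
As a concrete illustration of the Into[]/From[] idea, here is a minimal sketch of what the adaptors might look like as Scala typeclasses. Everything here (From, into, CsvFile, InlineTable) is hypothetical, not existing Vizier code:

```scala
// Hypothetical encodings, for illustration only.
case class CsvFile(path: String)
case class InlineTable(rows: Seq[Seq[String]])

// Rust-style From: "a B can be built from an A".
trait From[A, B] { def from(a: A): B }

object From {
  // Derive `into`, mirroring Rust's blanket `impl Into for T where U: From<T>`.
  implicit class IntoOps[A](val a: A) extends AnyVal {
    def into[B](implicit ev: From[A, B]): B = ev.from(a)
  }
}

object Conversions {
  // One registered conversion: parse a CSV file into an inline table.
  implicit val csvToInline: From[CsvFile, InlineTable] =
    (csv: CsvFile) => InlineTable(
      scala.io.Source.fromFile(csv.path).getLines()
        .map(_.split(",").toSeq).toSeq
    )
}

// Usage: import From._, Conversions._
//   val table = CsvFile("data.csv").into[InlineTable]
```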

Concrete Proposal

The core idea is to decouple the physical representation of an artifact from the ways in which user code interacts with it. This breaks down into four concepts (sketched in Scala after the list):

  • Encoding: The physical representation of the artifact
  • Interface: A conceptual 'role' that an artifact may play (e.g., Dataset, Image, or Integer), defined as a set of methods.
  • Implementation: An implementation of an Interface's methods for a specific Encoding (or in terms of another Interface).
  • Conversion: Code that translates one Encoding into another Encoding (or an Interface into an Encoding)
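
A minimal sketch of how these four concepts might be rendered as Scala traits; all names are illustrative, not a committed API:

```scala
// The physical representation of an artifact, as stored in the database.
trait Encoding {
  def name: String                 // e.g., "csv-file", "inline-json"
}

// A conceptual role an artifact may play (Dataset, Image, Integer, ...).
trait Interface

// Implements Interface I for artifacts stored with Encoding E
// (Interface -> Interface implementations would be a second variant).
trait Implementation[E <: Encoding, I <: Interface] {
  def bind(encoded: E): I
}

// Translates one Encoding into another.
trait Conversion[A <: Encoding, B <: Encoding] {
  def convert(a: A): B
}
```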

Encoding

At present, Vizier's representation of artifacts consists of a small, opaque blob of text data (typically JSON). These blobs are interpreted based on the specific type of artifact, but the interpretation is entirely unstructured and performed on read. There is no common structure to the artifacts. This, in particular, makes things like reachability checks hard, since inter-artifact dependencies (e.g., a SQL query over existing tables) always need to be implemented ad-hoc.

The first major goal is to define a schema definition language for Artifacts (one possible shape is sketched after this list). The schema definition needs to capture:

  • Serialization Standards (e.g., how the structure maps to JSON)
  • Type constraints (e.g., signed-ness and bit length of integers)
  • Nested dependencies (e.g., references to artifacts, and large-content/blob data on which the artifact depends)
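
One possible shape for such a schema language, expressed as Scala case classes; the variant and field names here are assumptions, not a design commitment:

```scala
sealed trait FieldType
case class IntField(signed: Boolean, bits: Int)  extends FieldType // type constraints
case class StringField(maxLength: Option[Int])   extends FieldType
case object ArtifactRef                          extends FieldType // reference to another artifact
case object BlobRef                              extends FieldType // large-content/blob dependency
case class Struct(fields: Seq[(String, FieldType)]) extends FieldType

case class EncodingSchema(
  name: String, // identifies the encoding
  root: Struct  // the structure, which also fixes the JSON serialization
)
```

Because artifact and blob references are explicit in the schema, dependency extraction (and hence reachability checking) becomes a generic walk over the schema, rather than per-type ad-hoc code.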

Then, we define encodings for all of the existing artifact types, perhaps strengthening them somewhat (e.g., explicitly typed primitives, instead of generic parameters).

To emphasize the point, an encoding simply gives a name to the physical manifestation of the artifact, and dictates how it is stored in the database. This should be the minimum required to reproduce the artifact (see Artifact Caching below), and should disregard any data that is only needed for efficiency (e.g., store the URL of a file, but not its contents).

Some TODOs:

  • Design the schema language; implement it as Scala Case Classes, or similar.
  • Map all existing Artifact Types into the Encoding framework
  • Replace ArtifactType and its kin in the columns of the Artifact table with a reference to the Encoding used for the artifact.
  • Replace the hodgepodge of Artifact.describe / summarize with something more sensible based on the encoding.
  • Remove all references to SparkSchema/SparkPrimitive, replacing them with references to Encodings. In particular, Dataset schemas should be based on Encodings rather than Spark DataTypes.

Interface

At present, Vizier uses ArtifactType and MIME types to differentiate the different roles that an artifact can play. The Interface plays a similar role, by dictating a specific API to which an artifact can conform (i.e., governing how Vizier, its subsystems, and the user interact with it). Some examples include (see the sketch after this list):

  • Dataset (Consider the many types of dataset we have right now)
  • Image (png, jpg, etc...)
  • File (independent of format)
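
To make this concrete, here is a sketch of two such Interfaces, reusing the trait names from the earlier sketches; the method sets are hypothetical:

```scala
trait Dataset extends Interface {
  def schema: Seq[(String, FieldType)]              // column names and types
  def rowCount: Long
  def page(offset: Long, limit: Int): Seq[Seq[Any]] // paged row access
}

trait Summary extends Interface {
  def describe: String                              // could replace Artifact.describe
}
```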

Some TODOs:

  • Design the interface language; implement it as Scala Case Classes, or similar
  • Design Interfaces for all existing ArtifactTypes
  • Add a 'Summary' interface.
  • Allow Interfaces to provide descriptions (e.g., to replace Artifact.describe)

Implementation

(An Encoding -> Interface, or Interface -> Interface edge)

In order to decouple Encoding and Interface, we need a binding between the two. Somewhere in the code, we need to be able to define code that implements a specific interface for a specific encoding (e.g., how do I get the Spark dataframe for a CSV file? How do I get the Arrow dataframe? etc.).

Some TODOs:

  • We'll need a router: something to figure out which Implementation to invoke for a particular Encoding/Interface pair (a naive sketch follows). This becomes harder if we want to allow Implementations from Interface to Interface.
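
A naive sketch of such a router, keyed on runtime classes; all names are hypothetical, and Interface -> Interface edges would additionally require a search (e.g., a BFS over registered edges):

```scala
import scala.collection.mutable

object Router {
  // Registry of (Encoding class, Interface class) -> Implementation.
  private val impls =
    mutable.Map.empty[(Class[_], Class[_]), Implementation[_ <: Encoding, _ <: Interface]]

  def register[E <: Encoding, I <: Interface](
    e: Class[E], i: Class[I], impl: Implementation[E, I]
  ): Unit = impls((e, i)) = impl

  def lookup[E <: Encoding, I <: Interface](
    e: Class[E], i: Class[I]
  ): Option[Implementation[E, I]] =
    impls.get((e, i)).map(_.asInstanceOf[Implementation[E, I]])
}
```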

Conversion

(An Encoding -> Encoding edge)

This is more or less the same as an Implementation, save that it generates a new Encoding (and, consequently, additional data).

Platform Interactions

Generic artifacts necessitate decoupling Vizier from its target platforms, including Spark (but also Scala and Python). This means that we need a code component to translate an Encoding of an artifact into the platform-native equivalent. The natural approach here is to define a set of tiered fallbacks (sketched after the list):

  1. Platform-provided logic for directly translating an encoding into a platform-native representation (e.g., CSV File -> Spark Dataframe)
  2. Fall back to platform-provided logic for translating an encoding that implements a specific interface into a platform-native representation (e.g., Function)
  3. Fall back through conversions to an encoding that is supported by case 1 or 2 (e.g., convert the dataframe to Arrow, then Arrow -> Spark).
  4. Fall back to just providing the encoding directly (e.g., as the JSON-serialized artifact).
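
A sketch of this resolution order, reusing the trait names from the earlier sketches; Platform and its hooks are hypothetical:

```scala
trait Platform {
  // Tier 1: direct, encoding-specific translation (e.g., CSV file -> Spark dataframe).
  def fromEncoding(e: Encoding): Option[Any]
  // Tier 2: translation via an Interface the encoding implements (e.g., Function).
  def fromInterface(i: Interface): Option[Any]
}

object PlatformBridge {
  def toNative(
    platform: Platform,
    e: Encoding,
    interfaceOf: Encoding => Option[Interface], // resolved via the Implementation router
    convert: Encoding => Option[Encoding]       // resolved via registered Conversions
  ): Any =
    platform.fromEncoding(e)                                  // tier 1
      .orElse(interfaceOf(e).flatMap(platform.fromInterface)) // tier 2
      .orElse(convert(e).flatMap(platform.fromEncoding))      // tier 3
      .getOrElse(e)                                           // tier 4: raw encoding
}
```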

Artifact Caching

[more to come]
