Table and DataSet offset annotations

Open colin-alexa opened this issue 1 year ago • 0 comments

TL;DR

This adds support for tabular ("table-like", e.g. csv- or sql-compatible) data in offset documents, along with a simple way to display that data as a table. Conversion and rendering logic is added for GitHub-flavored markdown tables and for HTML tables that are exactly equivalent to those produced by GitHub-flavored markdown.

The Pitch

The initial motivation for this change came from a desire to support markdown tables. As we considered the possibilities of table formatting we realized we didn't want to support tables as a general purpose layout tool. Rather, we felt it was more interesting and more useful to provide facilities around tabular data, or data which resembles a database table.

HTML and markdown tables are one obvious potential instance of tabular data, but the more interesting possibilities lie outside of the obvious. Well-structured tabular data can be used to power charts and graphs and other data visualizations. Even just more visually-complex table-like views (like product comparison tables) become simple to represent once we have a way of reasoning about tabular data in and of itself in the document.

In this paradigm, a typical rows and columns table layout is just one straightforward visualization of a set of tabular data, not conceptually distinct from something like a pie chart.

So, we landed on the DataSet annotation, a block annotation in the offset-annotations source which represents a collection of tabular data:

export class DataSet extends BlockAnnotation<{
  /**
   * A human-readable way to identify the dataset
   */
  name?: string;

  /**
   * A mapping of column names to ColumnType enum values where:
   *  the key is a unique, human-readable string
   *  the value indicates how the `jsonValue` of corresponding
   *   fields in the `records` array should be interpreted
   */
  schema: Record<string, ColumnType>;

  /**
   * An ordered list of records, using the column
   * `name`s as the keys. The values are objects referring to
   * the contents of the cell with a slice id alongside a serialized
   * representation of the cell in `jsonValue`
   */
  records: Record<string, { slice: string; jsonValue: JSON } | undefined>[];
}> {
  static vendorPrefix = "offset";
  static type = "data-set";
}
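To make the attribute shape concrete, here is a minimal sketch of the data a DataSet might carry. It is illustrative only: the `ColumnType` members, the `DataSetAttributes` interface, the slice ids, and the `undeclaredColumns` helper are all assumptions made for this example, and `unknown` stands in for the `JSON` type used in the declaration above.

```typescript
// Hypothetical stand-in for the ColumnType enum; its real members are not
// shown in this issue.
enum ColumnType {
  Text = "text",
  Number = "number",
}

// Mirrors the attribute shape declared on DataSet above (assumed interface name).
interface DataSetAttributes {
  name?: string;
  schema: Record<string, ColumnType>;
  records: Record<string, { slice: string; jsonValue: unknown } | undefined>[];
}

// Example attributes; the slice ids ("s1", "s2", ...) are made up.
const pets: DataSetAttributes = {
  name: "pets",
  schema: { name: ColumnType.Text, age: ColumnType.Number },
  records: [
    { name: { slice: "s1", jsonValue: "Rex" }, age: { slice: "s2", jsonValue: 3 } },
    // A cell may be absent, which the record type allows via `undefined`.
    { name: { slice: "s3", jsonValue: "Tom" }, age: undefined },
  ],
};

// Hypothetical consistency check: lists record keys missing from the schema.
function undeclaredColumns(dataSet: DataSetAttributes): string[] {
  const undeclared: string[] = [];
  for (const record of dataSet.records) {
    for (const key of Object.keys(record)) {
      const declared = Object.prototype.hasOwnProperty.call(dataSet.schema, key);
      if (!declared && undeclared.indexOf(key) === -1) {
        undeclared.push(key);
      }
    }
  }
  return undeclared;
}
```

Each record is keyed by the same column names as `schema`, so a check like `undeclaredColumns(pets)` returning an empty array is one way to confirm the two stay in sync.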

A DataSet doesn't have any inherent rendering semantics. In rendering formats like HTML and CommonMark(*), the DataSet generator simply emits an empty string(**). To represent the DataSet visually, the document must include an annotation for the intended visual representation. For instance, a Table:

export class Table extends BlockAnnotation<{
  /**
   * The id of the DataSet from which the table data comes
   */
  dataSet: string;

  /**
   * Configuration for the columns of the table. The `name` fields correspond
   * to the keys of the `schema` field in the DataSet. The order of
   * the columns in this array can be used to reorder columns from the
   * original dataset, and excluding columns from this array will exclude
   * them from rendering.
   */
  columns: Array<{
    name: string;
    slice?: string;
    textAlign?: "left" | "right" | "center";
  }>;

  /**
   * Tables may decide whether or not to display the column headers
   * on the underlying data.
   */
  showColumnHeaders?: boolean;
}> {
  static vendorPrefix = "offset";
  static type = "table";
}

The logic concerning the actual visual representation of the data belongs to the Table annotation. The Table selects which columns of the data to display, and in what order to display them. It also contains formatting information about how to align the text in each column, and has pointers to slice annotations in the document to control how the column headers are textually rendered.
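As a rough illustration of that division of labor, the sketch below renders records as a GitHub-flavored markdown table using only a Table-like configuration. It is not the renderer from this PR: `renderMarkdownTable` and `TableAttributes` are hypothetical names, cell and header text are taken from `jsonValue` and the column `name` rather than resolved from slices, an unset `showColumnHeaders` is assumed to mean "show", and pipe characters in cell text are not escaped.

```typescript
type Cell = { slice: string; jsonValue: unknown } | undefined;
type Align = "left" | "right" | "center";

// Assumed subset of the Table attributes shown above.
interface TableAttributes {
  columns: Array<{ name: string; slice?: string; textAlign?: Align }>;
  showColumnHeaders?: boolean;
}

function renderMarkdownTable(
  records: Record<string, Cell>[],
  table: TableAttributes
): string {
  const row = (cells: string[]): string => `| ${cells.join(" | ")} |`;

  // GFM encodes alignment in the delimiter row.
  const delimiter = (align?: Align): string =>
    align === "left" ? ":---"
    : align === "right" ? "---:"
    : align === "center" ? ":---:"
    : "---";

  // GFM tables always have a header row, so hidden headers become empty cells.
  const header = table.columns.map((column) =>
    table.showColumnHeaders === false ? "" : column.name
  );

  const lines = [
    row(header),
    row(table.columns.map((column) => delimiter(column.textAlign))),
  ];

  // Only the columns listed in the Table are emitted, in the Table's order.
  for (const record of records) {
    lines.push(
      row(
        table.columns.map((column) => {
          const cell = record[column.name];
          return cell ? String(cell.jsonValue) : "";
        })
      )
    );
  }
  return lines.join("\n");
}
```

Passing `columns: [{ name: "species", textAlign: "right" }, { name: "name" }]` against records that also contain an `age` field would reorder the first two columns and omit `age`, without touching the underlying data.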

* Technically, tables aren't a CommonMark feature, but as a common markdown extension they are included in the CommonMark renderer for simplicity.

** In other rendering contexts, however, it might make sense to produce rendered output for the DataSet. For instance, one could imagine a React renderer that represents a DataSet as a button to download a .csv file of the data.
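For the .csv idea, the serialization half could look roughly like the sketch below. Nothing here is part of this PR: `dataSetToCsv` and `csvField` are hypothetical helpers, cell text is assumed to come from `jsonValue`, and quoting follows the usual CSV convention of wrapping fields that contain commas, quotes, or line breaks and doubling embedded quotes.

```typescript
type Cell = { slice: string; jsonValue: unknown } | undefined;

// Quote a field only when it contains a comma, a double quote, or a line break.
function csvField(value: unknown): string {
  const text = value === undefined || value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize the named columns of a set of records, header line first.
function dataSetToCsv(
  columnNames: string[],
  records: Record<string, Cell>[]
): string {
  const lines = [columnNames.map(csvField).join(",")];
  for (const record of records) {
    const fields = columnNames.map((name) => {
      const cell = record[name];
      // Missing cells become empty fields.
      return csvField(cell ? cell.jsonValue : "");
    });
    lines.push(fields.join(","));
  }
  return lines.join("\r\n");
}
```

A renderer could then hand the resulting string to whatever download mechanism its environment provides.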

Implementation notes

Markdown support is the primary constraint in our pipeline on the kinds of datasets and visualizations we can support in our internal content. Because the conceptual model is based on markdown tables, our support for HTML tables is more constrained than it strictly needs to be. Specifically, HTML tables may have complex header structures, cells merged across multiple columns, and other formatting oddities that have arguably sensible interpretations in a tabular format but are unrepresentable in markdown. The code in this PR attempts to identify HTML tables with irregular head structures, but the behavior for any unsupported body formatting is officially unspecified.

The primary concession to the freeform nature of HTML tables is that the Table annotation in this PR has a flag for whether or not to display any head section on the table at all, and the parsing and conversion logic has a fallback for such cases where the produced DataSet will have generic column names like column 1, column 2, etc.
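A minimal sketch of that fallback, assuming the parser has already reduced a headless table to rows of cell text, might look like this. `fallbackColumnNames` and `recordsFromHeadlessRows` are hypothetical names, and the real conversion produces slice-backed cells rather than plain strings.

```typescript
// Generate "column 1", "column 2", ... for a table with no usable head section.
function fallbackColumnNames(columnCount: number): string[] {
  const names: string[] = [];
  for (let i = 1; i <= columnCount; i++) {
    names.push(`column ${i}`);
  }
  return names;
}

// Key each row's cells by the generated names; short rows leave cells undefined.
function recordsFromHeadlessRows(rows: string[][]): {
  columnNames: string[];
  records: Record<string, string | undefined>[];
} {
  const width = rows.reduce((max, row) => Math.max(max, row.length), 0);
  const columnNames = fallbackColumnNames(width);
  const records = rows.map((row) => {
    const record: Record<string, string | undefined> = {};
    columnNames.forEach((name, index) => {
      record[name] = row[index];
    });
    return record;
  });
  return { columnNames, records };
}
```

A Table pointing at such a DataSet would presumably set `showColumnHeaders` to `false` so the generated names never appear in the rendered output.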

Mar 05 '24 22:03 colin-alexa