Import API
We want to add a nice REST API endpoint to import from URLs or uploaded files as in the UI. Eventually this should also support importing from other Grist documents, so this is related to https://github.com/gristlabs/grist-core/issues/416. For starters I just want to propose a design and hear what others think, particularly @fflorent.
For reference, here are the relevant types in the current code:
function finishImportFiles(
  dataSource: DataSourceTransformed,
  prevTableIds: string[],
  importOptions: ImportOptions,
): ImportResult;

interface DataSourceTransformed {
  // Identifies the upload, which may include multiple files.
  uploadId: number;
  // For each file in the upload, the transform rules for that file.
  transforms: TransformRuleMap[];
}

interface TransformRuleMap {
  [origTableName: string]: TransformRule;
}

const NEW_TABLE = null;
const SKIP_TABLE = "";
type DestId = string | typeof NEW_TABLE | typeof SKIP_TABLE;

// How to import data into an existing table or a new one.
interface TransformRule {
  // The destination table for the transformed data. If null, the data is imported into a new table.
  destTableId: DestId;
  // The list of columns to update (existing or new columns).
  destCols: TransformColumn[];
  // The list of columns to read from the source table (just the headers name).
  sourceCols: string[];
}

interface TransformColumn {
  // Label of the column to update. For new table it is the same name as the source column.
  label: string;
  // Column id to update (null for a new table).
  colId: string|null;
  // Type of the column (important for new columns).
  type: string;
  // Formula to apply to the target column.
  formula: string;
  // Widget options when we need to create a column (copied from the source).
  widgetOptions: string;
}

interface ImportOptions {
  parseOptions?: ParseOptions; // Options for parsing the source file.
  mergeOptionMaps?: MergeOptionsMap[]; // Options for merging fields, indexed by uploadFileIndex.
}

interface MergeOptionsMap {
  // Map of original GristTable name of imported table to its merge options, if any.
  [origTableName: string]: MergeOptions|undefined;
}

interface MergeOptions {
  mergeCols: string[]; // Columns to use as merge keys for incremental imports.
  mergeStrategy: MergeStrategy; // Determines how matched records should be merged between 2 tables.
}

interface MergeStrategy {
  type: 'replace-with-nonblank-source' | 'replace-all-fields' | 'replace-blank-fields-only';
}

interface ImportResult {
  options: ParseOptions;
  tables: ImportTableResult[];
}

interface ImportTableResult {
  hiddenTableId: string;
  uploadFileIndex: number; // Index into upload.files array, for the file responsible for this table.
  origTableName: string;
  transformSectionRef: number;
  destTableId: string|null;
}
Here is the same thing but with types inlined, which I think makes it a bit more digestible:
function finishImportFiles(
  dataSource: {
    // Identifies the upload, which may include multiple files.
    uploadId: number;
    // For each file in the upload, the transform rules for that file.
    transforms: {
      [origTableName: string]: {
        // The destination table for the transformed data. If null, the data is imported into a new table.
        destTableId: string | typeof NEW_TABLE | typeof SKIP_TABLE;
        // The list of columns to update (existing or new columns).
        destCols: {
          // Label of the column to update. For new table it is the same name as the source column.
          label: string;
          // Column id to update (null for a new table).
          colId: string | null;
          // Type of the column (important for new columns).
          type: string;
          // Formula to apply to the target column.
          formula: string;
          // Widget options when we need to create a column (copied from the source).
          widgetOptions: string;
        }[];
        // The list of columns to read from the source table (just the headers name).
        sourceCols: string[];
      };
    }[];
  },
  prevTableIds: string[],
  importOptions: {
    parseOptions?: ParseOptions; // Options for parsing the source file.
    mergeOptionMaps?: {
      // Map of original GristTable name of imported table to its merge options, if any.
      [origTableName: string]: {
        mergeCols: string[]; // Columns to use as merge keys for incremental imports.
        mergeStrategy: {
          type: 'replace-with-nonblank-source' | 'replace-all-fields' | 'replace-blank-fields-only';
        }; // Determines how matched records should be merged between 2 tables.
      } | undefined;
    }[]; // Options for merging fields, indexed by uploadFileIndex.
  },
): {
  options: ParseOptions;
  tables: {
    hiddenTableId: string;
    uploadFileIndex: number; // Index into upload.files array, for the file responsible for this table.
    origTableName: string;
    transformSectionRef: number;
    destTableId: string | null;
  }[];
};
const NEW_TABLE = null;
const SKIP_TABLE = "";
ParseOptions is a weird case. Officially it’s this:
/**
 * ParseOptions contains parse options depending on plugin,
 * number of rows, which is special option that can be used for any plugin
 * and schema for generating parse options UI
 */
interface ParseOptions {
  NUM_ROWS?: number;
  SCHEMA?: ParseOptionSchema[];
  WARNING?: string; // Only on response, includes a warning from parsing, if any.
}

/**
 * ParseOptionSchema contains information for generating parse options UI
 */
interface ParseOptionSchema {
  name: string;
  label: string;
  type: string;
  visible: boolean;
}
However NUM_ROWS isn’t really used, and SCHEMA and WARNING are returned by the Python parsing code rather than used for parsing. The actual options used for parsing are represented by this type:
type ParseOptionValueType = boolean|string|number;

interface ParseOptionValues {
  [name: string]: ParseOptionValueType;
}
client/components/Importer.ts has a field _parseOptions of type Observable<ParseOptions>, so supposedly it's sending ParseOptions to the server. But the field actually contains the meaningful option values, as indicated by a cast: this._parseOptions.get() as ParseOptionValues. The actual type is only determined at runtime, as it depends on the contents of SCHEMA.
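To make that relationship concrete, here is a small illustrative sketch (the option names below are invented, not the ones any real plugin uses): SCHEMA declares which options exist, and the object the client sends back simply maps those names to values, so its exact shape is only known at runtime.

// Hypothetical SCHEMA returned by a parser plugin; the option names are made up.
const schema: ParseOptionSchema[] = [
  {name: 'field_separator', label: 'Field separator', type: 'string', visible: true},
  {name: 'first_row_is_header', label: 'First row is header', type: 'boolean', visible: true},
];

// The options the client actually sends back are keyed by those names.
const parseOptions: ParseOptionValues = {
  field_separator: ';',
  first_row_is_header: true,
};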
URLs
1. POST /docs/:docId/imports/upload to upload one or more files. The body is in the same format as uploading attachments. It returns a single ‘import ID’, i.e. uploadId in existing code. Note that unlike attachments, only a single ID is returned, even if multiple files are uploaded.
2. POST /docs/:docId/imports with the request body below to actually perform the import.
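For illustration, the upload step might be called from a script roughly like this (a hedged sketch: the multipart field name, the /api prefix, and the shape of the response are all assumptions, since the proposal doesn't pin them down):

// Rough sketch of step 1. The 'upload' field name mirrors the attachments
// endpoint and is an assumption, as is returning a bare numeric uploadId.
async function uploadForImport(server: string, docId: string, apiKey: string,
                               file: Blob, filename: string): Promise<number> {
  const form = new FormData();
  form.append('upload', file, filename);
  const resp = await fetch(`${server}/api/docs/${docId}/imports/upload`, {
    method: 'POST',
    headers: {Authorization: `Bearer ${apiKey}`},
    body: form,
  });
  return await resp.json();
}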
Request body
type RequestBody = {
  source: { upload: number } | { url: string };
  // Replaces the `transforms` key from before
  // To skip a table, simply omit it from this array.
  tables: Array<{
    // Replaces the [origTableName: string] key from before.
    // Filename, Excel sheet name, etc.
    // Not required if only one table is uploaded.
    source?: string;
    target: { new: NewTableOptions } | { existing: ExistingTableOptions };
  }>;
  parseOptions?: CsvParseOptions | JsonParseOptions;
}

type NewTableOptions = {
  // Desired name of new table, used as page title and default widget title.
  // `tableId` will be derived from this.
  // If omitted, will be derived from `source` filename.
  // This option isn't in the UI, so we don't *need* to offer it now.
  name?: string;
  columns: Array<{
    // Column header name from uploaded table.
    // Replaces `sourceCols: string[]` from before.
    // Only columns mentioned here are imported.
    source: string;
    // id and fields are as in `POST /columns`.
    // Again, we don't *need* to offer these yet,
    // since the UI doesn't allow customizing them.
    id?: string; // can be derived from `label`
    fields?: {
      label?: string; // can be derived from `source`
      // I hesitate to offer these now, since it creates the expectation that
      // they will influence how values are parsed, which isn't currently true.
      type?: string;
      widgetOptions?: string | object;
    };
  }>;
};

type ExistingTableOptions = {
  // tableId or tableRef of table to import into.
  id?: string | number;
  columns: Array<{
    // Column header name from uploaded table.
    source: string;
    // colId or colRef of existing column
    id?: string | number;
    // Only allowed when importing into reference columns.
    // If true, treat values as row IDs
    // rather than looking up the value in the visible column.
    // Equivalent to choosing a source column with " (as row ID)" in the UI.
    rawReferences?: boolean;
  }>;
  // Omit to only add records instead of update
  merge?: {
    // Replaces `mergeCols: string[]` from before.
    targetColumns: Array<string | number>;
    strategy: 'replace-with-nonblank-source' | 'replace-all-fields' | 'replace-blank-fields-only';
  }
};

type CsvParseOptions = {
  lineTerminator?: string;
  fieldSeparator?: string;
  quoteChar?: string;
  doubleQuote?: boolean;
  firstRowIsHeader?: boolean;
  skipLeadingSpace?: boolean;
  encoding?: string;
};

type JsonParseOptions = {
  include?: string[];
  exclude?: string[];
};
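To make the shape concrete, here are two example bodies for POST /docs/:docId/imports (all values are invented: the first imports a single uploaded CSV into a new table, the second does an incremental import from a URL into an existing table):

// Single uploaded CSV into a new table. `source` inside `tables` is omitted
// because only one table was uploaded.
const newTableImport: RequestBody = {
  source: {upload: 4},  // the uploadId returned by the upload endpoint
  tables: [{
    target: {
      new: {
        name: 'People',
        columns: [
          {source: 'Full name'},
          {source: 'Email', fields: {label: 'Email address'}},
        ],
      },
    },
  }],
  parseOptions: {fieldSeparator: ';', firstRowIsHeader: true},
};

// Incremental import from a URL into an existing table, merging on Email.
const incrementalImport: RequestBody = {
  source: {url: 'https://example.com/people.csv'},
  tables: [{
    target: {
      existing: {
        id: 'People',
        columns: [
          {source: 'Full name', id: 'Name'},
          {source: 'Email', id: 'Email'},
        ],
        merge: {
          targetColumns: ['Email'],
          strategy: 'replace-with-nonblank-source',
        },
      },
    },
  }],
};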
Note that there's no option to apply formulas. In general, supporting formulas currently causes a few technical problems. It'd be good to have the option to implement imports in a different way which doesn't use formulas. We can't remove the existing implementation with formulas without risking breaking workflows for people using the UI, but the new API doesn't need to be compatible. A less aggressive/opinionated option is to support them now, and if we add an alternative implementation in the future, the appropriate implementation can be selected automatically based on whether formulas are used. But this will add complexity, and it means that users may use formulas when it's convenient but not necessary. I think it's best to avoid formulas for now and see if users give feedback.
Thanks Alex! :pray:
If we think ahead to the #416 issue, that endpoint would be a great key to its implementation:
POST /docs/:docId/imports with the request body below to actually perform the import.
When we have to synchronize the tables, I guess we will want to reuse the same request body. I wonder whether we could store it server-side for that purpose (with an endpoint somewhat like POST /docs/:docId/imports/:importId).
Maybe this should be addressed when we implement #416, but I would like to know whether I am anticipating the next steps correctly.
Good question. I can see two possibilities:
1. POST /imports with something like save: true in the body means that the settings are saved and an importId gets returned so that POST /imports/:importId repeats the same import. You can only create new saved settings and use existing settings, no other operations.
2. POST /imports/saved creates a new saved settings object, and a bunch of other endpoints starting with the same URL provide other CRUD operations.
Either way, source: { upload: number } is not allowed in these cases.
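Purely as an illustration of the second shape (none of these routes exist yet; every path and behaviour below is an assumption):

// Hypothetical CRUD-style routes for saved import settings (second option):
// POST   /docs/:docId/imports/saved          -> create saved settings, returns an importId
// GET    /docs/:docId/imports/saved/:id      -> read saved settings
// PATCH  /docs/:docId/imports/saved/:id      -> update saved settings
// DELETE /docs/:docId/imports/saved/:id      -> delete saved settings
// POST   /docs/:docId/imports/saved/:id/run  -> perform an import using the saved settings
// In saved settings, `source` would have to be {url: string}, never {upload: number}.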
Good question. I can see two possibilities:
I like the second option; it sounds more RESTful.
Also, to come back to your original post:
POST /docs/:docId/imports/uploadto upload one or more files. The body is in the same format as uploading attachments. It returns a single ‘import ID’, i.e. uploadId in existing code. Note that unlike attachments, only a single ID is returned, even if multiple files are uploaded.
I wonder whether we could also return the column headers through this endpoint. I am not sure about the technical implications of going this way, but if our ambition is to use this API in the long term for the UI as well (assuming we can afford to support formulas, as you mentioned), then it could be very convenient for the wizard step where the user maps the columns.
type NewTableOptions = {
...
columns: Array<{
// Column header name from uploaded table.
// Replaces `sourceCols: string[]` from before.
// Only columns mentioned here are imported.
What about columns: 'all' | Array<{...}>? That way it would be straightforward to programmatically import the document as is, with all the columns.
We could maybe add either of these options in the future. They seem like conveniences that are easy for users to manage without, and I don't think the current proposal gets in the way of adding them, except we should make sure that uploading returns something like {upload: number} rather than just a number.
The complication it adds is that the upload endpoint is just meant to store the file, not parse it. And if it did parse it, it would be guessing how, whereas the final endpoint is where the user can specify details about how to parse the file.
@alexmojaki I think we agree on the implementation strategy; do you expect more feedback from other people?
Thanks for approving it!
It might be nice to also get confirmation from @dsagal but I don't think there's a need to wait.
However tomorrow is my last day at Grist! So I won't be able to implement this myself.
I am on board with the proposal! Thanks!
@dsagal Does it mean that this issue can be assigned? (I can start working on it when I have the bandwidth)
Hi @fflorent, yes, if you're up for taking this on when you have bandwidth, then I'd be happy to assign it to you :pray: - just did so.
I have just started working on it. I have some new questions:
1. POST /docs/:docId/imports/upload ⇒ it looks like the upload mechanism (handleUpload) just does not need a docId, so the docId is just needed for this endpoint to limit the number of requests (using throttled). Should we still require the user to pass a docId?
2. ~~Regarding RequestBody.columns[].source, I am not sure what it should correspond to. Is it the original file name when there are multiple files uploaded?~~ (I am pretty sure the answer is yes, after rereading the accompanying comment.)
Thanks in advance! :pray:
The current upload mechanism relies on a /upload endpoint on the doc-worker that has the current document, and it is the client's responsibility to call it directly on that doc-worker. The proposed API is designed so that it can work with the home server, since docId would be used to find the right doc-worker.
It makes sense, thanks @dsagal!
After having taken a more thorough look at the code and at Alex's comment, I am torn between two options:
- attempt to reuse the ActiveDocImport methods, which would mean having to transform the request body to fit the existing argument types (DataSourceTransformed and ImportOptions), at least as a first step;
- or reimplement the logic elsewhere (in a different class), which would run the Python plugin used for the import.
I see that there are some incompatibilities between the existing and the new interfaces: some fields, like the destination's colId, are optional (the new interface assumes they can be derived from the label), and others are missing, like the destination's table id (for the same reason).
If you already have a strong opinion about the best strategy to follow, I would be glad to hear it. In the meantime I will continue to take a closer look at the existing code :monocle_face:.
I don't have a strong opinion about the best strategy. Alex's comment is a good suggestion but shouldn't be taken as holy writ. If some aspect of it requires a great deal of effort and doesn't lead to a big benefit, maybe change that aspect? Alternatively, could DataSourceTransformed and ImportOptions be extended or reinterpreted a bit to better match the new request body? If the only issue is around conventions about table/col ids, that might be possible.
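As a very rough sketch of that reinterpretation (all details are assumptions: it only handles new-table imports, derives labels from source names, and punts on everything that needs colId resolution), mapping one entry of the new tables array onto the existing TransformRule could look like this:

// Hedged sketch: turn one entry of the proposed `tables` array into the
// existing TransformRule shape used by ActiveDocImport.
function toTransformRule(table: {
  source?: string;
  target: {new: NewTableOptions} | {existing: ExistingTableOptions};
}): TransformRule {
  if ('new' in table.target) {
    const opts = table.target.new;
    return {
      destTableId: NEW_TABLE,
      sourceCols: opts.columns.map(c => c.source),
      destCols: opts.columns.map(c => {
        const widgetOptions = c.fields?.widgetOptions ?? {};
        return {
          label: c.fields?.label ?? c.source,
          colId: null,   // new table: colIds get derived from labels server-side
          type: c.fields?.type ?? 'Any',
          formula: '',   // the new API deliberately exposes no formulas
          widgetOptions: typeof widgetOptions === 'string'
            ? widgetOptions
            : JSON.stringify(widgetOptions),
        };
      }),
    };
  }
  // Importing into an existing table needs tableId/colId resolution, not sketched here.
  throw new Error('not implemented in this sketch');
}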