Mapping files with steps that are not 1-1 with SObjects are unreliable for extraction

prescod opened this issue 5 years ago · 3 comments

Describe the bug

As much as possible, mapping files should be reusable rather than purpose-specific. Mapping files can have multiple steps associated with the same SObject because different steps can load different subsets. There are at least two bugs if you try to use such a mapping file for data extractions.
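
For illustration, a mapping of this shape has two steps that both target the Account sObject (the step names, fields, and record types here are hypothetical, but the structure follows the CumulusCI mapping format):

```yaml
# Hypothetical mapping: two steps load different subsets of Account.
# On load this is fine, but on extract both steps try to create a
# local table named "Account".
Insert Household Accounts:
  sf_object: Account
  table: Account
  record_type: HH_Account
  fields:
    - Name

Insert Organization Accounts:
  sf_object: Account
  table: Account
  record_type: Organization
  fields:
    - Name
```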

Reproduction steps

Steps to reproduce the behavior:

  1. Download this file: data_recipe_with_rd2_mapping.yml.txt

  2. Run this command:

cci task run extract_dataset -o mapping data_recipe_with_rd2_mapping.yml --org qa

Context

CumulusCI version: 3.18.0 (/Users/pprescod/.local/bin/cci)
Python version: 3.8.5 (/Users/pprescod/.local/pipx/venvs/cumulusci/bin/python)

You have the latest version of CumulusCI.

Traceback

File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/cumulusci/tasks/bulkdata/extract.py", line 80, in _run_task
    self._init_db()
File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/cumulusci/tasks/bulkdata/extract.py", line 103, in _init_db
    self._create_tables()
File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/cumulusci/tasks/bulkdata/extract.py", line 301, in _create_tables
    self._create_table(mapping)
File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/cumulusci/tasks/bulkdata/extract.py", line 310, in _create_table
    t = create_table(mapping, self.metadata)
File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/cumulusci/tasks/bulkdata/utils.py", line 109, in create_table
    t = Table(mapping["table"], metadata, *fields)
File "", line 2, in __new__
File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/sqlalchemy/util/deprecations.py", line 139, in warned
    return fn(*args, **kwargs)
File "/Users/pprescod/.local/pipx/venvs/cumulusci/lib/python3.8/site-packages/sqlalchemy/sql/schema.py", line 537, in __new__
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Table 'Account' is already defined for this MetaData instance. Specify 'extend_existing=True' to redefine options and columns on an existing Table object.


prescod avatar Sep 12 '20 16:09 prescod

Do we have a good reason to fix this as such, or can we consider either (a) being more aggressive in deprecating the legacy filters and record_type clauses, or (b) adding validation rules to ensure the mapping file is constructed safely?

It does work if

  • The steps have different table names (table is now inferred if not specified).
  • The steps all specify a record_type (in this case CCI will over-extract for the second Account step because filters aren't applied on extract).
  • The mapping just uses RecordTypeId instead of filtering with the legacy features.
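
Option (b) could be as simple as a pre-flight check that rejects extraction mappings where two steps resolve to the same local table. A minimal sketch, assuming the parsed mapping is a dict of step name to step definition with the `table`/`sf_object` keys from the mapping format (the function itself is hypothetical, not CumulusCI's actual code):

```python
from collections import Counter

def validate_extract_mapping(mapping: dict) -> None:
    """Reject mappings where two steps would create the same local table.

    The local table name defaults to the sObject name when `table`
    is not specified, mirroring the inference behavior noted above.
    """
    tables = [
        step.get("table") or step["sf_object"] for step in mapping.values()
    ]
    duplicates = [name for name, count in Counter(tables).items() if count > 1]
    if duplicates:
        raise ValueError(
            "These tables are used by more than one extraction step: "
            + ", ".join(sorted(duplicates))
        )

# Hypothetical example: two steps both resolve to the "Account" table.
steps = {
    "Insert Households": {"sf_object": "Account", "table": "Account"},
    "Insert Organizations": {"sf_object": "Account"},
}
try:
    validate_extract_mapping(steps)
except ValueError as e:
    print(e)  # a targeted error message instead of a SQLAlchemy traceback
```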

davidmreed avatar Sep 12 '20 18:09 davidmreed

I'm open to various solutions, but the code to do the validation may not be significantly simpler than the code to generate a table with all of the columns that all of the steps need. But yeah, I don't feel strongly about how we deal with it, as long as our messaging is better than a traceback.
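
The alternative sketched here, one local table whose columns are the union of what every step needs, could look roughly like this (a hypothetical helper, not CumulusCI's actual implementation):

```python
def merged_columns(steps: list[dict]) -> dict[str, list[str]]:
    """Union the field lists of all steps that share a table name.

    Returns table name -> ordered column names, preserving first-seen
    order so fields repeated across steps are not duplicated.
    """
    tables: dict[str, list[str]] = {}
    for step in steps:
        columns = tables.setdefault(step["table"], [])
        for field in step["fields"]:
            if field not in columns:
                columns.append(field)
    return tables

# Hypothetical example: two Account steps with overlapping field lists.
steps = [
    {"table": "Account", "fields": ["Name", "AccountNumber"]},
    {"table": "Account", "fields": ["Name", "Description"]},
]
print(merged_columns(steps))
# {'Account': ['Name', 'AccountNumber', 'Description']}
```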

The whole idea of using a "mapping file" which is really an imperative set of steps for both import and export may not be the right abstraction. Exports should be easily parallelized (if we decide to do that) whereas loads need more serialization.

I'm curious whether you know of cases where the customer needs to control the export order, or whether the system is free to do it in any order as long as links are fixed up by the end?

prescod avatar Sep 13 '20 19:09 prescod

I completely agree that extraction steps are/should be idempotent and parallelizable.

There are no possible side effects to changing the extraction order that I'm aware of, although there are multiple relevant governor limits that could bite us on parallelization.

Generally speaking, I really don't like breaking the 1:1 connection between tables and sObjects. I find it harder to reason about, and I think that in most cases it's being used as a patch around other weaknesses in the tooling that I'd rather we fix, like we did with record type support. But I'm open to being shown I'm wrong and that there are real use cases there.

davidmreed avatar Sep 13 '20 21:09 davidmreed