
[bug] creating a job with an output dataset doesn't store the output dataset

Open OleksandrDvornik opened this issue 3 years ago • 4 comments

When we hit the endpoint for creating a job and pass an output dataset in the body (existing or not), it isn't stored; the response from that call has an empty "outputs" field.

In JobResource.createOrUpdate() -> JobDao.upsertJobMeta, only the inputs are stored.

The outputs are used only when we create a new version of a dataset.

Example:

PUT http://localhost:5000/api/v1/namespaces/postgres%3A%2F%2Flocalhost%3A6432/jobs/dvdrental.public.actor_info
Content-Type: application/json

{
  "type": "BATCH",
  "inputs": [],
  "outputs": [{
    "namespace": "postgres://localhost:6432",
    "name": "dvdrental.public.actor_info"
  }]
}

OleksandrDvornik avatar Oct 06 '21 09:10 OleksandrDvornik

@wslulciuc can you please review this and close it if it's an invalid case?

OleksandrDvornik avatar Oct 06 '21 09:10 OleksandrDvornik

Given this is part of the legacy write APIs, can we just move forward with deprecating the old APIs?

collado-mike avatar Oct 11 '21 21:10 collado-mike

@collado-mike: We won't be deprecating the Dataset or Job APIs, but rather just the Runs API in favor of OpenLineage events to capture run-level metadata. We'll be working on updating the usage of each API to avoid any confusion. So, this issue is still relevant.

wslulciuc avatar Oct 11 '21 22:10 wslulciuc

@wslulciuc We're also having the same issue. Is there any API at all that persists output datasets, deprecated or not?

EDIT: The Marquez Client doesn't seem to have a direct way of creating a lineage event. This looks like it would solve the OP's issue.

kovaciad avatar Nov 08 '21 21:11 kovaciad
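As a sketch of the workaround discussed in the comments above: instead of the legacy job API, the output dataset can be persisted by sending an OpenLineage run event to Marquez's lineage endpoint (POST /api/v1/lineage). The payload below is a minimal sketch assuming the namespace and dataset names from the example request in this issue; the `producer` URI is hypothetical, and field names follow the OpenLineage event spec.

```python
# Sketch: building an OpenLineage RunEvent that declares an output dataset.
# Sending it to Marquez (POST /api/v1/lineage) should register the output,
# unlike the legacy PUT /jobs endpoint described in this issue.
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {
        "namespace": "postgres://localhost:6432",
        "name": "dvdrental.public.actor_info",
    },
    "inputs": [],
    "outputs": [{
        "namespace": "postgres://localhost:6432",
        "name": "dvdrental.public.actor_info",
    }],
    # Hypothetical producer URI identifying the tool emitting the event.
    "producer": "https://example.com/my-pipeline",
}

payload = json.dumps(event)
# To actually send it (requires the `requests` package and a running Marquez):
# requests.post("http://localhost:5000/api/v1/lineage", data=payload,
#               headers={"Content-Type": "application/json"})
print(payload)
```

This sidesteps the bug because the lineage path versions the job and its datasets from the event's inputs and outputs, rather than going through JobDao.upsertJobMeta.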