marquez
[bug] creating a job with an output dataset doesn't store the output dataset
When we hit the endpoint for creating a job and pass an output dataset (existing or not) in the body, it isn't stored; the response from that call has an empty "outputs" field.
JobResource.createOrUpdate()
-> JobDao.upsertJobMeta()
Here we store only the inputs; the outputs are used only when we create a new version of a dataset.
Example:
PUT http://localhost:5000/api/v1/namespaces/postgres%3A%2F%2Flocalhost%3A6432/jobs/dvdrental.public.actor_info
Content-Type: application/json
{
  "type": "BATCH",
  "inputs": [],
  "outputs": [
    {
      "namespace": "postgres://localhost:6432",
      "name": "dvdrental.public.actor_info"
    }
  ]
}
@wslulciuc, can you please review it and close it if it's an invalid case?
Given this is part of the legacy write APIs, can we just move forward with deprecating the old APIs?
@collado-mike: We won't be deprecating the Dataset or Job APIs, but rather just the Runs API in favor of OpenLineage events to capture run-level metadata. We'll be working on updating the usage of each API to avoid any confusion. So, this issue is still relevant.
@wslulciuc We're also having the same issue. Is there a way to persist the output datasets with any APIs at all? Deprecated or not?
EDIT: The Marquez client doesn't seem to have a direct way of creating a lineage event. This looks like it would solve the OP's issue.
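Since the comments point to OpenLineage events as the supported path for run-level metadata, a workaround is to POST a run event to Marquez's `/api/v1/lineage` endpoint rather than the legacy Job API. A minimal sketch, assuming a local Marquez on port 5000; the builder function is hypothetical and the field values are illustrative:

```python
"""Sketch: persist output datasets via an OpenLineage run event."""
import uuid
from datetime import datetime, timezone


def build_run_event(job_ns, job_name, outputs, event_type="COMPLETE"):
    """Minimal OpenLineage run event carrying output datasets.

    `outputs` is a list of (namespace, name) dataset pairs.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_ns, "name": job_name},
        "inputs": [],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Producer URI identifying the system emitting the event (illustrative).
        "producer": "https://example.com/my-pipeline",
    }


def post_event(event, base="http://localhost:5000"):
    """POST to Marquez's OpenLineage endpoint (needs the `requests` package)."""
    import requests
    return requests.post(f"{base}/api/v1/lineage", json=event)
```

After a COMPLETE event is received, Marquez materializes the job's output datasets, which is what the legacy PUT in this issue fails to do.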