
Update connectionUrl documentation in Datasource to make it clear that Marquez does not access the data

Open dmvieira opened this issue 3 years ago • 6 comments

Hi everybody! Amazing project, but isn't this a security risk? Why exactly does Marquez need access to the source?

dmvieira avatar Dec 07 '20 23:12 dmvieira

Hi @dmvieira, Marquez does not need access to the source. connectionUrl is there for reference only, as an identifier of the datasource the dataset comes from. This URL is not supposed to contain credentials.

julienledem avatar Dec 08 '20 00:12 julienledem

This URL is not supposed to contain credentials.

@julienledem: Agreed, though it can be a security concern as the username / password might be present in the connection URL. And, currently, the backend doesn't strip or redact the credentials if they're present.

Should we open an issue to strip out credentials from the connection URL if present?
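As a rough illustration of what stripping could involve, here is a minimal Python sketch. It is not how the Marquez backend behaves today (the backend currently does no redaction), and the function name `strip_credentials` is hypothetical. It assumes credentials appear in the `user:password@host` part of the URL, optionally behind a `jdbc:` prefix.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(connection_url: str) -> str:
    """Return connection_url without any user:password@ section.

    Hypothetical helper for illustration only; not part of Marquez.
    """
    # JDBC URLs look like "jdbc:postgresql://host/db"; urlsplit would treat
    # "jdbc" as the scheme and miss the host, so peel the prefix off first.
    prefix = "jdbc:" if connection_url.startswith("jdbc:") else ""
    parts = urlsplit(connection_url[len(prefix):])

    # Nothing to redact: return the input untouched.
    if parts.username is None and parts.password is None:
        return connection_url

    # Rebuild the network location from host and port only.
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal needs its brackets back
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return prefix + urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    print(strip_credentials("jdbc:postgresql://user:secret@localhost:5432/mydb"))
```

Note that this sketch does not cover credentials passed as query parameters (for example `?user=...&password=...`), which some JDBC drivers accept, and that `hostname` is returned in lowercase.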

wslulciuc avatar Dec 08 '20 00:12 wslulciuc

No? I was just reading the API documentation (screenshot: Screenshot_20201207-215031-046).

Why not change this name to "sourceIdentifier" or something like that?

dmvieira avatar Dec 08 '20 00:12 dmvieira

Yeah, I think we can word this a bit better. The URL can, technically, still be used to connect to the source, but the credentials wouldn't be stored in Marquez. For example, if I wanted to verify that an upstream job I depend on successfully inserted all rows into a table, I could get the connection URL from Marquez, then verify the row count myself (assuming I had the credentials handy). The same idea applies to jobs: a job can call out to Marquez programmatically to get the source info for a dataset it needs to read as input, on the assumption that the credentials are already present on the server running the job.
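A minimal Python sketch of that lookup follows. The `/api/v1/sources/{name}` path and the `connectionUrl` response field are assumptions that should be checked against the Marquez API docs, and the helper names are hypothetical. Credentials are deliberately not involved: the job is expected to supply them from its own environment.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def source_endpoint(base_url: str, source_name: str) -> str:
    """Build the URL of the source resource (path is an assumption)."""
    return f"{base_url.rstrip('/')}/api/v1/sources/{quote(source_name, safe='')}"


def fetch_connection_url(base_url: str, source_name: str) -> str:
    """Ask Marquez where a source lives; returns the stored connection URL."""
    with urlopen(source_endpoint(base_url, source_name)) as resp:
        # Assumes the response is a JSON object with a "connectionUrl" field.
        return json.load(resp)["connectionUrl"]


if __name__ == "__main__":
    # Only builds and prints the request URL; no network call is made here.
    print(source_endpoint("http://localhost:5000", "my-source"))
```

The job would then combine the returned URL with credentials it already holds, so Marquez never has to store them.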

wslulciuc avatar Dec 08 '20 01:12 wslulciuc

Why not change this name to "sourceIdentifier" or something like that?

The source name is how Marquez uniquely identifies the source, but maybe just renaming connectionUrl to url would clarify its usage? (Or just updating the description of connectionUrl in the API docs?)

wslulciuc avatar Dec 08 '20 01:12 wslulciuc

I really don't understand why the source name alone is not enough... If I know my service, I know its credentials and connection details.

I was looking at other issues like https://github.com/MarquezProject/marquez/issues/698 and wondering how connectionUrl applies to a filesystem, Cassandra, a Hadoop or Spark job, or other distributed resources.

If I understand correctly, what you actually want is to record metadata describing the source, not to act as a credentials store. What do you think about allowing more flexible metadata instead of making connectionUrl mandatory?

dmvieira avatar Dec 08 '20 15:12 dmvieira