marquez
Update connectionUrl documentation in Datasource to make it clear that Marquez does not access the data
Hi everybody! Amazing project, but isn't this a security risk? Why exactly does Marquez need access to the source?
Hi @dmvieira,
Marquez does not need access to the source. connectionUrl is there for reference, as an identifier of the datasource the dataset comes from. This URL is not supposed to contain credentials.
@julienledem: Agreed, though it can be a security concern as the username / password might be present in the connection URL. And, currently, the backend doesn't strip or redact the credentials if they're present.
Should we open an issue to strip out credentials from the connection URL if present?
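To make the proposal concrete, here is a minimal sketch of what stripping credentials could look like. It is written in Python for illustration only and is not Marquez code: `strip_credentials` is a hypothetical helper, and it assumes credentials appear as `user:password@` in the URL authority. Credentials passed as query parameters (for example `?user=...&password=...`) are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(connection_url: str) -> str:
    """Return connection_url with any "user:password@" userinfo removed.

    Hypothetical helper for illustration; not part of Marquez.
    """
    # JDBC URLs carry a "jdbc:" prefix that urlsplit would treat as the
    # scheme, so set it aside and parse the remainder.
    prefix = ""
    if connection_url.startswith("jdbc:"):
        prefix = "jdbc:"
        connection_url = connection_url[len(prefix):]

    parts = urlsplit(connection_url)
    # netloc looks like "user:password@host:port"; keep only what follows
    # the last "@". If there is no "@", the netloc is returned unchanged.
    host_and_port = parts.netloc.rpartition("@")[2]
    return prefix + urlunsplit(parts._replace(netloc=host_and_port))
```

A URL without credentials passes through unchanged, so a helper like this could be applied to every incoming connectionUrl before it is stored.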
No? I was just reading the API documentation.
Why not change this name to "sourceIdentifier" or something like that?
Yeah, I think we can word this a bit better. The URL can, technically, still be used to connect to the source, but the credentials wouldn't be stored in Marquez. For example, if I wanted to verify that an upstream job I depend on successfully inserted all rows into a table, I could get the connection URL from Marquez, then verify the row count myself (assuming I had the credentials handy). The same idea applies to jobs: a job can call out to Marquez programmatically to get the source info for a dataset it needs to read as input, assuming the credentials are present on the server running the job.
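As a rough sketch of that workflow, a job could take the credential-free URL it got from Marquez and combine it with credentials it already holds locally. The Python below is illustrative only: `with_local_credentials` is a hypothetical helper, and how the job obtains the URL from Marquez and where it keeps its credentials are left out as assumptions.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_local_credentials(connection_url: str, username: str, password: str) -> str:
    """Attach locally held credentials to a credential-free connection URL.

    Hypothetical helper for illustration; not part of Marquez.
    """
    parts = urlsplit(connection_url)
    # Percent-encode the credentials so characters such as "@" or ":" in a
    # password cannot be mistaken for URL delimiters.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{parts.netloc}"))
```

The resulting URL would only ever exist on the machine running the job; Marquez itself keeps the version without credentials.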
The source name is how Marquez uniquely identifies the source, but maybe just renaming connectionUrl to url would clarify its usage? (Or just updating the description of connectionUrl in the API docs?)
I really don't understand why the source name alone is not enough... If I know my service, I know its credentials and connection details.
I was looking at other issues like https://github.com/MarquezProject/marquez/issues/698 and wondering how connectionUrl applies to a filesystem, Cassandra, a Hadoop Spark job, or other distributed resources.
If I understand correctly, you want to map metadata describing the source, not act as a credentials store. What do you think about allowing more flexible metadata instead of making connectionUrl mandatory?