adlsguidancedoc
adlsguidancedoc copied to clipboard
Access to data
This section reads like documentation of ADLS Gen 2 and does not discuss access scenarios for the data itself. It needs to include the possibility of using no access control on the filesystem as the default option, and then explain scenarios when you might need to add access control. In my experience I have yet to see a good scenario where ACLs are a real requirement at the lake level.
My 50cent on the "access to data": I have created a concept called "the DataBank", which is designed for ADLS Gen1 & 2. It relies solely on using ACL's in the lake(s). The idea being that I want to have the data stored without any functionality, creating a possibility for my customers to use what technology they find appropriate in front of their data. This makes it possible for the data to be highly controlled. Working in public sector here Denmark we have rolled out this concept it is working. Some of my customers actually uses a service principal approach securing that a given product/services is the only "thing" allowed to get the data.
The problem there is that it adds massive complexity very quickly, along with all of the administration that entails. If you allow no humans access to the data then the data is even more controlled, and you are then in control of the access methods as well, meaning you are also able to prevent data egress. If end users have permissions on the actual data then you have no control at all over data egress because they can use their own access methods.
My gut feel is that there is no end to the moment we go down the path of indirect access via compute(clusters) to implement access control. Why don't i use something available natively than implementing indirect control? Why is doing the indirect control is the only way to do it, it might be an option but honestly i have never heard any customer ask for it.
The "indirect" option as you call it was the default until very very recently when we added support for fine grained access control. There are numerous issues with ACLs and scaling/admin. The number of users accessing a data lake is generally very small so it makes sense to use the more finessed controls in other systems and then control their access to the lake. Data lakes, by their nature, are not fine grained things so course grained access is generally a better first pass approach IMHO, with judicious use of ACLs being added only if/when needed. As I said above, direct ACLs open more control issues than they solve since they offer no way to prevent data egress by bad actors.
When we have support for it, why wouldn't we want to use it and go back to old way of doing things? Probably i am misrepresenting the issues you are thinking. Can you please list down the issues we see and if it comes to bad actors, that can be case with compute clusters through they are accessing too and i am worried that in that case they are misusing both storage and compute on top of it.
@jcordtz - would love to hear more about the data bank principle and if you are interested in contributing to the doc, that is very welcome as well. :)
@Davedoesdemos and @trinadhkotturu - good discussion. Good point on indirect access, we have definitely seen that pattern and would be good to add to the doc. If you are interested in contributing that is very welcome as well. :) I think it would help for the customer to think about access to the storage as well as access through other indirect mechanisms, taking a default position of no access to the data lake is an interesting perspective, we can add it if we see more evidence of that. We have seen patterns where there are workspaces provisioned to the data lake customers to bring their own data sets and do their analysis.

We must absolutely not advocate access to data using shared account keys (if this is what you mean). This is not meant for general audience and should be restricted for administrative tasks. Better solutions such as account-specific keys/service keys with restrictive storage access policies are good examples of a well constructed SAS tokens. Just calling out SAS tokens for non-AAD access may not be sufficient.
Also, can we have illustrations where it would be ok to have guest users created in AAD and allow them access to storage accounts?
Access data - can we have clarity on how the access should be granted to a different team and managed in due course of time
DLM - may be a case of para-phrasing. We do not have any "cooler" tier in SA. The data lands on Cool/Hot basis use case and ends in cool/archive as per the corporate DLM/Infosec policies
Agree with Raunak here, we need to be very specific and consistent on our language in how we explain this so as not to imply shared keys at the account level should be used for this but SAS keys can. My point was one of course vs fine access :)
Agree with Raunak on DLM - wording should be move to cool or archive tiers rather than cooler just to be explicit.