[Multi User] Support separate metadata for each namespace
Part of #1223. Since we are closing that issue, we need a separate one to track this feature.
Supporting separate metadata for each namespace helps us see only the related artifacts/executions.
Currently, MLMD doesn't have a user/namespace concept to isolate metadata per user. A workaround we can move forward with is to aggregate artifacts/executions by the existing experiments and runs in the user's namespace. This would result in a number of MLMD queries, and I am not sure how the performance would be, especially at large scale.
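To make the performance concern concrete, here is a minimal sketch of that workaround using the ml-metadata Python client. It assumes the KFP API has already been used to list the run IDs of one namespace, that each pipeline run has an MLMD context of type `run` named after the run ID, and a placeholder gRPC endpoint; none of these details are guaranteed by the KFP/MLMD contract.

```python
# A minimal sketch of the per-namespace aggregation workaround. The 'run'
# context type, run-id-to-context-name mapping, and the endpoint are
# illustrative assumptions.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080)  # placeholder endpoint
store = metadata_store.MetadataStore(config)


def collect_namespace_metadata(run_ids):
    """Aggregate artifacts/executions for the runs of a single namespace."""
    artifacts, executions = [], []
    for run_id in run_ids:
        # One context lookup plus two listing calls per run -- this is the
        # query fan-out behind the performance concern above.
        ctx = store.get_context_by_type_and_name(type_name='run',
                                                 context_name=run_id)
        if ctx is None:
            continue
        artifacts.extend(store.get_artifacts_by_context(ctx.id))
        executions.extend(store.get_executions_by_context(ctx.id))
    return artifacts, executions
```

Note the per-run fan-out of MLMD calls, which is where the large-scale performance question comes from.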
Thumbs up if this is something you need.
/kind feature
I didn't find an existing issue to track this story. If there is one, please let me know.
I remember there's a very related one in the TFX repo: https://github.com/tensorflow/tfx/issues/2618
I am assuming this one is about supporting multi-tenancy in the k8s-native way -- namespaces -- while that one is more about built-in multi-tenancy support in MLMD itself.
@numerology Yeah, if MLMD can add support for multi-tenancy, that would be great. The Pipelines project can then make the corresponding changes.
> I am assuming this one is about supporting multi-tenancy in the k8s-native way -- namespaces -- while that one is more about built-in multi-tenancy support in MLMD itself.
Yeah, that's true. If MLMD doesn't plan to support it, we can still use the workaround of aggregating metadata at the namespace level.
@Jeffwan @numerology @Bobgy
Let me mention some points related to the artifacts list page that I think can be considered along with this issue. I have not checked the executions page yet.
---
Issue 1: data retrieval in ArtifactList.tsx hangs with a large number of artifacts and can be optimized. The following endpoint call is not necessary (it seems it was added when ml_metadata did not yet return the creation time of the artifact): https://github.com/kubeflow/pipelines/blob/421211087cc7d1daabc3b4e3a3c6082b4b0d8616/frontend/src/pages/ArtifactList.tsx#L211
We implemented this optimization internally at PwC and the page no longer hangs, even with a large number of artifacts.
If we put our optimization upstream, there are a couple of options:
Option 1: Do nothing else, that is, keep pagination disabled and keep filtering and sorting client-side until MLMD supports server-side filtering (with predicates) and sorting. The drawback is that although artifacts will be rendered, it can be slow depending on the number of artifacts in MLMD. A test with 35,000 artifacts in MLMD took ~50 seconds to load all artifacts and ~15 seconds to filter/sort.
Option 2: Enable pagination using the server-side feature in MLMD, which is already available. We tried this internally and paginated data is rendered almost immediately. However, sort/filter in the frontend client acts on data fetched into memory, and only the portion corresponding to the current page is in memory. A solution where ArtifactList uses pagination, sorting, and filtering on the MLMD side is, I think, blocked until MLMD supports filtering with predicates and more flexible sorting (sorting in MLMD is limited to creation time / update time / id).
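For reference, here is a small sketch of the server-side pagination and ordering that Option 2 relies on, shown with the ml-metadata Python client (the frontend would send the equivalent ListOperationOptions over gRPC-web). The service host/port are placeholders.

```python
# Sketch of MLMD's server-side pagination/ordering (what Option 2 relies on),
# using the ml-metadata Python client; host/port are placeholders.
import ml_metadata as mlmd
from ml_metadata.proto import metadata_store_pb2

store = mlmd.MetadataStore(metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080))

# Ask the server for the newest 20 artifacts, ordered by creation time.
# Sorting is limited to creation time / update time / id, as noted above.
first_page = store.get_artifacts(list_options=mlmd.ListOptions(
    limit=20,
    order_by=mlmd.OrderByField.CREATE_TIME,
    is_asc=False))

for artifact in first_page:
    # create_time_since_epoch is populated by MLMD, so the extra per-artifact
    # call mentioned in Issue 1 is not needed just to get timestamps.
    print(artifact.id, artifact.create_time_since_epoch)
```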
---
Issue 2: There is a ParentContext feature available in MLMD v1.0.0. Have you looked into this? Maybe it can be used to achieve separation per namespace.
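To illustrate the ParentContext idea, here is a hedged sketch: create a per-namespace context and make it the parent of each run's context, so a namespace's runs can be listed through the parent/child links. The `namespace`/`run` type names, the run-context lookup, and the endpoint are assumptions for illustration, not anything KFP does today.

```python
# Hypothetical use of MLMD ParentContext (available since v1.0.0) for
# per-namespace separation: a 'namespace' context is made the parent of each
# run context. Type names and the run lookup are illustrative assumptions.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

store = metadata_store.MetadataStore(metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080))

# Register (or reuse) a 'namespace' context type and one context per namespace.
ns_type_id = store.put_context_type(metadata_store_pb2.ContextType(name='namespace'))
ns_ctx = metadata_store_pb2.Context(type_id=ns_type_id, name='team-a')
ns_ctx.id = store.put_contexts([ns_ctx])[0]

# Link an existing run context under the namespace context.
run_ctx = store.get_context_by_type_and_name('run', 'some-run-id')  # assumed to exist
store.put_parent_contexts([metadata_store_pb2.ParentContext(
    child_id=run_ctx.id, parent_id=ns_ctx.id)])

# Per-namespace view: all run contexts that belong to 'team-a'.
runs_in_team_a = store.get_children_contexts_by_context(ns_ctx.id)
```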
I'd appreciate your thoughts and comments.
CC/ @maganaluis
Server-side filtering is available in ml-metadata 1.2.0! https://github.com/google/ml-metadata/blob/839308f502f97299ce5ee02852ca86e702211386/ml_metadata/proto/metadata_store.proto#L852
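A small sketch of what that server-side filtering could look like for the namespace case, using the ml-metadata Python client. The filter-query string follows MLMD's documented filtering syntax for attributed contexts, and the `namespace` context type / `team-a` name (and the endpoint) are assumptions for illustration.

```python
# Sketch of server-side filtering (ml-metadata >= 1.2.0): fetch only artifacts
# attributed to a 'team-a' namespace context. The context-alias syntax follows
# MLMD's filter-query docs; the type/name values are illustrative assumptions.
import ml_metadata as mlmd
from ml_metadata.proto import metadata_store_pb2

store = mlmd.MetadataStore(metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080))

team_a_artifacts = store.get_artifacts(list_options=mlmd.ListOptions(
    filter_query="contexts_a.type = 'namespace' AND contexts_a.name = 'team-a'"))
```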
For anyone interested: for your namespace separation requirements, do you want the metadata DB to be
1. one instance per namespace, or
2. a shared instance per cluster that uses a namespace context to filter metadata for a certain namespace?

With 1, we can build access control using Istio Authorization. With 2, IIUC, Istio Authorization would need to parse the requests and understand which namespace is being queried. That's probably not possible right now, given the requests are gRPC, not HTTP.
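For option 2, here is a hedged sketch of the write side: in a shared MLMD instance, each newly recorded artifact/execution would be attributed/associated to a per-namespace context so that reads can be scoped to it. The namespace context, the id lists, and the endpoint are assumptions for illustration.

```python
# Sketch of option 2's write path in a shared MLMD instance: tag new artifacts
# and executions with a per-namespace context so reads can be scoped to it.
# The namespace context id and the id lists are illustrative inputs.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

store = metadata_store.MetadataStore(metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080))


def tag_with_namespace(namespace_ctx_id, artifact_ids, execution_ids):
    attributions = [metadata_store_pb2.Attribution(context_id=namespace_ctx_id,
                                                   artifact_id=a)
                    for a in artifact_ids]
    associations = [metadata_store_pb2.Association(context_id=namespace_ctx_id,
                                                   execution_id=e)
                    for e in execution_ids]
    store.put_attributions_and_associations(attributions, associations)


# Read path for the same option: everything recorded under one namespace.
# artifacts_in_namespace = store.get_artifacts_by_context(namespace_ctx_id)
```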
@bobgy is there any progress, or has a decision been made, on this issue?
> For anyone interested: for your namespace separation requirements, do you want the metadata DB to be
> 1. one instance per namespace, or
> 2. a shared instance per cluster that uses a namespace context to filter metadata for a certain namespace?
>
> With 1, we can build access control using Istio Authorization. With 2, IIUC, Istio Authorization would need to parse the requests and understand which namespace is being queried. That's probably not possible right now, given the requests are gRPC, not HTTP.
@chensun @zijianjoy
Maybe we should use a proxy, as is done for katib-mysql and katib-db-manager. https://github.com/google/ml-metadata/issues/141 suggests that we just have to add a namespace/profile/user column and filter by it.
@bobgy @zijianjoy Istio should support gRPC filtering now: https://github.com/istio/istio/issues/25193#issuecomment-653554042
@ca-scribner would you be interested in implementing this Envoy filter? I am still busy with the MinIO stuff.
Following up on this item:
I am leaning towards creating one MLMD instance per namespace. This is because we should consider the data lifecycle of the MLMD information: when a namespace is deleted, we should have a way to easily clean up the data related to that namespace. This is not easy today with a single MLMD instance, because the delete operation is not supported by design: https://github.com/google/ml-metadata/issues/38. Thus, starting with the separation from the beginning is my current preference.
That said, I am aware that one MLMD instance per namespace probably means resource-usage overhead for a cluster with many namespaces, so we should consider using something like the Horizontal Pod Autoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/. The problem we are facing is similar to the artifact-ui scalability problem as well: https://github.com/kubeflow/pipelines/issues/9555
@zijianjoy amazing that this is being tackled. We need this anyway as a CNCF graduation requirement: the CNCF will do a security assessment, and this is a clear security violation.
I think the per-namespace artifact visualization server should be removed anyway, since it is deprecated, and the artifact proxy is obsolete as well, as explained here: https://github.com/kubeflow/pipelines/issues/9555#issuecomment-1643559018.
That means you can already have zero-overhead namespaces today if you drop the old garbage. I know of Kubeflow installations with several hundred namespaces, so this is a real problem customers are facing. I can create a PR to make that the default and fix the security issue I found a few years ago, with code from @thesuperzapper: https://github.com/kubeflow/pipelines/issues/8406#issuecomment-1640918121
In the long term I would propose switching to MLflow, since that seems to be the industry standard, but if that is not possible due to Google policies, we should consider something with a minimal footprint, maybe Knative serverless per namespace. Nevertheless, for the time being I still prefer a single MLMD instance, to keep supporting zero-overhead Kubeflow namespaces and to find a proper long-term solution, which is not MLMD.
> Following up on this item:
> I am leaning towards creating one MLMD instance per namespace. This is because we should consider the data lifecycle of the MLMD information: when a namespace is deleted, we should have a way to easily clean up the data related to that namespace. This is not easy today with a single MLMD instance, because the delete operation is not supported by design: google/ml-metadata#38. Thus, starting with the separation from the beginning is my current preference.
> That said, I am aware that one MLMD instance per namespace probably means resource-usage overhead for a cluster with many namespaces, so we should consider using something like the Horizontal Pod Autoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/. The problem we are facing is similar to the artifact-ui scalability problem as well: #9555
One MLMD instance per namespace is bad from a governance perspective. What if I want to track all assets produced by the company, like a catalog? That would require querying multiple MLMD API servers. There should instead be a way to prevent unwanted access through Istio, giving us a solution that does not depend on the MLMD developers to implement it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen