Process never finishes syncing with remote MLFlow URI
I have set up an MLFlow tracking server running on AWS EKS. It uses S3 as the artifact store and AuroraDB for metadata. I can't, for the life of me, get aimlflow to work.
I have created mounted volumes with the appropriate AWS credentials so that Aim can communicate with S3, but after apparently downloading all the related artifacts, the `aimlflow sync` process hangs with:
"Starting watcher on 'https://my-address'"
Maybe some boto3 credentials are needed in `watcher.py`? I'm not sure. Anyway, I really wish you guys the best; this is what made me give up on the product.
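For what it's worth, boto3 also resolves the standard AWS environment variables, so handing credentials to the sync process like this (placeholder values) should be equivalent to the mounted files:

```shell
# standard AWS credential variables that boto3 picks up automatically;
# all values below are placeholders
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1   # assumption: use your bucket's region
```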
@rodrigomeireles thanks for the report. Actually, the `aimlflow sync` command is supposed to keep running even after it finishes syncing the current logs (it's not hanging), because it's meant to sync upcoming changes as well, in live mode. If you don't want that, you can just stop the process with Ctrl-C. Is there anything else wrong with the command?
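For reference, a typical invocation looks like this (the `/aim` repo path is a placeholder); the sync keeps watching for new MLflow runs until you interrupt it:

```shell
cd /aim && aim init   # one-time Aim repo setup
aimlflow sync --mlflow-tracking-uri=https://my-address --aim-repo=/aim
# once the initial backfill is done, Ctrl-C here if you don't need
# continuous live syncing
```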
@mihran113 thank you for the quick response. Well, if it's intended behaviour, then I don't see how I should structure my Dockerfile.
I want to deploy two services to the same namespace. One is already deployed and serves the MLFlow server. The other one, I thought, would be the aimlflow pod, which would sync the MLFlow experiments and serve the Aim UI so everyone in my company could access it.
Following the above plan, my aimlflow image pip-installs aimlflow, runs `aim init`, and syncs... but if the sync never ends, how am I supposed to serve the UI?
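Concretely, my entrypoint is roughly this (paths and variable names are placeholders), and execution never reaches `aim up` because the sync blocks:

```shell
#!/bin/sh
cd /aim && aim init
# the live sync blocks forever, so the next line is never reached
aimlflow sync --mlflow-tracking-uri="$MLFLOW_TRACKING_URI" --aim-repo=/aim
aim up --repo=/aim --host=0.0.0.0
```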
@rodrigomeireles Thank you for bringing up the issue. For the aimlflow service, have you connected it to a persistent volume (i.e. made it a stateful service)? I was wondering what happens if the pod/container dies: when it restarts, the service must re-authenticate with the database service and regain access to the experiments and artifacts.
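For illustration, a minimal sketch of what that could look like (all names, the image, and the mount path are placeholders, not taken from your setup):

```yaml
# sketch: back the aimlflow pod with a PersistentVolumeClaim so the synced
# Aim repo survives pod restarts; names and sizes are arbitrary
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aim-repo-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aimlflow
spec:
  replicas: 1
  selector:
    matchLabels: {app: aimlflow}
  template:
    metadata:
      labels: {app: aimlflow}
    spec:
      containers:
        - name: aimlflow
          image: my-registry/aimlflow:latest   # placeholder image
          volumeMounts:
            - name: aim-repo
              mountPath: /aim                  # where the Aim repo lives
      volumes:
        - name: aim-repo
          persistentVolumeClaim:
            claimName: aim-repo-pvc
```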
@rodrigomeireles you can look at this page for a solution.
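One pattern that resolves the Dockerfile question is to run the sync as a background process and keep the UI server in the foreground; a minimal entrypoint sketch (the `/aim` path is a placeholder; 43800 is Aim's default UI port):

```shell
#!/bin/sh
# skip init if the repo already exists (Aim keeps its data under .aim)
[ -d /aim/.aim ] || (cd /aim && aim init)
# live sync in the background, UI server in the foreground so the
# container stays alive and keeps serving the port
aimlflow sync --mlflow-tracking-uri="$MLFLOW_TRACKING_URI" --aim-repo=/aim &
exec aim up --repo=/aim --host=0.0.0.0 --port=43800
```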