cf-for-k8s

Unable to deploy Stratos "unable to open database file"

Open braunsonm opened this issue 4 years ago • 22 comments

Describe the bug

On a newly deployed cluster on AKS I am unable to deploy Stratos. I don't think this is an issue with the Stratos team so I'm reporting this here.

Once you enable docker support on the cluster and push stratos, it starts up but the logs are full of errors and it is unreachable because the database cannot be created.

To Reproduce

Follow the steps on the Stratos repository for installing as a CF application using the docker image.

Then run cf logs console. You will see the following:

   2020-10-02T16:14:50.63-0400 [APP/PROC/WEB/4e6d3960-affd-41a8-9c60-877732502041] OUT INFO[Fri Oct  2 20:14:50 UTC 2020] Waiting for database to be responsive: Unable to ping the database: unable to open database file
   2020-10-02T16:14:51.63-0400 [APP/PROC/WEB/4e6d3960-affd-41a8-9c60-877732502041] OUT INFO[Fri Oct  2 20:14:51 UTC 2020] Waiting for database to be responsive: Unable to ping the database: unable to open database file
   2020-10-02T16:14:52.63-0400 [APP/PROC/WEB/4e6d3960-affd-41a8-9c60-877732502041] OUT INFO[Fri Oct  2 20:14:52 UTC 2020] Waiting for database to be responsive: Unable to ping the database: unable to open database file
   2020-10-02T16:14:53.63-0400 [APP/PROC/WEB/4e6d3960-affd-41a8-9c60-877732502041] OUT INFO[Fri Oct  2 20:14:53 UTC 2020] Waiting for database to be responsive: Unable to ping the database: unable to open database file

Expected behavior

Stratos should start up normally.

Additional context

Cluster information

AKS

CLI versions

➜  cf-for-k8s git:(master) ✗ ytt --version
ytt version 0.30.0
➜  cf-for-k8s git:(master) ✗ kapp --version
kapp version 0.34.0

Succeeded
➜  cf-for-k8s git:(master) ✗ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-14T00:06:38Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
➜  cf-for-k8s git:(master) ✗ cf version
cf version 7.1.0+4c3168f9a.2020-09-09

braunsonm avatar Oct 02 '20 20:10 braunsonm

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175102948

The labels on this github issue will be updated when the story is started.

cf-gitbot avatar Oct 02 '20 20:10 cf-gitbot

We've seen problems with docker apps starting faster than the envoy sidecars and hence unable to reach external systems like a database. Is this possibly another instance of that problem?

loewenstein avatar Oct 03 '20 18:10 loewenstein

Not sure @loewenstein I'm not familiar enough to be able to tell if that's what happened

braunsonm avatar Oct 03 '20 18:10 braunsonm

from @ericpromislow

Hi @braunsonm,

The first problem is stratos needs to be configured with a database. Or you can use the all-in-one Stratos distro, but I couldn't get it to work. If you have a look:

cd stratos
docker build . -f deploy/Dockerfile.all-in-one -t YOURDOCKERNAME/stratos-aio:v1
docker push YOURDOCKERNAME/stratos-aio:v1
cf enable-feature-flag diego_docker
cf push console -o YOURDOCKERNAME/stratos-aio:v1 --no-manifest
curl -kL console.DOMAIN

Please let us know the outcome of this, successful or not.

jamespollard8 avatar Oct 14 '20 18:10 jamespollard8

@jamespollard8 I'm not sure that's really a fix. More of a work around. It should work without the AIO package as it just uses a local sqlite DB doesn't it?

braunsonm avatar Oct 14 '20 21:10 braunsonm

@jamespollard8 I'm not sure that's really a fix. More of a work around. It should work without the AIO package as it just uses a local sqlite DB doesn't it?

👍 @ericpromislow was the one taking a look at this - I'll let him respond

jamespollard8 avatar Oct 15 '20 00:10 jamespollard8

@jamespollard8 @ericpromislow This does seem to be an issue on AKS where the volume isn't ready when the container starts up. It deploys normally on minikube, with one exception.

It seems that the docker support in CF does not respect which ports are EXPOSE'd when creating the service. When deploying this in minikube it starts up, but the service targets port 8080 in the pod, whereas Stratos exposes port 5443. You can fix this by running a patch like so:

kubectl patch svc s-2fe8eb28-b78d-4c84-8811-1d7878c21610 --patch \
  '{"spec": { "type": "ClusterIP", "ports": [ { "port": 8080, "targetPort": 5443, "protocol": "TCP", "name": "http" } ] } }'
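For anyone scripting this workaround, here is a minimal Python sketch that builds the same patch body for an arbitrary service. The function names and the `s-example` service name are hypothetical, chosen only for illustration; the sketch only assembles the `kubectl patch svc` arguments shown above and does not run them.

```python
import json


def build_service_port_patch(service_port: int, target_port: int) -> str:
    """Return the JSON patch body that maps a Service port to a container port.

    Mirrors the hand-written patch above: ClusterIP service, one TCP port
    named "http".
    """
    patch = {
        "spec": {
            "type": "ClusterIP",
            "ports": [
                {
                    "port": service_port,
                    "targetPort": target_port,
                    "protocol": "TCP",
                    "name": "http",
                }
            ],
        }
    }
    return json.dumps(patch)


def kubectl_patch_args(service_name: str, service_port: int, target_port: int) -> list:
    """Return the argument list for `kubectl patch svc` (hypothetical helper).

    The list can be passed to subprocess.run() on a machine with cluster access.
    """
    return [
        "kubectl",
        "patch",
        "svc",
        service_name,
        "--patch",
        build_service_port_patch(service_port, target_port),
    ]


if __name__ == "__main__":
    # "s-example" is a placeholder; use the service name from `kubectl get svc`.
    print(" ".join(kubectl_patch_args("s-example", 8080, 5443)))
```

Building the body with `json.dumps` avoids the shell quoting mistakes that are easy to make when editing the inline JSON by hand.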

braunsonm avatar Oct 20 '20 14:10 braunsonm

@braunsonm Very interesting. I wonder why it doesn't respect the port expose commands? This might be something we can dig into with the associated component teams.

Another item of note is that we recently changed the Eirini version we are using, so you may run into issues pushing a docker image based app if it does not specify a UID. Eirini 1.9.0 would only run images as UID 2000, so we are currently running a patched version that allows UIDs that are non-root. Consequently, images that do not specify a UID will try to run as root and the k8s security context will disallow scheduling.

Birdrock avatar Oct 20 '20 19:10 Birdrock

I assume a fix for that is in the works or is that something that is planned to be a limiting factor of running docker based apps on CF?

braunsonm avatar Oct 20 '20 19:10 braunsonm

We don't plan on it to be a long term limiting factor. We are deciding on how to best proceed. In the immediate term, it was a blocker due to the CNBs we use switching to UID 1000.

Here is the related Eirini PR

cc @paulcwarren

Birdrock avatar Oct 20 '20 20:10 Birdrock

@braunsonm Confirmed the EXPOSE directive is not getting respected. Filed a bug with capi-k8s-release: https://github.com/cloudfoundry/capi-k8s-release/issues/86. Thank you for reporting this.

To my understanding, even with the port-listening fixed, are you still running into issues related to running as root?

reneighbor avatar Oct 20 '20 23:10 reneighbor

@reneighbor No problem!

My issue isn't to do with running as root. Since Stratos creates a SQLite DB by default on startup, it seems that step is failing, as I showed in the logs above. I cannot reproduce this with minikube but I can with AKS. So I'm guessing it might have something to do with the PVC not being up by the time the pod is started? But really not sure.
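One way to narrow this down is to check whether SQLite can create a file at the expected location from inside the container. The sketch below is a generic probe, not Stratos code, and it assumes the failure is filesystem related (a missing or unwritable directory), which this thread has not confirmed. The function name and paths are hypothetical.

```python
import os
import sqlite3
import tempfile


def try_open_sqlite(db_path: str):
    """Return None if SQLite can open and write db_path, else the error text."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            # Force a write so a read-only location is also detected.
            conn.execute("CREATE TABLE IF NOT EXISTS ping (id INTEGER)")
            conn.commit()
        finally:
            conn.close()
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A writable directory: expected to succeed.
        print(try_open_sqlite(os.path.join(tmp, "ok.db")))
        # A directory that does not exist: SQLite cannot create the file.
        print(try_open_sqlite(os.path.join(tmp, "missing", "console.db")))
```

If the probe fails for the directory Stratos writes to, the next step would be to look at how that directory is mounted and who owns it in the running pod.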

braunsonm avatar Oct 20 '20 23:10 braunsonm

I never did resolve this problem. By using the AIO container I could get the sqlite.db file built but still couldn't communicate with the stratos server. It might have even been a problem with istio communicating with the app.

ericpromislow avatar Oct 21 '20 17:10 ericpromislow

@braunsonm We'd like to troubleshoot this with you, but want to confirm what version of cf-for-k8s you're on. When we attempt to replicate, we get the following error: "container has runAsNonRoot and image has non-numeric user (jetstream), cannot verify user is non-root"

A recent update to eirini enforces that the app user must be non-root (and numerical). It appears that the Stratos app sets the user as a non-numeric user "jetstream."

  • Can you let us know what versions of cf-for-k8s and eirini you're using?
  • If you update cf-for-k8s and repeat this test, do you get the same error we did?
  • If you work for Suse, would you edit the Dockerfile to change the user to a numerical value and let us know whether you still get the sqlite db error?
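To check an image's Dockerfile before pushing, a small script can flag a non-numeric USER. The following Python sketch is an assumption about how one might do that: it is not Eirini or Kubernetes code, the function names are hypothetical, and it only parses the Dockerfile text, so a user inherited from the base image is not detected.

```python
def final_stage_user(dockerfile_text: str):
    """Return the last USER value set in the final build stage, or None."""
    user = None
    for line in dockerfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            # A new stage starts; USER values from earlier stages do not carry over.
            user = None
        elif instruction == "USER" and len(parts) == 2:
            user = parts[1].strip()
    return user


def user_is_numeric(user_value: str) -> bool:
    """True if the user part of 'user', 'user:group', 'uid' or 'uid:gid' is a number."""
    user = user_value.strip().split(":", 1)[0]
    return user.isascii() and user.isdigit()


if __name__ == "__main__":
    sample = "FROM example/base\nRUN chgrp users /srv && chmod 775 /srv\nUSER jetstream\n"
    user = final_stage_user(sample)
    print(user, user_is_numeric(user))
```

A numeric value still has to be non-zero to satisfy runAsNonRoot, so `USER 0` would pass this check but would still be rejected at scheduling time.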

Thanks! Look forward to hearing from you, Renee

reneighbor avatar Oct 21 '20 17:10 reneighbor

  • Tested this with the current main branch, which is 0.7 I believe?
  • I now get the same error as you do. I assume this is just hiding the previous error I was getting.
   2020-10-21T14:33:22.00-0400 [API/0] OUT App instance exited with guid 72533447-3552-44fd-88e1-2e6366ae8f21 payload: {"instance"=>"console-test-space-59e6e7a918-0", "index"=>0, "cell_id"=>"", "reason"=>"CreateContainerConfigError", "exit_description"=>"container has runAsNonRoot and image has non-numeric user (jetstream), cannot verify user is non-root", "crash_count"=>0, "crash_timestamp"=>0, "version"=>"4ac4cb67-d54b-42c7-a123-bc8ead82a3e6"}

The previous error used to happen on 0.6 and 0.7 but that seems to have changed since Eirini is now a more pressing issue.

  • I do not work for Suse. I hope you plan to fix the eirini issue because that will really limit what containers people can run in CF. You don't always have control of what other image creators do with the users.

braunsonm avatar Oct 21 '20 18:10 braunsonm

Thanks @braunsonm !

We confirmed last week that after the following 2 modifications, we were able to get Stratos up and running:

  1. Edit the Stratos Dockerfile to use a numerical user (not string jetstream). An example that was edited with the all-in-one container is:
diff --git a/deploy/Dockerfile.all-in-one b/deploy/Dockerfile.all-in-one
index f57703953..9f428fd97 100644
--- a/deploy/Dockerfile.all-in-one
+++ b/deploy/Dockerfile.all-in-one
@@ -48,6 +48,6 @@ RUN usermod -aG users jetstream
 # Ensure that the /srv folder is in the users group so that the jetstream user can write to it
 RUN chgrp users /srv && chmod 775 /srv
-USER jetstream
+USER 2000
  2. kubectl patch the service to target port 5443. (This is to work around the ongoing cf-for-k8s issue which you helped us discover.)

Once the Docker image was rebuilt, the app was pushed, and the patch was applied, we were able to restart the app and access the Stratos UI.

We will file an issue with the Stratos team to address issue 1). The second issue is pending the cf-for-k8s team.

Thank you for your detailed troubleshooting :) Renee and @ericpromislow

reneighbor avatar Oct 26 '20 17:10 reneighbor

Hey @reneighbor have you tested this on AKS? Because this still does not address the root issue I had before Eirini got in the way where the volume would not be mounted to the container for the database to be initialized.

Also can you clarify #1? Why does a UID need to be provided as we have images that provide a name for the user instead. Do those not work now? Why doesn't eirini check the UID instead of the name?

Sorry for all the questions but your comment made things a little more confusing.

braunsonm avatar Oct 26 '20 17:10 braunsonm

Hey @braunsonm , there are additional details about the Eirini / bit-service bug with running on K8s in this Eirini story: https://www.pivotaltracker.com/n/projects/2172361/stories/175117727

I don't think we've been able to test this ourselves on AKS - using Stratos on AKS seems like a relatively niche use case so it's going to take us a little while to prioritize this issue (especially now that we found the work-arounds required to get Stratos running on GKE).

Sorry for all the questions but your comment made things a little more confusing.

No problem - we really appreciate your report here and working with us.

jamespollard8 avatar Oct 29 '20 21:10 jamespollard8

That's disappointing. @jamespollard8 you're basically saying that running any Docker images on CF in AKS is niche since the volume claims don't seem to be working properly?

I'll be interested to test this further with future CF and Stratos releases to see if things improve.

braunsonm avatar Oct 30 '20 20:10 braunsonm

@jamespollard8 you're basically saying that running any Docker images on CF in AKS is niche since the volume claims don't seem to be working properly?

Oh no - I definitely didn't mean to say that. To provide more context, we do test cf-for-k8s smoke tests (a cf push from source for a node app PLUS a docker image push) on AKS with the latest head of cf-for-k8s main branch each night. (here: https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-for-k8s-iaas-tests/jobs/validate-azure)

Having not dug too deep personally on this particular issue, I assumed that the tests we're currently running would provide good general coverage for

  1. Docker images (because of what the smoke tests run) and
  2. volume claims (because the database that comes with cf-for-k8s and that's used in the AKS environment relies on PVs and PVCs)

@braunsonm Do you have an idea for how to improve test coverage on AKS, ideally with something more generic than "deploy stratos"?

That said, now that I've taken a better look, this does look like an important failure mode for us to look into. I appreciate you pushing back 👍

jamespollard8 avatar Oct 30 '20 21:10 jamespollard8

Interesting! I'm happy to see the tests seem to use PV and PVCs. It was only a guess based on what I heard in the CF Slack that could result in the issue above.

Your test coverage seems right, as long as the tests use the volume as soon as possible after the docker push to CF. That appears to be what Stratos does when it fails, which is why I assumed it was a PVC issue, but perhaps not.

No rush: we can deploy Stratos directly on k8s instead, but I think it would be a common use case to have a UI for cf-for-k8s when deployed on AKS.

Thanks for your continued help looking into this!

braunsonm avatar Oct 30 '20 21:10 braunsonm

Circling back with our latest understanding of the situation here:

  • We think the port-mapping issue may have been resolved by upgrading to Eirini release v2.0.0
  • We (specifically @Birdrock) put a bunch of work into trying to get the user ID issue sorted out, but we were not able to get those changes to actually land in Eirini. (The implementation was brittle and didn't make sense to merge.)
  • Because of that, we're not able to resolve that issue with Stratos ourselves. You'll need to modify the Stratos Dockerfile as shown in Eric's tracker comment here
  • @braunsonm we'd recommend bumping this Stratos thread: https://github.com/cloudfoundry/stratos/issues/4715

Sorting this into "known issues"

jamespollard8 avatar Jan 20 '21 22:01 jamespollard8