vSwarm Error in Container Creation on Running Benchmark

I encounter the following issue when deploying the chained-function-serving and aes benchmarks (the only two I have tried so far). As a representative example, I include the details for aes here.

Steps to reproduce the issue

I used the vSwarm-u profile on CloudLab to reproduce the issue. Set up a single-node cluster as shown in the vHive quick-start guide and pull all the required images:

git clone --depth=1 https://github.com/ease-lab/vhive.git 
cd vhive && mkdir /tmp/vhive-logs
./scripts/cloudlab/setup_node.sh;
sudo screen -dmS containerd containerd; sleep 5;
sudo PATH=$PATH screen -dmS firecracker /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml; sleep 5;
source /etc/profile && go build;
sudo screen -dmS vhive ./vhive; sleep 5;
./scripts/cluster/create_one_node_cluster.sh

cd ..
sudo apt install docker.io
git clone --depth=1 https://github.com/ease-lab/vSwarm.git
cd vSwarm/benchmarks/aes
sudo make pull

Ensure that kubectl get pods -A shows every pod's status as Running or Completed (if not, wait until it does). Now deploy a function. As a representative example, we deploy kn-aes-go.

kubectl apply -f ./yamls/knative/kn-aes-go.yaml

which gives the output

service.serving.knative.dev/aes-go created
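The readiness check from the previous step (all pods Running or Completed before applying the manifest) can be scripted. Below is a minimal sketch, assuming the usual column layout of kubectl get pods -A, where STATUS is the fourth column. The function name all_pods_ready is my own and is not part of vHive or vSwarm.

```shell
#!/bin/sh
# Hypothetical helper (not part of vHive or vSwarm).
# Reads the table printed by `kubectl get pods -A` on stdin, header included,
# and succeeds only if at least one pod row exists and every row has
# STATUS (4th column) equal to Running or Completed.
all_pods_ready() {
    awk 'NR > 1 && NF > 0 {
             rows++
             if ($4 != "Running" && $4 != "Completed") bad = 1
         }
         END { exit ((bad || rows == 0) ? 1 : 0) }'
}

# Usage sketch: poll every 5 seconds until the cluster settles.
# until kubectl get pods -A | all_pods_ready; do sleep 5; done
```

Treating empty input as "not ready" avoids a false positive when kubectl itself fails and prints nothing.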

The Error

A CreateContainerError is encountered in one of the containers.

kubectl get pods -A

outputs (only the relevant row shown here)

NAMESPACE          NAME                                                            READY   STATUS                 RESTARTS   AGE
default            aes-go-00001-deployment-86749dbdf9-qcrzc                        2/3     CreateContainerError   0          15s

For the 3 containers user-container-0, user-container-1 and queue-proxy, here are the outputs of kubectl logs aes-go-00001-deployment-86749dbdf9-qcrzc -c ${CONTAINER_NAME} in that order:

time="2022-08-29T14:48:46Z" level=info msg="Started relay server at 0.0.0.0:50000"

time="2022-08-29T14:48:46Z" level=info msg="Start AES-go server. Addr: 0.0.0.0:50051\n"

Error from server (BadRequest): container "queue-proxy" in pod "aes-go-00001-deployment-86749dbdf9-qcrzc" is waiting to start: CreateContainerError

So it is the queue-proxy container that fails to start. Here is the list of events (part of the output of kubectl describe pod aes-go-00001-deployment-86749dbdf9-qcrzc):

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m57s                  default-scheduler  Successfully assigned default/aes-go-00001-deployment-86749dbdf9-qcrzc to node-0.opt.rperf-pg0.utah.cloudlab.us
  Normal   Pulling    3m57s                  kubelet            Pulling image "docker.io/vhiveease/relay:latest"
  Normal   Pulled     3m54s                  kubelet            Successfully pulled image "docker.io/vhiveease/relay:latest" in 2.795450552s
  Normal   Created    3m54s                  kubelet            Created container user-container-0
  Normal   Started    3m54s                  kubelet            Started container user-container-0
  Normal   Pulling    3m54s                  kubelet            Pulling image "docker.io/vhiveease/aes-go:latest"
  Normal   Pulled     3m52s                  kubelet            Successfully pulled image "docker.io/vhiveease/aes-go:latest" in 1.87558108s
  Normal   Created    3m52s                  kubelet            Created container user-container-1
  Normal   Started    3m52s                  kubelet            Started container user-container-1
  Normal   Pulling    3m52s                  kubelet            Pulling image "docker.io/vhiveease/queue-39be6f1d08a095bd076a71d288d295b6@sha256:7664e43ef34eccf3c311a0a7fa75da472303faf387e3f5f0a5fb863a9dbc3aff"
  Normal   Pulled     3m49s                  kubelet            Successfully pulled image "docker.io/vhiveease/queue-39be6f1d08a095bd076a71d288d295b6@sha256:7664e43ef34eccf3c311a0a7fa75da472303faf387e3f5f0a5fb863a9dbc3aff" in 3.35848303s
  Warning  Failed     2m32s (x8 over 3m49s)  kubelet            Error: VM config for pod does not exist
  Normal   Pulled     2m32s (x7 over 3m48s)  kubelet            Container image "docker.io/vhiveease/queue-39be6f1d08a095bd076a71d288d295b6@sha256:7664e43ef34eccf3c311a0a7fa75da472303faf387e3f5f0a5fb863a9dbc3aff" already present on machine

Note the second-to-last line, which says VM config for pod does not exist. The same error appears on the vhive screen:

ERRO[2022-08-29T08:58:46.612409403-06:00] VM config for pod ef0e43e71a5343cca51f0bdfa0823db4e521c5f50d20b243255c6dc4c3971bce does not exist
ERRO[2022-08-29T08:58:46.612459170-06:00] error="VM config for pod does not exist"
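To pick the warnings out of a long events table like the one above, a small filter helps. This is a rough sketch, assuming the Events layout that kubectl describe pod prints, with the event type in the first column. The name warning_events is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the rows whose first field is "Warning"
# from the output of `kubectl describe pod <pod>` read on stdin.
warning_events() {
    awk '$1 == "Warning"'
}

# Usage sketch:
# kubectl describe pod aes-go-00001-deployment-86749dbdf9-qcrzc | warning_events
```

On the table above this would leave only the "VM config for pod does not exist" row. As far as I know, kubectl get events --field-selector type=Warning gives a similar view for the whole namespace.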

kn service list gives the output

NAME     URL                                            LATEST   AGE     CONDITIONS   READY     REASON
aes-go   http://aes-go.default.192.168.1.240.sslip.io            9m44s   0 OK / 3     Unknown   RevisionMissing : Configuration "aes-go" is waiting for a Revision to become ready.

Logs

kubectl describe pod aes-go-00001-deployment-fdd5c869b-dx6sz : kubectl-decribe-pod.log
kubectl get service : kubectl-get-service.log
kubectl get pods -A : kubectl-get-pods.log

Aug 29 '22 15:08 alannair

To use Firecracker MicroVMs instead of containers or gVisor VMs, the YAML files of vSwarm functions need to be modified to follow this format: https://github.com/ease-lab/vhive/blob/main/configs/knative_workloads/helloworld.yaml

See vHive Issue 68 (link is above).

@alannair could you add a note for this peculiarity to vSwarm's main README in a PR?

Aug 30 '22 13:08 ustiugov

The YAML files (e.g. kn-aes-go.yaml) contain args such as addr and function-endpoint-url, which are passed to the image. If we specify the image name and port as env variables on the stub image (as suggested), how do we pass the args?

In addition, please clarify the following: The workaround involves running the function image within an external container that is configured to work with containerd. This external container is set up such that it initializes the image/port as per the respective environment variables. Is this correct?

Aug 30 '22 20:08 alannair

@alannair the stub image does nothing although it runs in the same pod. The sole purpose of the stub container is to serve heartbeats coming from knative & k8s. Ultimately, we should make sure the target container serves those messages on its own but for that we need to investigate the problem further.

I think the env variables are set up for all containers. The arguments are just runtime arguments supplied to the command to run inside the target container.

Aug 31 '22 10:08 ustiugov

@ustiugov I am able to deploy the functions by using the modified yaml format. Here is the modified aes-python manifest.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: aes-python
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: crccheck/hello-world:latest # Stub image. See https://github.com/ease-lab/vhive/issues/68
          ports:
            - name: h2c # For GRPC support
              containerPort: 50051
          env:
            - name: GUEST_PORT # Port on which the firecracker-containerd container is accepting requests
              value: "50051"
            - name: GUEST_IMAGE # Container image to use for firecracker-containerd container
              value: "docker.io/vhiveease/aes-python:latest"

As you can see, I have skipped the args parameters which were passed to the target container in the original manifest.
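Since the same change applies to every benchmark, the manifest above can be generated rather than edited by hand. Here is a minimal sketch, assuming the stub-image layout shown above. The function name stub_manifest and its argument order are my own invention, and it does not address passing args to the target container.

```shell
#!/bin/sh
# Hypothetical helper: emit a Knative Service manifest in the stub-image
# format shown above.
# Arguments: <service name> <guest image> [guest port, default 50051]
stub_manifest() {
    name=$1 image=$2 port=${3:-50051}
    cat <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ${name}
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: crccheck/hello-world:latest # Stub image
          ports:
            - name: h2c # For gRPC support
              containerPort: ${port}
          env:
            - name: GUEST_PORT
              value: "${port}"
            - name: GUEST_IMAGE
              value: "${image}"
EOF
}

# Usage sketch:
# stub_manifest aes-python docker.io/vhiveease/aes-python:latest 50051 | kubectl apply -f -
```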

The problem is that, while the function deploys successfully, invocation fails. Here is the output of ./invoker -port 80 -dbg -time 1 -rps 1:

DEBU[2022-08-31T14:45:40.870005892-06:00] Debug logging is enabled                     
INFO[2022-08-31T14:45:40.870107586-06:00] Reading the endpoints from the file: endpoints.json 
DEBU[2022-08-31T14:45:40.870262284-06:00] Invoking: aes-python.default.192.168.1.240.sslip.io:80 
WARN[2022-08-31T14:45:40.891286725-06:00] Failed to invoke aes-python.default.192.168.1.240.sslip.io:80, err=rpc error: code = Unimplemented desc = Method not found! 
DEBU[2022-08-31T14:45:40.891392746-06:00] Invoked aes-python.default.192.168.1.240.sslip.io in 21130 usec 
INFO[2022-08-31T14:45:41.871316037-06:00] Issued / completed requests: 1, 0            
INFO[2022-08-31T14:45:41.871380829-06:00] Real / target RPS: 0.00 / 1                  
INFO[2022-08-31T14:45:41.871401948-06:00] Experiment finished!                         
INFO[2022-08-31T14:45:41.871419873-06:00] The measured latencies are saved in rps0.00_lat.csv
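The key detail in this output is the gRPC status code on the WARN line. Unimplemented means that the server that answered does not implement the requested method. Over a longer run, the codes can be extracted from the log with a filter like the sketch below, which assumes the "Failed to invoke ... rpc error: code = X desc = Y" wording shown above. The name grpc_failure_codes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read invoker log lines on stdin and print the gRPC
# status code of each failed invocation, one per line.
grpc_failure_codes() {
    sed -n 's/.*Failed to invoke .*rpc error: code = \([A-Za-z]*\) desc = .*/\1/p'
}

# Usage sketch: count failures per status code.
# ./invoker -port 80 -dbg -time 1 -rps 1 2>&1 | grpc_failure_codes | sort | uniq -c
```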

I am speculating here, but this is probably because I did not pass the args to the target container (right?).

But the new manifest does not instantiate the target container. It just instantiates the stub.

How, then, does one pass args to the target container?
How exactly is the target container even instantiated from the environment variables (GUEST_IMAGE)?

Passing args to the stub container is futile.

Aug 31 '22 20:08 alannair

Is there a workaround for this now?

Nov 29 '22 08:11 jingren1021

@jingren1021 could you please specify what exactly you need a workaround for? If you mean using vSwarm with Firecracker, the YAML format changes are described above.

Sorry for the late response.

Dec 16 '22 03:12 ustiugov