How to re-trigger failed flannel to calico migration?

Open Siddharood opened this issue 3 years ago • 4 comments

Expected Behavior

I want to rerun a failed flannel-to-Calico migration and make sure that it completes and that calico-node is running on every node, even when failures occur during the flannel migration job.

Current Behavior

I have a 3-node Kubernetes cluster (k8s 1.22) with flannel set up and running fine. I ran the live migration steps from flannel to Calico as described here, and the cluster migrated to Calico successfully. However, when I repeated the migration several times, on a couple of runs I encountered the scenario below.

  • The flannel migration job was still running after 20 hours.
  • The flannel-to-Calico migration was not successful.
  • Two of the three nodes (node2 and node3) were running calico-node.
  • One node (node1) was in SchedulingDisabled state, and calico-node was not running on it.
  • When I read the logs from the flannel migration job, it was trying to reach the kube-apiserver, which was down at that time, so the job failed.
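
To confirm which nodes were left cordoned, the node list can be checked for the `spec.unschedulable` field, which Kubernetes sets to `true` on nodes shown as SchedulingDisabled. The following is a minimal sketch, not part of the migration tooling: the function name `cordoned_nodes` and the sample data are hypothetical and only mirror the node names in this report.

```python
import json
from typing import List


def cordoned_nodes(node_list_json: str) -> List[str]:
    """Return the names of nodes whose spec.unschedulable is true.

    Expects the JSON printed by `kubectl get nodes -o json`.
    """
    doc = json.loads(node_list_json)
    return [
        item["metadata"]["name"]
        for item in doc.get("items", [])
        # The field is absent on schedulable nodes, so default to False.
        if item.get("spec", {}).get("unschedulable", False)
    ]


# Hypothetical sample shaped like the cluster described above.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "node1"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "node2"}, "spec": {}},
        {"metadata": {"name": "node3"}, "spec": {}},
    ]
})

if __name__ == "__main__":
    print(cordoned_nodes(SAMPLE))
```

On a real cluster the input would come from `kubectl get nodes -o json`. A cordoned node can be made schedulable again with `kubectl uncordon <node>`, although whether that is safe in the middle of a migration is not established in this thread.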

Later, once the kube-apiserver was stable, I tried re-running the flannel migration job, but it did not proceed and nothing happened.

Possible Solution

Steps to Reproduce (for bugs)

  1. Install a 3-node k8s cluster with flannel.
  2. Run the flannel-to-Calico migration following the steps mentioned here.
  3. Bring down the kube-apiserver on one node.

Context

When the flannel-to-Calico migration fails midway with scheduling disabled on a node, the cluster is left in a non-ideal state.

Your Environment

  • Calico version: 3.21
  • Orchestrator version: kubernetes 1.22
  • Operating System and version: Ubuntu 18.04
  • Link to your project (optional):

Any help on this is appreciated.

Siddharood avatar Jul 19 '22 06:07 Siddharood

CC @song-jiang

caseydavenport avatar Jul 20 '22 21:07 caseydavenport

@Siddharood I know you summarized what was in your logs but could you share your logs anyways?

I'm not the flannel migration expert, but looking at the migration controller, there's a check that the migration controller runs. Probably parts of your cluster were migrated.

lmm avatar Jul 26 '22 16:07 lmm

Thanks for the response @lmm. I will provide logs and also take a look at the check.

Siddharood avatar Jul 28 '22 05:07 Siddharood

@song-jiang - Any help on this is appreciated.

Siddharood avatar Aug 05 '22 04:08 Siddharood

Sorry for the late reply. I lost the GitHub notification somehow.

> Later, once the kube-apiserver was stable, I tried re-running the flannel migration job, but it did not proceed and nothing happened.

  1. Was the flannel migration controller running when you reran the job?
  2. What does the flannel migration controller log show? Normally the log shows why it is not proceeding.
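
One way to gather answers to both questions is to print the `kubectl` commands that inspect the job, the controller pod, and its log. This is only a sketch under stated assumptions: the namespace `kube-system`, the job name `flannel-migration`, and the label selector `k8s-app=flannel-migration-controller` are assumed values that are not confirmed anywhere in this thread, so adjust them to match the manifests that were actually applied.

```python
import shlex
from typing import List

# Assumed values, not confirmed in this thread; adjust to your manifests.
NAMESPACE = "kube-system"
JOB_NAME = "flannel-migration"
CONTROLLER_SELECTOR = "k8s-app=flannel-migration-controller"


def diagnostic_commands(
    namespace: str = NAMESPACE,
    job: str = JOB_NAME,
    selector: str = CONTROLLER_SELECTOR,
    tail: int = 200,
) -> List[List[str]]:
    """Build kubectl argument lists for checking the migration controller."""
    return [
        # Job status and events (completions, failures, pod restarts).
        ["kubectl", "-n", namespace, "describe", "job", job],
        # Is a controller pod running right now, and on which node?
        ["kubectl", "-n", namespace, "get", "pods", "-l", selector, "-o", "wide"],
        # Recent controller log lines.
        ["kubectl", "-n", namespace, "logs", "-l", selector, f"--tail={tail}"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in diagnostic_commands():
        print(shlex.join(cmd))
```

The script only prints the commands; paste the output of each one into the issue so the controller state at the time of the rerun is visible.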

song-jiang avatar Aug 23 '22 16:08 song-jiang

I'm going to close this for now due to inactivity but feel free to reopen, thanks.

lmm avatar Sep 20 '22 16:09 lmm