amazon-vpc-cni-k8s icon indicating copy to clipboard operation
amazon-vpc-cni-k8s copied to clipboard

log rotation problem

Open Jaeyo opened this issue 3 years ago • 10 comments

in cni, lumberjack used for log rotation. but lumberjack assumes that only one process is writing to the output files. but cni can executed with multiple processes. so the log rotation results in improper behavior.

Jaeyo avatar Nov 17 '21 06:11 Jaeyo

@Jaeyo - Can you provide bit more context on what you are seeing. Lumberjack seems to be safe to use from multiple go routines as noted here. aws-node DS doesn't have multiple process. There is no supported use case with AWS VPC CNI that would require multiple instances of aws-node on same node/instance. Are you concerned about other processes in the instance writing to same ipamd.log file present in host file system ? Would be great if you can share the specific scenario that is creating impact on log file.

srini-ram avatar Nov 18 '21 19:11 srini-ram

@Jaeyo - Can you provide bit more context on what you are seeing. Lumberjack seems to be safe to use from multiple go routines as noted here. aws-node DS doesn't have multiple process. There is no supported use case with AWS VPC CNI that would require multiple instances of aws-node on same node/instance. Are you concerned about other processes in the instance writing to same ipamd.log file present in host file system ? Would be great if you can share the specific scenario that is creating impact on log file.

@srini-ram as you mentioned, there's only one DS per node. but the cni binary file is executed by container runtime. therefore, even if there is only one DS, the cni binary may be executed several times at the same time.

to explain in more detail, DS just places the cni binary in the cni bin path of work node. when new pod created, container runtime execute cni binary with parameter (CNI_COMMAND=ADD) and network configuration via stdin. so, there can be multiple running cni process at the same time.

as you mentioned, lumberjack is safe in multiple go routine, but not safe in multiple process. check this link please.

below is test for multiple process.

if i changed log MaxSize to 1MB for test like below,

func getLogWriter(logFilePath string) zapcore.WriteSyncer {
	lumberJackLogger := &lumberjack.Logger{
		Filename:   logFilePath,
		MaxSize:    1,
		MaxBackups: 5,
		MaxAge:     30,
		Compress:   true,
	}
	return zapcore.AddSync(lumberJackLogger)
}

and write simple logging logic like below.

package main

import (
	"github.com/aws/amazon-vpc-cni-k8s/pkg/utils/logger"
	"os"
)

func main() {
	os.Setenv("AWS_VPC_K8S_CNI_LOG_FILE", "./logs/test.log")
	log := logger.Get()

	for i := 0; i < 100000; i++ {
		log.Info("hello, world")
	}
}

if i run this main function just once, the log rotated properly. but if i run this function multiple times at the same time, the problem is reproduced.

$ go run cmd/test/main.go& go run cmd/test/main.go
[1] 82443
[1]  + 82443 done       go run cmd/test/main.go

$ ls -alh ./logs
total 440
drwxr--r--   8 user  staff   256B 11 19 11:01 .
drwxr-xr-x  28 user  staff   896B 11 19 10:46 ..
-rw-r--r--   1 user  staff   653B 11 19 11:01 test-2021-11-19T02-01-59.607.log.gz
-rw-r--r--   1 user  staff   3.9K 11 19 11:01 test-2021-11-19T02-01-59.691.log.gz
-rw-r--r--   1 user  staff   202B 11 19 11:01 test-2021-11-19T02-01-59.704.log.gz
-rw-r--r--   1 user  staff   3.8K 11 19 11:01 test-2021-11-19T02-01-59.784.log.gz
-rw-r--r--   1 user  staff   3.8K 11 19 11:01 test-2021-11-19T02-01-59.855.log.gz
-rw-r--r--   1 user  staff   196K 11 19 11:01 test.log

$ gzip -d ./logs/*.gz
$ ls -alh logs
total 6848
drwxr--r--   8 user  staff   256B 11 19 11:02 .
drwxr-xr-x  28 user  staff   896B 11 19 10:46 ..
-rw-r--r--   1 user  staff   134K 11 19 11:01 test-2021-11-19T02-01-59.607.log    <--- here: 
-rw-r--r--   1 user  staff   1.0M 11 19 11:01 test-2021-11-19T02-01-59.691.log
-rw-r--r--   1 user  staff    14K 11 19 11:01 test-2021-11-19T02-01-59.704.log     <--- here
-rw-r--r--   1 user  staff   1.0M 11 19 11:01 test-2021-11-19T02-01-59.784.log
-rw-r--r--   1 user  staff   1.0M 11 19 11:01 test-2021-11-19T02-01-59.855.log
-rw-r--r--   1 user  staff   196K 11 19 11:01 test.log

log size is not 1MB, and some log has lost. you can verify log lost situation more exactly if two process write different log.

Jaeyo avatar Nov 19 '21 02:11 Jaeyo

@Jaeyo - Thanks for sharing details. As you had pointed out, CNI binary would need to be fixed. Once we have fix, we will report back here.

srini-ram avatar Nov 19 '21 23:11 srini-ram

Thanks @Jaeyo

I was able to repro the same steps -

 ls -alh ./logs
total 680K
drwxr--r--  2 varavaj amazon 4.0K Jan 14 05:09 .
drwxr-xr-x 16 varavaj amazon 4.0K Jan 14 05:09 ..
-rw-r--r--  1 varavaj amazon  228 Jan 14 05:09 test-2022-01-14T05-09-04.177.log.gz
-rw-r--r--  1 varavaj amazon 3.8K Jan 14 05:09 test-2022-01-14T05-09-04.224.log.gz
-rw-r--r--  1 varavaj amazon  498 Jan 14 05:09 test-2022-01-14T05-09-04.227.log.gz
-rw-r--r--  1 varavaj amazon 3.6K Jan 14 05:09 test-2022-01-14T05-09-04.272.log.gz
-rw-r--r--  1 varavaj amazon  547 Jan 14 05:09 test-2022-01-14T05-09-04.280.log.gz
-rw-r--r--  1 varavaj amazon 650K Jan 14 05:09 test.log

Previously we were using "https://github.com/cihub/seelog", I quickly did a POC with the above example by replacing lumberjack with seelog and issue is not seen. As @srini-ram mentioned we don't need this for ipamd and needed only for CNI binary. ~~There is an issue with archiving though which I am debugging~~.

go run cmd/test/main.go && go run cmd/test/main.go
-rw-r--r--  1 varavaj amazon  920 Jan 14 06:37 plugin.log
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9066
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9067
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9068
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9069
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9070
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9071
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9072
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9073
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9074
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9075
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9076
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9077
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9078
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9079
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9080
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9081
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9082
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9083
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9084
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9085
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9086
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9087
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9088
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9089
-rw-r--r--  1 varavaj amazon 1012 Jan 14 06:37 plugin.log.9090
-rw-r--r--  1 varavaj amazon 8.8M Jan 14 06:37 test.log

jayanthvn avatar Jan 14 '22 06:01 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 29 '22 00:04 github-actions[bot]

/not stale

jayanthvn avatar Apr 29 '22 01:04 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Aug 01 '22 00:08 github-actions[bot]

/not stale

jayanthvn avatar Aug 01 '22 03:08 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Oct 02 '22 00:10 github-actions[bot]

/not stale

jayanthvn avatar Oct 02 '22 15:10 jayanthvn