
[multiasic][supervisor] sonic-db-cli crashes at boot when database.sh executes the sonic-db-cli PING command on a multi-ASIC platform

Open mlok-nokia opened this issue 2 years ago • 6 comments

Description

On the supervisor card, sonic-db-cli crashes when database.sh executes the sonic-db-cli PING command. The new implementation of sonic-db-cli handles the PING command by calling initializeGlobalConfig(), which checks the redis#/sonic-db/database_config.json files of all ASICs. These files are not ready yet at that point, which causes the crash and the error log below. This call was used to wait for all databases to become ready. If sonic-db-cli tries to access the redis#/sonic-db/database_config.json files at this stage, it fails.

Sep  9 23:21:15 sonic sonic-db-cli: :- parseDatabaseConfig: Sonic database config file doesn't exist at /var/run/redis/sonic-db/../../redis0/sonic-db/database_config.json
Sep  9 23:21:15 sonic database.sh[4739]: terminate called after throwing an instance of 'std::runtime_error'
Sep  9 23:21:15 sonic database.sh[4739]:   what():  Sonic database config file syntax error >> Sonic database config file doesn't exist at /var/run/redis/sonic-db/../../redis0/sonic-db/database_config.json
Sep  9 23:21:15 sonic sonic-db-cli: :- initializeGlobalConfig: Sonic database config file syntax error >> Sonic database config file doesn't exist at /var/run/redis/sonic-db/../../redis0/sonic-db/database_config.json

There are 16 ASICs on this supervisor card. This issue is similar to https://github.com/sonic-net/sonic-buildimage/issues/10105. If the sonic-db-cli behavior has changed, we may need to change waitForAllInstanceDatabaseConfigJsonFilesReady.
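
The readiness check discussed above can be illustrated with a small shell helper. This is only a sketch under stated assumptions: the function name wait_for_db_configs, its arguments, and the timeout are hypothetical, and the redisN/sonic-db/database_config.json layout is inferred solely from the paths in the log above. It is not the code of waitForAllInstanceDatabaseConfigJsonFilesReady.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SONiC): wait until every per-ASIC
# database_config.json file exists, or give up after a timeout.
# Arguments (all assumptions made for this sketch):
#   $1  base directory containing redis0, redis1, ... (for example /var/run)
#   $2  number of ASICs
#   $3  timeout in seconds (default 60)
wait_for_db_configs() {
    local base_dir=$1 num_asics=$2 timeout=${3:-60}
    local waited=0 asic missing

    while :; do
        missing=0
        for ((asic = 0; asic < num_asics; asic++)); do
            # Path layout inferred from the log lines in this issue.
            [[ -f "$base_dir/redis$asic/sonic-db/database_config.json" ]] || missing=1
        done
        if (( missing == 0 )); then
            return 0    # every config file is present
        fi
        if (( waited >= timeout )); then
            return 1    # report "not ready" instead of crashing
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

On a 16-ASIC supervisor this might be called as `wait_for_db_configs /var/run 16 120 || echo "database configs not ready"`. The point of the sketch is that a missing file leads to a normal non-zero return that the caller can retry on, rather than an uncaught exception.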

Steps to reproduce the issue:

  1. Reboot the system with the new image.

Describe the results you received:

Core files are generated, along with the following error logs:

Sep  9 23:21:15 sonic sonic-db-cli: :- parseDatabaseConfig: Sonic database config file doesn't exist at /var/run/redis/sonic-db/../../redis0/sonic-db/database_config.json
Sep  9 23:21:15 sonic database.sh[4739]: terminate called after throwing an instance of 'std::runtime_error'
Sep  9 23:21:15 sonic database.sh[4739]:   what():  Sonic database config file syntax error >> Sonic database config file doesn't exist at /var/run/redis/sonic-db/../../redis0/sonic-db/database_config.json
Sep  9 23:21:15 sonic sonic-db-cli: :- initializeGlobalConfig: Sonic database config file syntax error >> Sonic database config file doesn't exist at /var/run/redis/sonic-db/../../redis0/sonic-db/database_config.json

Describe the results you expected:

There should be no core files and no error logs from sonic-db-cli.

mlok-nokia avatar Sep 12 '22 15:09 mlok-nokia

@qiluo-msft can you please help check the sonic-db-cli behavior change and see how to fix it? It looks like a scalability issue. Thanks.

zhangyanzhao avatar Sep 14 '22 15:09 zhangyanzhao

@SuvarnaMeenakshi - could you please check if the multi-asic VS tests would catch this? Thanks.

rlhui avatar Sep 14 '22 15:09 rlhui

@abdosi, this is the same issue we are observing on the 202205-based image.

anamehra avatar Sep 14 '22 15:09 anamehra

parseDatabaseConfig

@SuvarnaMeenakshi - could you please check if the multi-asic VS tests would catch this? Thanks.

As this error is seen during boot up, the multi-asic VS test suite we have today in the PR checker will not be able to flag it. This is likely the case for any boot-up exception seen in syslog. If there is a reboot test case, any post-reboot exception seen in syslog will be flagged by the log analyzer.

This specific issue is seen only on the supervisor; it is not seen on multi-asic VS or on a multi-asic LC.

SuvarnaMeenakshi avatar Oct 21 '22 23:10 SuvarnaMeenakshi

Created the following PR to fix this issue: https://github.com/sonic-net/sonic-swss-common/pull/701

According to the database.sh code, the script waits until the database is ready by checking the result of sonic-db-cli. When the database is not ready, sonic-db-cli should return 1:

https://github.com/sonic-net/sonic-buildimage/blob/master/files/build_templates/docker_image_ctl.j2

        until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) &&
                 ($(docker exec -i database$DEV sonic-db-cli PING | grep -c PONG) -gt 0) ]]; do
          sleep 1;
        done

However, because of a code regression in sonic-db-cli, sonic-db-cli crashes instead.
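
The polling pattern in the loop above can be sketched in isolation. This is a minimal illustration, not the SONiC implementation: wait_for_pong is a hypothetical helper, and the attempt limit is an assumption added so that the sketch terminates. It shows that a PING command which exits with a non-zero status and prints no PONG is simply treated as "not ready" and retried.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SONiC): poll a PING-style command until
# its output contains PONG, or give up after a number of attempts.
# Usage: wait_for_pong <max_attempts> <command> [args...]
wait_for_pong() {
    local max_attempts=$1 attempt
    shift
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        # A failing command (non-zero exit, no PONG) just means "not ready yet".
        if "$@" 2>/dev/null | grep -q PONG; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```

A call such as `wait_for_pong 60 sonic-db-cli PING` would follow the same idea as the until loop in database.sh. The loop only behaves as intended if the client reports a not-ready database through its exit status and output rather than by aborting.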

liuh-80 avatar Oct 24 '22 03:10 liuh-80

A fix is available. Please confirm whether this can be closed, @mlok-nokia.

rlhui avatar Nov 11 '22 18:11 rlhui

I checked the changes in the 202205 branch. They don't fix all of the issues. Although the change avoids the crash and allows the database to load the configuration file, core files are still generated.

admin@supervisor:~$ ls /var/core -al
total 376
drwxr-xr-x 1 root root 4096 Nov 22 22:00 .
drwxr-xr-x 1 root root 4096 Nov 22 20:50 ..
-rw-r--r-- 1 root root 88525 Nov 22 21:42 sonic-db-cli.1669153338.6192.core.gz
-rw-r--r-- 1 root root 93392 Nov 22 21:42 sonic-db-cli.1669153339.6757.core.gz
-rw-r--r-- 1 root root 93413 Nov 22 21:42 sonic-db-cli.1669153339.6886.core.gz
-rw-r--r-- 1 root root 93284 Nov 22 21:42 sonic-db-cli.1669153339.7072.core.gz

mlok-nokia avatar Nov 22 '22 22:11 mlok-nokia

@mlok-nokia, since PR #13207 has been merged, could you please confirm that we can close this issue and https://github.com/sonic-net/sonic-buildimage/issues/13740?

liuh-80 avatar Feb 23 '23 08:02 liuh-80