
osdsdock stops backend pool discovery on empty pool list retrieval

Open thatsdone opened this issue 6 years ago • 7 comments

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened:

osdsdock stops backend pool discovery on empty pool list retrieval.

What you expected to happen:

osdsdock continues backend pool discovery even if it received an empty pool list.

How to reproduce it (as minimally and precisely as possible):

Try to set up an all-in-one OpenSDS box with only the Ceph backend, using the current master of opensds-installer.

Anything else we need to know?:

Please find my analysis below.

osdsdock stops the backend pool discovery loop when Discover() gets an empty pool list from the underlying driver.

Getting an empty list of pools here is not a failure, but a successful result from ListPools() of the backend driver. So, Discover() should not return an error (using fmt.Errorf()) at line 178:

https://github.com/opensds/opensds/blob/stable/capri/pkg/dock/discovery/discovery.go#L178

In contrast, Discover() should return an error at line 167, because there 'err' contains error information from the underlying ListPools(), for example a communication failure with the backend storage server.

https://github.com/opensds/opensds/blob/stable/capri/pkg/dock/discovery/discovery.go#L167
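
To make the proposed behavior concrete, here is a minimal sketch. It is not the actual discovery.go code: `Pool`, `PoolLister`, `discover`, `fakeDriver`, and `errUnreachable` are hypothetical names used only for illustration. The only point is that a driver error is propagated, while an empty pool list is logged and returned as a normal result.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Pool and PoolLister are hypothetical stand-ins for the real driver types.
type Pool struct{ Name string }

type PoolLister interface {
	ListPools() ([]Pool, error)
}

// errUnreachable simulates a communication failure with the backend.
var errUnreachable = errors.New("backend unreachable")

// discover returns an error only when the driver call itself fails.
// An empty pool list is a valid result, so it is logged and returned as-is.
func discover(d PoolLister) ([]Pool, error) {
	pools, err := d.ListPools()
	if err != nil {
		// Real failure: propagate it to the caller.
		return nil, fmt.Errorf("list pools: %w", err)
	}
	if len(pools) == 0 {
		// Not a failure: warn and let the caller keep polling.
		log.Printf("pool list is empty")
	}
	return pools, nil
}

// fakeDriver is a test double that returns whatever it was given.
type fakeDriver struct {
	pools []Pool
	err   error
}

func (f fakeDriver) ListPools() ([]Pool, error) { return f.pools, f.err }

func main() {
	pools, err := discover(fakeDriver{})
	fmt.Println(len(pools), err)

	_, err = discover(fakeDriver{err: errUnreachable})
	fmt.Println(err)
}
```

With this shape, the caller's periodic loop sees no error on an empty list and simply tries again on the next interval.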

Below is the terminal log that I captured right after opensds-installer completed.

[ubuntu@ubuntu203 ~(opensds_admin)]$ osdsctl pool list
+----+------+-------------+--------+---------------+--------------+
| Id | Name | Description | Status | TotalCapacity | FreeCapacity |
+----+------+-------------+--------+---------------+--------------+
+----+------+-------------+--------+---------------+--------------+

[ubuntu@ubuntu203 ~(opensds_admin)]$ cat /var/log/opensds/osdsdock.INFO
Log file created at: 2019/08/23 13:02:18
Running on machine: ubuntu203
Binary: Built with gc go1.11.2 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0823 13:02:18.823034   21471 logs.go:75] [Info] LogFlushFrequency: 5s
I0823 13:02:18.828333   21471 dock.go:112] Dock server initialized! Start listening on port:[::]:50050
W0823 13:02:18.835255   21471 discovery.go:173] The pool of dock 1d725929-be84-5b4c-bd18-1e7c82c5cc3b is empty!
E0823 13:02:18.835434   21471 dock.go:96] when calling capabilty report method:There is no pool can be found.

In the last line above, "There is no pool can be found" is returned from lines 178-179 of Discover(), and osdsdock stops the periodic discovery process.

Also, after restarting osdsdock, I got the result below.

[root@ubuntu203 opensds-hotpot-linux-amd64(opensds_admin)]# osdsctl pool list
+--------------------------------------+------+-------------+--------+---------------+--------------+
| Id                                   | Name | Description | Status | TotalCapacity | FreeCapacity |
+--------------------------------------+------+-------------+--------+---------------+--------------+
| 0517f561-85b3-5f6a-a38d-8b5a08bff7df | rbd  |             |        | 23            | 23           |
+--------------------------------------+------+-------------+--------+---------------+--------------+

ubuntu@ubuntu203:~$ cat /var/log/opensds/osdsdock.INFO
Log file created at: 2019/08/23 13:07:45
Running on machine: ubuntu203
Binary: Built with gc go1.11.2 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0823 13:07:45.932021   22590 logs.go:75] [Info] LogFlushFrequency: 5s
I0823 13:07:45.933064   22590 dock.go:112] Dock server initialized! Start listening on port:[::]:50050
I0823 13:07:45.944763   22590 discovery.go:161] Backend ceph discovered pool rbd
I0823 13:08:45.959946   22590 discovery.go:161] Backend ceph discovered pool rbd

# osdsdock continues to output the above line while checking the pool list of the backend.

IMHO, this is related to multiple other issues.

For example, opensds-installer currently has several issues with Ceph backend installation, and some of them could have the same root cause.

https://github.com/opensds/opensds-installer/issues/255 etc.

Also, the Ceph Mimic integration issue could have the same root cause.

https://github.com/opensds/opensds/issues/989

Environment:

  • Hotpot(release/branch) version: v0.6.1 (Capri)

  • OS (e.g. from /etc/os-release): Ubuntu 16.04.6

  • Kernel (e.g. uname -a): Linux ubuntu203 4.4.0-157-generic #185-Ubuntu SMP Tue Jul 23 09:17:01 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: opensds-installer

  • Others:

thatsdone avatar Aug 24 '19 15:08 thatsdone

@joseph-v please take a look at it

wisererik avatar Aug 26 '19 04:08 wisererik

It's solved with PR opensds/opensds-installer#273, will close it now.

wisererik avatar Sep 09 '19 04:09 wisererik

@wisererik Ah, my installer-side fix is just a workaround, and IMHO we still need to pay attention to error handling on the osdsdock side during the discovery process. So, please do not close this issue yet.

If my point makes sense and other folks are busy, I can work on this osdsdock side topic.

thatsdone avatar Sep 09 '19 07:09 thatsdone

OK, Himanshu told me it's discovered by osdsdock. Will check again until there is no problem in the master branch.

wisererik avatar Sep 09 '19 09:09 wisererik

I have the same issue. After "There is no pool can be found" appears, discovery stops. I used the Kubernetes install (https://github.com/opensds/opensds/tree/master/install/kubernetes).

I also wonder how the discovery module finds LVM volumes. Sometimes it does not seem to find a new VG when I add one (vgcreate).

jmjoo avatar Sep 16 '19 10:09 jmjoo

Hi, I created a tentative PR for discussion. https://github.com/opensds/opensds/pull/1014

I hope the NOTE lines help core-dev members understand my points. Please note the PR is not intended for merge at the moment because it contains discussion notes and does not contain test code.

thatsdone avatar Sep 20 '19 09:09 thatsdone

Some additional notes from the Slack discussion with Himanshu and Ashit regarding https://github.com/opensds/opensds/pull/1014

There are 2 points.

  1. Return a non-error result on empty pool list retrieval. (L179)
    • I think this is the root cause of this issue.
  2. Return an error on backend pool list retrieval failure. (L168)

Another discussion point is that osdsdock handles multiple backends in a single process. Thus, the best resolution could be to keep per-backend state with a per-backend error count and stop polling only the faulty backend(s), so that osdsdock keeps working for the healthy (surviving) backend(s). In this sense, using one goroutine per backend would also make sense, IMHO.
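
A rough sketch of the per-backend idea, for discussion only. `backendState`, `pollAll`, `errDown`, and the error threshold are hypothetical and not taken from opensds. A real dock would poll on a timer (for example with time.Ticker) rather than for a fixed number of rounds; the fixed count only keeps the sketch deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDown simulates a backend that cannot be reached.
var errDown = errors.New("backend unreachable")

// backendState tracks consecutive discovery failures for one backend.
type backendState struct {
	failures  int
	maxErrors int
}

// record updates the failure count and reports whether polling should
// continue for this backend. A success resets the count.
func (s *backendState) record(err error) bool {
	if err == nil {
		s.failures = 0
		return true
	}
	s.failures++
	return s.failures < s.maxErrors
}

// pollAll runs one goroutine per backend. Each goroutine calls its own
// discover function up to `rounds` times and stops early only when that
// backend reaches its error limit. It returns the number of discovery
// calls made per backend.
func pollAll(backends map[string]func() error, rounds, maxErrors int) map[string]int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		calls = make(map[string]int)
	)
	for name, discover := range backends {
		wg.Add(1)
		go func(name string, discover func() error) {
			defer wg.Done()
			st := &backendState{maxErrors: maxErrors}
			n := 0
			for i := 0; i < rounds; i++ {
				n++
				if !st.record(discover()) {
					break // only this backend stops; others keep polling
				}
			}
			mu.Lock()
			calls[name] = n
			mu.Unlock()
		}(name, discover)
	}
	wg.Wait()
	return calls
}

func main() {
	res := pollAll(map[string]func() error{
		"ceph": func() error { return nil },
		"lvm":  func() error { return errDown },
	}, 5, 3)
	fmt.Println("ceph:", res["ceph"], "lvm:", res["lvm"])
}
```

In this sketch the failing backend is abandoned after three consecutive errors, while the healthy one completes all five rounds.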

thatsdone avatar Sep 22 '19 07:09 thatsdone