
Rebuilder: Too many locations already known for chunks with duplicate positions

vdombrovski opened this issue on Jun 08, 2022 • 0 comments

ISSUE TYPE
  • Bug Report
COMPONENT NAME

oio-blob-rebuilder

SDS VERSION
5.12.0
CONFIGURATION
Default
OS / ENVIRONMENT
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
SUMMARY

When locating some objects, we can see that some of their chunks share a duplicate position:

+------+--------------------------------------------------------------------------------------------+----------------+----------------------------------+
| Pos  | Id                                                                                         | Metachunk size | Metachunk hash                   |
+------+--------------------------------------------------------------------------------------------+----------------+----------------------------------+
| 0.1  | http://100.121.97.21:6217/93746C56170BEEFCF1997B4BDED97292421B06EBD505904F7E766DB3A75EF59A |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.3  | http://100.121.97.21:6214/C23EA94874496257FBB275D6C07748D924EB66B1CA0576719C61FAFB33FDB82D |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.4  | http://100.121.98.21:6231/8913A34B162935184E4703ADAE00975E02C0EC5449576C99352A6EE7EFC207B4 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.5  | http://100.121.98.21:6209/2A3D109702AAFE2B98CD77875560F46D643F5B7E18FEF9D8D91B58AD74F43DFE |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.6  | http://100.121.98.21:6238/6102C3FD92B1870B5906FC972133CFDE1A2538F1D7E8CAC0D81D2A0EED23F00D |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.7  | http://100.121.98.22:6255/E65DE5953157382C228658A44090D4341574C45E25E68534AA6E75054D9E6DC3 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.8  | http://100.121.99.22:6223/EB76BF861097C37EDF0F27A57A070145F55AFD2370F1E9BBDE1EAAD013EBB6C8 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.9  | http://100.121.99.22:6253/7DAFDF361F054D4C5568855BD2B655149A60629B0771F31BC04424AF9EF4CFD2 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.10 | http://100.121.99.22:6231/BF8AD402FB85CE54101AF47563180B43DF67A9CA29F13242420EEA95101A9834 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.11 | http://100.121.99.22:6248/341F067F78A37A5761A5C573E5FE5D0ED8E85ABA5E8B39CC06E91833F13BD903 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.2  | http://100.121.97.22:6219/CE53AD9988731885FA41D9BD7598F407D3908701E653F10F3FFEE21769513600 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.2  | http://100.121.97.22:6212/83E857FCDD8F91937C505D8B9B5E3F8AC8C08B4DF4DF98BCD01FCBE815167D11 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
| 0.0  | http://100.121.97.22:6204/D78798C6491011730588257394F096DB5782FAE9D33BEAB0F5C1819F898F9AC9 |         400089 | E972767A4FE596FB11FB759D8AFB852A |
+------+--------------------------------------------------------------------------------------------+----------------+----------------------------------+
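Duplicate positions like the one above can be spotted programmatically from a chunk list. A minimal sketch, assuming each chunk is represented as a dict with a `pos` field (an illustrative shape, not the exact locate output schema):

```python
from collections import Counter

def duplicate_positions(chunks):
    """Return the chunk positions that appear more than once."""
    counts = Counter(chunk["pos"] for chunk in chunks)
    return sorted(pos for pos, n in counts.items() if n > 1)

# Positions from the locate output above: 0.0 through 0.11, with 0.2 twice.
positions = ["0.1", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8",
             "0.9", "0.10", "0.11", "0.2", "0.2", "0.0"]
chunks = [{"pos": p} for p in positions]
print(duplicate_positions(chunks))  # ['0.2']
```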

From https://github.com/open-io/oio-sds/pull/1909, this looks like "normal" behavior, as the oioproxy does not validate the integrity of the JSON passed to a content/create call at all.

However when rebuilding we get the following error:

2022-06-08 14:41:53.336 194635 7FD2E20B9370 log ERROR ERROR while rebuilding chunk OPENIO|A5B2970E8FBD895C486A853C2D3848AAF2835548AA1BEF2B251181FB07213EEE|DBB63ACFB7DA0500F2420539E536E998|6E561DA710A5027CA104712707BCEF7398A0C41939744791852B992D0A29AAF5: No spare chunk: found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy

This is understandable: the duplicate chunks count towards the known locations, so the storage policy's maximum of 12 locations is already reached and no spare chunk can be selected.
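The arithmetic matches the error message. A sketch, assuming EC573SITE is a 5+7 erasure-coding policy (an inference from the name and from "maximum 12 locations" in the log, i.e. k + m = 12):

```python
def spare_slots(known_locations, max_locations):
    """Number of services a spare-chunk request may still pick from."""
    return max(0, max_locations - known_locations)

# 13 chunks are located (0.0 through 0.11 plus a duplicate 0.2); with the
# chunk being rebuilt excluded, 12 locations remain known, which already
# equals the policy maximum, hence "found only 0 services".
print(spare_slots(known_locations=12, max_locations=12))  # 0
```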

STEPS TO REPRODUCE

Create a chunk with a duplicate position. One way is to forge a chunk list containing a duplicate position and send that JSON to the oioproxy in a content/create call. Make sure that the object locate command then returns an output like the one described above.

# Target the chunk to be rebuilt:
echo "A5B2970E8FBD895C486A853C2D3848AAF2835548AA1BEF2B251181FB07213EEE|DBB63ACFB7DA0500F2420539E536E998|6E561DA710A5027CA104712707BCEF7398A0C41939744791852B992D0A29AAF5" > /tmp/to_rebuild
oio-blob-rebuilder OPENIO --input-file /tmp/to_rebuild
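Forging the duplicate boils down to repeating one entry's position in the chunk list before posting it to the proxy. A data-only sketch (the `pos` and `url` field names mirror the locate output and are assumptions about the content/create payload; the HTTP call itself is omitted):

```python
import copy

# A minimal chunk list as it might be sent to content/create
# (field names are illustrative, not the exact proxy schema).
chunks = [
    {"pos": "0.0", "url": "http://100.121.97.22:6204/D787..."},
    {"pos": "0.1", "url": "http://100.121.97.21:6217/9374..."},
]

# Duplicate position 0.1, pointing at a different rawx URL.
dup = copy.deepcopy(chunks[1])
dup["url"] = "http://100.121.97.22:6212/83E8..."
chunks.append(dup)

positions = [c["pos"] for c in chunks]
print(positions.count("0.1"))  # 2: the forged duplicate
```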
EXPECTED RESULTS

Successful rebuilding of the chunk

ACTUAL RESULTS
2022-06-08 14:41:02.204 194586 7F5FDB2BE4B0 log INFO Failed to find spare chunk (attempt 1/3): found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy (HTTP 400) (STATUS 400)
2022-06-08 14:41:02.206 194586 7F5FDB2BE4B0 log INFO Failed to find spare chunk (attempt 2/3): found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy (HTTP 400) (STATUS 400)
2022-06-08 14:41:02.208 194586 7F5FDB2BE4B0 log INFO Failed to find spare chunk (attempt 3/3): found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy (HTTP 400) (STATUS 400)
2022-06-08 14:41:02.208 194586 7F5FDE38C370 log ERROR ERROR while rebuilding chunk OPENIO|A5B2970E8FBD895C486A853C2D3848AAF2835548AA1BEF2B251181FB07213EEE|DBB63ACFB7DA0500F2420539E536E998|6E561DA710A5027CA104712707BCEF7398A0C41939744791852B992D0A29AAF5: No spare chunk: found only 0 services matching the criteria (pool=EC573SITE): too many locations already known (12), maximum 12 locations for this storage policy

A partial fix would consist of removing duplicate chunk positions before feeding the list to the _get_spare_chunk function (this needs to be done for both EC and replication):

--- /usr/lib/python2.7/dist-packages/oio/content/ec_old.py      2022-06-08 14:57:10.918012798 +0000
+++ /usr/lib/python2.7/dist-packages/oio/content/ec.py  2022-06-08 14:56:40.222204985 +0000
@@ -58,9 +58,12 @@
         # Find a spare chunk address
         broken_list = list()

+        used = set()
+        candidates = [c for c in chunks.all() if c.pos not in used and c.pos != current_chunk.pos and (used.add(c.pos) or True)]
+
         if not allow_same_rawx and chunk_id is not None:
             broken_list.append(current_chunk)
-        spare_url, _quals = self._get_spare_chunk(chunks.all(), broken_list)
+        spare_url, _quals = self._get_spare_chunk(candidates, broken_list)
         new_chunk = Chunk({'pos': current_chunk.pos, 'url': spare_url[0]})

         # Regenerate the lost chunk's data, from existing chunks

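The deduplication logic in the patch can also be checked in isolation. A readable sketch of the same filter, using plain dicts as a stand-in for the Chunk objects (names hypothetical):

```python
def dedup_chunks(chunks, current_pos):
    """Keep one chunk per position, excluding the position being rebuilt.

    Mirrors the deduplication in the patch above, written as a loop.
    """
    used = set()
    candidates = []
    for c in chunks:
        if c["pos"] not in used and c["pos"] != current_pos:
            used.add(c["pos"])
            candidates.append(c)
    return candidates

chunks = [{"pos": p} for p in ["0.0", "0.1", "0.2", "0.2", "0.3"]]
print([c["pos"] for c in dedup_chunks(chunks, "0.0")])  # ['0.1', '0.2', '0.3']
```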
Beware, however: this can sometimes generate errors such as:

2022-06-08 14:44:52.500 194742 7FE3C2B4F370 log ERROR ERROR while rebuilding chunk OPENIO|43111ECD2732E20A123114C34E1F8E740ADA5D8256CF27A3AB77213FFFFEB678|B875B33177DA05000CB194A6F73A6CA7|01E9E28CBD0BB4E0DA1A408F3AD246D44CECD1915F6D52D496370BF23874B029: pyeclib_c_reconstruct ERROR: Insufficient number of fragments. Please inspect syslog for liberasurecode error report.

This depends on which chunks have been selected for the rebuild. We suspect that duplicate chunk positions also lead to the same fragment being uploaded at two different positions, which is in itself another issue.

vdombrovski · Jun 08 '22 15:06