
[YSQL] Fewer rows are ingested into the database when importing data into a table

Open priyanshi-yb opened this issue 2 years ago • 2 comments

Jira Link: DB-3596

Description

We tried to import data into the table with YB Voyager and are seeing missing rows in the database.

The input data sets we experimented with contain 350M, 500M, and ~700M rows, and in each case the database ended up with fewer rows than the input.

The input data file contains 700172554 rows. Result of count(*) on that table -

etl2=# select count(*) from accounts;
   count
-----------
 699441644
(1 row)

Time: 206827.708 ms (03:26.828)

DB Build name - 2.14.2.0-b16, DB Version - 11.2-YB-2.14.2.0-b0

Steps to Reproduce -

  1. Use this EC2 instance - https://us-west-2.console.aws.amazon.com/ec2/home?region=us-west-2#InstanceDetails:instanceId=i-0055874eee0c39925 (machine IP - 10.9.12.152)
  2. Cluster which can be used - pgupta-voyager-test-2 (https://portal.dev.yugabyte.com/universes/e0bd5274-9145-46a2-b27f-109c6524ac2e)
  3. Execute these DDL statements on the database to create the table -
CREATE TABLE public.accounts (
    block bigint NOT NULL,
    address text NOT NULL,
    dc_balance bigint DEFAULT 0 NOT NULL,
    dc_nonce bigint DEFAULT 0 NOT NULL,
    security_balance bigint DEFAULT 0 NOT NULL,
    security_nonce bigint DEFAULT 0 NOT NULL,
    balance bigint DEFAULT 0 NOT NULL,
    nonce bigint DEFAULT 0 NOT NULL,
    staked_balance bigint
);
ALTER TABLE ONLY public.accounts
    ADD CONSTRAINT accounts_pkey PRIMARY KEY (block, address);
  4. Run this import data command on the above machine to ingest the data of 350M rows in the table -
 time ~/sanyam_pg_stat_build/yb-voyager import data file --export-dir /data/export-dir-accounts/    --target-db-host 172.151.18.14     --target-db-user ybvoyager     --target-db-password password     --target-db-name etl2_pg    --data-dir "/data/small-data/"     --file-table-map "accounts_350m_data.sql:accounts"    --delimiter "\t" --format "text" --parallel-jobs 288 --start-clean
  5. Get the count on the DB -
 select count(*) from accounts;

Key Points about the Import -

  • ROWS_PER_TRANSACTIONS is 20000 for each COPY FROM command Voyager executes.
  • Voyager sets these session variables (a minimal sketch of one batch under these settings follows below) -
    SET client_encoding TO 'UTF8';
    SET session_replication_role TO replica;
    SET yb_enable_upsert_mode to true;
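
For reference, here is a minimal psycopg2 sketch of what one such batch looks like under these session settings. This is an illustration only, not Voyager's implementation; the connection details and row contents are placeholders.

import io
import psycopg2

ROWS_PER_TRANSACTION = 20000  # matches the batch size mentioned above

# Placeholder connection details, not the actual cluster from this issue.
conn = psycopg2.connect(host="127.0.0.1", port=5433, dbname="etl2",
                        user="ybvoyager", password="password")
with conn.cursor() as cur:
    # Session settings applied before running COPY batches.
    cur.execute("SET client_encoding TO 'UTF8'")
    cur.execute("SET session_replication_role TO replica")
    cur.execute("SET yb_enable_upsert_mode TO true")

    # One tab-delimited batch (placeholder rows; staked_balance left NULL via \N).
    batch = io.StringIO("1\taddr-1\t0\t0\t0\t0\t0\t0\t\\N\n")
    cur.copy_expert("COPY accounts FROM STDIN WITH (FORMAT text)", batch)
    conn.commit()
    # rowcount should reflect the rows reported by COPY (the "COPY n" command tag).
    print(cur.rowcount)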
    

Some other Observations -

  • We summed the return values of the COPY commands, which we log in the Voyager log files, and the total equals the number of rows we tried to ingest.
  • We also logged the tuples_processed value from pg_stat_progress_copy after each COPY and summed it up; that total also equals the number of rows we tried to ingest (a sketch of this verification follows after this list).
  • We tried disabling upsert mode as well; the import still results in fewer rows.
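
A hedged sketch of that verification, assuming the sum of the per-COPY return values has already been collected (passed in here as expected_rows); the function name and connection details are made up for this example, and the pg_stat_progress_copy columns used are the ones visible in the output further down in this thread:

import psycopg2

def verify_row_counts(conn, expected_rows):
    """Compare the expected row total against count(*) and the COPY progress view."""
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM accounts")
        actual = cur.fetchone()[0]

        # yb_status and tuples_processed as shown in the pg_stat_progress_copy output below.
        cur.execute(
            "SELECT yb_status, count(*), sum(tuples_processed) "
            "FROM pg_stat_progress_copy GROUP BY yb_status"
        )
        progress = cur.fetchall()

    print(f"expected {expected_rows} rows, count(*) reports {actual}")
    for status, copies, tuples in progress:
        print(f"  {status}: {copies} COPY commands, {tuples} tuples processed")

# Example usage with placeholder connection details:
# conn = psycopg2.connect(host="127.0.0.1", port=5433, dbname="etl2",
#                         user="ybvoyager", password="password")
# verify_row_counts(conn, expected_rows=350000000)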

Location of the YB Voyager log file on the machine mentioned above - /data/export-dir-accounts/yb-voyager_350m_disable_upsert_tuples_processed.log

Location of the actual ~700M-row data file - /data/export-dir2/accounts_data.sql

Locations of the 350M- and 500M-row data files - /data/small-data/accounts_350m_data.sql and /data/small-data/accounts_500m_data.sql

priyanshi-yb avatar Sep 21 '22 15:09 priyanshi-yb

Did you check select * from pg_stat_progress_copy; after the copy? Are there any failures? Locally, with 11 copies of 1000000 tuples each, I got one timeout:

ERROR:  Timed out: Perform RPC (request call id 812) to 127.0.0.1:9100 timed out after 120.000s

And the progress copy reflects this too:

yugabyte=# select * from pg_stat_progress_copy;
  pid  | datid | datname  | relid |  command  | type | yb_status | bytes_processed | bytes_total | tuples_processed | tuples_excluded
-------+-------+----------+-------+-----------+------+-----------+-----------------+-------------+------------------+-----------------
 76235 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | ERROR     |        18664530 |           0 |           220000 |               0
 76243 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        85083176 |           0 |          1000000 |               0
 76283 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        85365058 |           0 |          1000000 |               0
 76318 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        84801490 |           0 |          1000000 |               0
 76355 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        85361495 |           0 |          1000000 |               0
 76391 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        85057124 |           0 |          1000000 |               0
 76438 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        84940628 |           0 |          1000000 |               0
 76475 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        85349683 |           0 |          1000000 |               0
 76518 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        85282387 |           0 |          1000000 |               0
 76807 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        84821101 |           0 |          1000000 |               0
 76816 | 13288 | yugabyte | 16395 | COPY FROM | PIPE | SUCCESS   |        84796236 |           0 |          1000000 |               0
(11 rows)

And the rows are missing of course:

yugabyte=# select count(*) from accounts;
  count
----------
 10220000
(1 row)

Edit: This was checked, so it is a different issue.

def- avatar Sep 23 '22 10:09 def-

With 10 threads on an RF3 cluster I am not able to successfully load in 1000000 rows:

$ bin/yb-ctl --replication_factor 3 create --tserver_flags=enable_automatic_tablet_splitting=false --master_flags=enable_automatic_tablet_splitting=false
$ ./bulkcopy.py 100000000 10 127.0.0.1:5433
COPY 0
COPY 3
COPY 8
COPY 6
COPY 2
COPY 5
COPY 7
COPY 1
COPY 9
COPY 4
Exception in thread Thread-9 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20971 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

Exception in thread Thread-8 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.SerializationFailure: IO error: Errors occurred while reaching out to the tablet servers: . Errors from tablet servers: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20983), Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20988), Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20989), Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20992), Operation failed. Try again (yb/tablet/transaction_participant.cc:290): Unknown transaction, could be recently aborted: 5af73027-031b-4190-ba19-6b1eaa9a1da5 (pgsql error 40001)]

Exception in thread Thread-1 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20970 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

Exception in thread Thread-7 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20977 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

Exception in thread Thread-10 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20968 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

Exception in thread Thread-3 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20973 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

Exception in thread Thread-5 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20982 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

Exception in thread Thread-2 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
Exception in thread Thread-4 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  File "/Users/deen/./bulkcopy.py", line 38, in run
Exception in thread Thread-6 (run):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 953, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.SerializationFailure: IO error: Errors occurred while reaching out to the tablet servers: . Errors from tablet servers: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20989 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20989), Operation failed. Try again (yb/tablet/running_transaction.cc:474): Transaction aborted: df3dffee-f6c6-4bcd-ab87-a137234b9eac (pgsql error 40001), Operation failed. Try again (yb/tablet/transaction_participant.cc:209): Transaction was recently aborted: df3dffee-f6c6-4bcd-ab87-a137234b9eac: 40001 (pgsql error 40001) (transaction error 1)]

    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20981 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

    self._target(*self._args, **self._kwargs)
  File "/Users/deen/./bulkcopy.py", line 38, in run
    cur.copy_expert('copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)', f)
psycopg2.errors.InternalError_: Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20967 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]

20880000

https://gist.github.com/def-/44ad86e43bf976300c6976d29e261287. Postgres log:

I0923 13:57:38.380878 57640320 pg_client.cc:132] S 52: Session id acquired
I0923 14:30:06.352149 57640320 thread_pool.cc:170] Starting thread pool { name: pggate_ybclient queue_limit: 10000 max_workers: 1024 }
I0923 14:30:06.370754 57640320 pggate.cc:139] InitTServerSharedObject: 0, 0, 4
W0923 14:30:06.384006 57640320 hybrid_clock.cc:133] HybridClock initialized in local mode (OS X only). Not suitable for distributed clusters.
I0923 14:30:06.390792 57640320 pg_client.cc:125] Using TServer host_port: 127.0.0.1:9100
I0923 14:30:06.535517 57640320 pg_client.cc:132] S 53: Session id acquired
2022-09-23 14:33:37.757 CEST [82356] ERROR:  Already present: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20968 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20967)]
[...]
2022-09-23 14:33:38.637 CEST [82348] ERROR:  IO error: Errors occurred while reaching out to the tablet servers: . Errors from tablet servers: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20989 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20989), Operation failed. Try again (yb/tablet/running_transaction.cc:474): Transaction aborted: df3dffee-f6c6-4bcd-ab87-a137234b9eac (pgsql error 40001), Operation failed. Try again (yb/tablet/transaction_participant.cc:209): Transaction was recently aborted: df3dffee-f6c6-4bcd-ab87-a137234b9eac: 40001 (pgsql error 40001) (transaction error 1)]
422 2022-09-23 14:33:38.637 CEST [82348] STATEMENT:  copy accounts (block, address, dc_balance, dc_nonce,     security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)
2022-09-23 14:33:38.643 CEST [82354] ERROR:  IO error: Errors occurred while reaching out to the tablet servers: . Errors from tablet servers: [Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20983), Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20988), Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20989), Already present (yb/consensus/retryable_requests.cc:244): Duplicate request 20993 from client 0345d34f-eb41-48d7-a73c-d7c3596e2de8 (min running 20992), Operation failed. Try again (yb/tablet/transaction_participant.cc:290): Unknown transaction, could be recently aborted: 5af73027-031b-4190-ba19-6b1eaa9a1da5 (pgsql error 40001)]
2022-09-23 14:33:38.643 CEST [82354] STATEMENT:  copy accounts (block, address, dc_balance, dc_nonce, security_balance, security_nonce, balance, nonce, staked_balance) from stdin with (format csv)

def- avatar Sep 23 '22 12:09 def-

Edit: 9900 rows is actually expected; I am trying with a larger amount of data now.

Original text: With these changes I have now managed to reproduce a successful bulk copy with missing rows in the end:

  • 300 threads
  • Use 2.14.2.0-b25 version instead of latest master
  • Enable upsert mode
  • Use Linux instead of macOS
  • Create trigger

A successful run of bin/yb-ctl create --rf3 && sleep 1m && ./bulkcopy.py 10000 300 127.0.0.1:5433,127.0.0.2:5433,127.0.0.3:5433 should have inserted 10000 rows, but I'm seeing fewer:
yugabyte=# select count(*) from accounts;
 count
-------
  9900
(1 row)

There was a postgres crash during this execution:

[dfelsing@dev-server-dfelsing yugabyte-data]$ coredumpctl gdb 1070136
           PID: 1070136 (postgres)
           UID: 2200 (dfelsing)
           GID: 1500 (ybdev)
        Signal: 11 (SEGV)
     Timestamp: Tue 2022-09-27 16:29:58 UTC (5min ago)
  Command Line: /nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/postgres/bin/postgres --boot -x1 -X 16777216 -F
    Executable: /nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/postgres/bin/postgres
 Control Group: /user.slice/user-2200.slice/session-5856.scope
          Unit: session-5856.scope
         Slice: user-2200.slice
       Session: 5856
     Owner UID: 2200 (dfelsing)
       Boot ID: 3cad0bcb69ef4825b9977abd16c29451
    Machine ID: 8ce570b287c573260282d2e8edd33beb
      Hostname: dev-server-dfelsing
       Storage: /var/lib/systemd/coredump/core.postgres.2200.3cad0bcb69ef4825b9977abd16c29451.1070136.1664296198000000.lz4
       Message: Process 1070136 (postgres) of user 2200 dumped core.

                Stack trace of thread 1070136:
                #0  0x00007fa96299a044 _ZNK2yb19SharedMemorySegment10GetAddressEv (libyb_util.so)
                #1  0x00007fa967729bcb _ZN2yb6pggate8PgClient5StartEPNS_3rpc10ProxyCacheEPNS2_9SchedulerERKNS_18SharedMemoryObjectINS_7tserver17TServerSharedDataEEE (libyb_pggate.so)
                #2  0x00007fa96771948b _ZN2yb6pggate9PgApiImplC1ENS0_12PgApiContextEPK12PgTypeEntityi11PgCallbacks (libyb_pggate.so)
                #3  0x00007fa96770c15d _ZN2yb6pggate15YBCInitPgGateExEPK12PgTypeEntityi11PgCallbacksPNS0_12PgApiContextE (libyb_pggate.so)
                #4  0x0000000000b22d43 YBInitPostgresBackend (postgres)
                #5  0x0000000000b048f7 InitPostgres (postgres)
                #6  0x00000000005c8c80 BootstrapModeMain (postgres)
                #7  0x00000000005c89f9 AuxiliaryProcessMain (postgres)
                #8  0x00000000007c9dd4 PostgresServerProcessMain (postgres)
                #9  0x00000000004f66e2 main (postgres)
                #10 0x00007fa9666f1825 __libc_start_main (libc.so.6)
                #11 0x00000000004f65e9 _start (postgres)

                Stack trace of thread 1070137:
                #0  0x00007fa9667b79f3 epoll_wait (libc.so.6)
                #1  0x00007fa962b1ff0e _ZN5boost4asio6detail13epoll_reactor3runElRNS1_8op_queueINS1_19scheduler_operationEEE (libyrpc.so)
                #2  0x00007fa962b1d34e _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (libyrpc.so)
                #3  0x00007fa962b1ce55 _ZN2yb3rpc12IoThreadPool4Impl7ExecuteEv (libyrpc.so)
                #4  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #5  0x00007fa96707a694 start_thread (libpthread.so.0)
                #6  0x00007fa9667b741d __clone (libc.so.6)

                Stack trace of thread 1070141:
                #0  0x00007fa9667b79f3 epoll_wait (libc.so.6)
                #1  0x00007fa9626a8f88 epoll_poll (libev.so.4)
                #2  0x00007fa9626ac9ae ev_run (libev.so.4)
                #3  0x00007fa962b3fe2c _ZN2yb3rpc7Reactor9RunThreadEv (libyrpc.so)
                #4  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #5  0x00007fa96707a694 start_thread (libpthread.so.0)
                #6  0x00007fa9667b741d __clone (libc.so.6)

                Stack trace of thread 1070138:
                #0  0x00007fa96707f00f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007fa962b1d2d6 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (libyrpc.so)
                #2  0x00007fa962b1ce55 _ZN2yb3rpc12IoThreadPool4Impl7ExecuteEv (libyrpc.so)
                #3  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #4  0x00007fa96707a694 start_thread (libpthread.so.0)
                #5  0x00007fa9667b741d __clone (libc.so.6)

                Stack trace of thread 1070143:
                #0  0x00007fa9667b79f3 epoll_wait (libc.so.6)
                #1  0x00007fa9626a8f88 epoll_poll (libev.so.4)
                #2  0x00007fa9626ac9ae ev_run (libev.so.4)
                #3  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #4  0x00007fa96707a694 start_thread (libpthread.so.0)
                #5  0x00007fa9667b741d __clone (libc.so.6)

                Stack trace of thread 1070140:
                #0  0x00007fa96707f00f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007fa962b1d2d6 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (libyrpc.so)
                #2  0x00007fa962b1ce55 _ZN2yb3rpc12IoThreadPool4Impl7ExecuteEv (libyrpc.so)
                #3  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #4  0x00007fa96707a694 start_thread (libpthread.so.0)
                #5  0x00007fa9667b741d __clone (libc.so.6)

                Stack trace of thread 1070142:
                #0  0x00007fa9667b79f3 epoll_wait (libc.so.6)
                #1  0x00007fa9626a8f88 epoll_poll (libev.so.4)
                #2  0x00007fa9626ac9ae ev_run (libev.so.4)
                #3  0x00007fa962b3fe2c _ZN2yb3rpc7Reactor9RunThreadEv (libyrpc.so)
                #4  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #5  0x00007fa96707a694 start_thread (libpthread.so.0)
                #6  0x00007fa9667b741d __clone (libc.so.6)

                Stack trace of thread 1070139:
                #0  0x00007fa96707f00f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007fa962b1d2d6 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (libyrpc.so)
                #2  0x00007fa962b1ce55 _ZN2yb3rpc12IoThreadPool4Impl7ExecuteEv (libyrpc.so)
                #3  0x00007fa9629b7c51 _ZN2yb6Thread15SuperviseThreadEPv (libyb_util.so)
                #4  0x00007fa96707a694 start_thread (libpthread.so.0)
                #5  0x00007fa9667b741d __clone (libc.so.6)

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-16.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/postgres/bin/postgres...done.

warning: Can't open file (null) during file-backed mapping note processing

warning: Can't open file (null) during file-backed mapping note processing
[New LWP 1070136]
[New LWP 1070137]
[New LWP 1070141]
[New LWP 1070138]
[New LWP 1070143]
[New LWP 1070140]
[New LWP 1070142]
[New LWP 1070139]

warning: File "/nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/linuxbrew/lib/libthread_db.so.1" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/linuxbrew/lib/libthread_db.so.1
line to your configuration file "/nfusr/dev-server/dfelsing/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/nfusr/dev-server/dfelsing/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: File "/nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/linuxbrew/lib/libthread_db.so.1" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/postgres/bin/postgres --boot -x1 -'.
Program terminated with signal SIGSEGV, Segmentation fault.
Dwarf Error: Cannot find DIE at 0x499820 referenced from DIE at 0x49182a [in module /nfusr/dev-server/dfelsing/yugabyte-2.14.2.0/lib/yb/libyb_util.so]

I have previously reported this crash at https://github.com/yugabyte/yugabyte-db/issues/13141; it can be reproduced reliably. bulkcopy.py: https://gist.github.com/def-/0c4eedec1c084eedb2b89855b7ad7b16 (a rough sketch of this kind of reproducer follows below).
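
The gist itself is not reproduced here; as a hedged sketch only (placeholder connection details and generated rows, not the actual script), a threaded bulk-copy reproducer of this shape looks roughly like:

import io
import sys
import threading
import psycopg2

def run(thread_id, rows, host_port):
    """Each thread streams its own generated CSV slice into accounts via COPY."""
    host, port = host_port.split(":")
    conn = psycopg2.connect(host=host, port=port, dbname="yugabyte", user="yugabyte")
    # block = thread_id keeps the (block, address) primary key unique per thread;
    # the trailing empty field leaves staked_balance NULL in CSV format.
    csv_rows = "".join(f"{thread_id},addr-{thread_id}-{i},0,0,0,0,0,0,\n" for i in range(rows))
    with conn.cursor() as cur:
        cur.copy_expert(
            "copy accounts (block, address, dc_balance, dc_nonce, security_balance, "
            "security_nonce, balance, nonce, staked_balance) from stdin with (format csv)",
            io.StringIO(csv_rows),
        )
    conn.commit()
    print(f"COPY {thread_id}")

if __name__ == "__main__":
    total_rows, n_threads = int(sys.argv[1]), int(sys.argv[2])
    hosts = sys.argv[3].split(",")
    threads = [
        threading.Thread(target=run, args=(t, total_rows // n_threads, hosts[t % len(hosts)]))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()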

Edit2: The new attempt uses retry logic similar to yb-voyager's: https://gist.github.com/def-/035d33aa52d7ac2b3aeb36d06417ddd4. So far I am seeing correct results with this (python3 ./bulkcopy.py 5760000 288 ...), with no missing rows; I will try longer runs now. An illustrative retry wrapper is sketched below.
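
The retry gist is not inlined either; purely as an illustration (the error classes caught and the backoff policy are assumptions, not necessarily what the gist or yb-voyager does), a retry wrapper around copy_expert could look like:

import time
import psycopg2.errors

def copy_with_retry(conn, copy_sql, make_batch, max_attempts=10):
    """Re-run a COPY batch when the server reports a retryable error.

    make_batch must return a fresh file-like object on each call, because the
    stream is consumed on every attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            with conn.cursor() as cur:
                cur.copy_expert(copy_sql, make_batch())
            conn.commit()
            return
        except (psycopg2.errors.SerializationFailure, psycopg2.errors.InternalError_) as e:
            # These are the error classes seen in the tracebacks above; retrying
            # InternalError_ at all is an assumption made for this sketch.
            conn.rollback()
            print(f"attempt {attempt} failed, retrying: {e}")
            time.sleep(attempt)  # simple linear backoff
    raise RuntimeError("COPY batch failed after retries")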

def- avatar Sep 27 '22 16:09 def-

What I am seeing now on a 6-node RF3 universe, after initially running a truncate table accounts, is that there are missing rows in the table:

yugabyte=# select count(*) from accounts;
  count
---------
 6839990
(1 row)

I was copying in 20000 rows on each call and had 349 successful calls (with all rows written according to the affected-row counts), so that should be 6980000 rows; instead, 140010 rows are missing. 140000 of them could be explained by 7 COPY commands that failed but reported success, but that still leaves 10 unexplained missing rows. There are no failing health checks on the universe.

This has now happened three times separately. Once, on a larger cluster, I saw a row count of 585261124 even though the writes were all reported successful with 20000 rows affected each (python3 ./bulkcopyfile.py 288 35008 ...). Another time, on a smaller cluster again (./bulkcopyfile.py 400 400 ...), I saw 5257006 rows versus 5440000 rows cumulatively reported as affected by the COPYs. The row counts stay stable, though, unlike in #14109. https://gist.github.com/def-/478b75fffa5ae1c9e738eb2ac3e101c9 What I noticed is that it is easier to reproduce with a very high load and many retries; without the retry logic I could not reproduce this. Whether the initial truncate table is responsible, I am not sure yet.

def- avatar Sep 28 '22 15:09 def-

I cannot reproduce this issue anymore with https://github.com/yugabyte/yugabyte-db/commit/03fc5792142beee569bdb6a0406e86fcfcbde666, closing as fixed.

def- avatar Oct 25 '22 14:10 def-

https://github.com/yugabyte/yugabyte-db/issues/14109 tracks the main problem

rthallamko3 avatar Oct 26 '22 16:10 rthallamko3