mpifileutils icon indicating copy to clipboard operation
mpifileutils copied to clipboard

dsync not updating times correctly

Open dvickernasa opened this issue 2 years ago • 92 comments

Hello,

I recently discovered mpifileutils and they are awesome. I moved about 200 TB in 4 million files between two different lustre filesystems over the weekend. Thank you for your work on these tools. After doing an initial data move, my intent was to keep the data in sync for a period of time using rsync. With the bulk of the data transferred, this should go fast but it would also catch any file types that `dsync doesn't handle (hard links, etc.). Here is the script I'm using.

source=/nobackup/PROJECTS/ccdev/
dest=/nobackup2/PROJECTS/ccdev/

mpiexec --allow-run-as-root dsync -X non-lustre $source $dest > dsync-ccdev.out.$( date +%F )

rsync -aPHASx $source $dest > rsync-ccdev.out.$( date +%F )

The problem I'm running into is that rsync is copying files that dync already copied. Here is one example of a file that rsync is trying to transfer again.

Source:

238412051 -rw-r----- 1 dvicker eg3 244210724864 Aug 23 16:38 /nobackup/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

Destination:

238487228 -rw-r----- 1 dvicker eg3 244210724864 Sep 29 17:15 /nobackup2/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

The file size it correct but it looks like the date on the destination is wrong - this is the date that that the file got created. After poking around in the destination more, it looks like this is prevalent - dates were not preserved. This will trip up rsync's default algorithm of checking by date and size. I could switch to rsync's checksum mode to get around this but that will be quite a bit more time consuming for this dataset.

Is this a bug with dsync or am I possibly doing something wrong? I'm using version 0.11.1 of mpifileutils.

dvickernasa avatar Oct 02 '23 21:10 dvickernasa

Changed the subject since the root problem here is the time stamp info not being preserved.

Here is a more comprehensive example. There is an odd mix of correct and incorrect time stamps in the destination.

[dvicker@monte dsync]$ ls -l /nobackup/PROJECTS/ccdev/spacex/aero/adb/ar/
total 24040207564
  781477845 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        187 -rw-rw---- 1 dvicker  ccp_spacex         742598 Aug 15 11:12 ar-108.squashfs.toc
 5835780999 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       1721 -rw-rw---- 1 ema      ccp_spacex        9413597 Aug 24 03:22 ar-109.squashfs.toc
   30794854 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Oct 29  2022 ar-126.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  561289162 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Oct 29  2022 ar-127.squashfs
        233 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Aug 15 11:12 ar-127.squashfs.toc
 1042993450 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
       2651 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Aug 15 11:13 ar-131.squashfs.toc
  882730771 -rw-rw---- 1 ema      ccp_spacex   899033808896 Aug 24 11:42 ar-132.squashfs
       4046 -rw-rw---- 1 ema      ccp_spacex       22383607 Aug 24 11:42 ar-132.squashfs.toc
 1777595522 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Oct 28  2022 ar-133.squashfs
       1721 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
  101115096 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Oct 28  2022 ar-134.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753055384 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
        466 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Aug 15 11:13 ar-135.squashfs.toc
  158022760 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Oct 29  2022 ar-136.squashfs
         94 -rw-rw---- 1 dvicker  ccp_spacex         225492 Aug 15 11:13 ar-136.squashfs.toc
11896898702 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      32762 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Aug 20 22:27 ar-138.squashfs.toc
   46078532 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Oct 27  2022 ar-140.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          99596 Aug 15 11:13 ar-140.squashfs.toc
   69136835 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          61816 Aug 15 11:13 ar-144.squashfs.toc
  103192925 -rw-r--r-- 1 dvicker  eg3          105696878592 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs
        629 -rw-r--r-- 1 dvicker  eg3               3418271 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
         36 -rwxr-xr-x 1 aschwing ccp_spacex            203 Feb 27  2019 update_perms.csh*
[dvicker@monte dsync]$ ls -l /nobackup2/PROJECTS/ccdev/spacex/aero/adb/ar/
total 23852316152
  772498852 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        728 -rw-rw---- 1 dvicker  ccp_spacex         742598 Sep 29 22:48 ar-108.squashfs.toc
 5837342120 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       9196 -rw-rw---- 1 ema      ccp_spacex        9413597 Sep 29 22:03 ar-109.squashfs.toc
   24560752 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Sep 29 20:41 ar-126.squashfs
         88 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  555208412 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Sep 29 23:53 ar-127.squashfs
        980 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Sep 30 00:05 ar-127.squashfs.toc
 1031594016 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
      14184 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Sep 29 17:44 ar-131.squashfs.toc
  877963384 -rw-rw---- 1 ema      ccp_spacex   899033808896 Sep 29 19:22 ar-132.squashfs
      21860 -rw-rw---- 1 ema      ccp_spacex       22383607 Sep 30 00:14 ar-132.squashfs.toc
 1773086904 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Sep 29 20:43 ar-133.squashfs
       9136 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
   99878836 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Sep 30 00:08 ar-134.squashfs
         92 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753046000 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
       2080 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Sep 29 22:02 ar-135.squashfs.toc
  157828056 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Sep 29 23:49 ar-136.squashfs
        224 -rw-rw---- 1 dvicker  ccp_spacex         225492 Sep 29 17:19 ar-136.squashfs.toc
11765599492 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      63144 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Sep 29 23:49 ar-138.squashfs.toc
   37663536 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Sep 29 16:54 ar-140.squashfs
        100 -rw-rw---- 1 dvicker  ccp_spacex          99596 Sep 29 22:03 ar-140.squashfs.toc
   62700864 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         64 -rw-rw---- 1 dvicker  ccp_spacex          61816 Sep 30 02:00 ar-144.squashfs.toc
  103219708 -rw-r--r-- 1 dvicker  eg3          105696878592 Sep 30 02:26 Dragon2_RCS_JI_Joe_MSFC.squashfs
       3340 -rw-r--r-- 1 dvicker  eg3               3418271 Sep 29 16:46 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
          4 -rwxr-xr-x 1 aschwing ccp_spacex            203 Sep 29 16:46 update_perms.csh*
[dvicker@monte dsync]$ 

dvickernasa avatar Oct 02 '23 21:10 dvickernasa

Hi, @dvickernasa . Great to hear mpiFileUtils has been helpful. Thanks.

dsync should be setting the atime and mtime on the destination files to match the source file. However, it does that as the very last step, after all files have been copied. I wonder if the job might have been interrupted or hit a fatal error before it could complete that step.

Do you see any such errors or early termination from your job output log?

Tip: When copying a large number of files, I like to recommend using the --batch-files option, as in:

dsync --batch-files 100000 ...

This changes dsync to copy files in batches rather than all files at once. It sets the timestamps at the completion of each batch. If the job is interrupted, one can then execute the same dsync command again. dsync skips copying any files that were completed in earlier runs. By default, it avoids copying any file whose destination exists and has a matching file type, size, and mtime.

adammoody avatar Oct 03 '23 20:10 adammoody

That is possible. Here is the end of the output from the dsync run.

[2023-09-30T08:19:28] Syncing directory updates to disk.
[2023-09-30T08:19:33] Sync completed in 4.942 seconds. 
[2023-09-30T08:19:33] Started: Sep-29-2023,16:34:18
[2023-09-30T08:19:33] Completed: Sep-30-2023,08:19:33
[2023-09-30T08:19:33] Seconds: 56715.151
[2023-09-30T08:19:33] Items: 1076815
[2023-09-30T08:19:33]   Directories: 35829
[2023-09-30T08:19:33]   Files: 1037753
[2023-09-30T08:19:33]   Links: 3233
[2023-09-30T08:19:33] Data: 48.152 TiB (52944066031967 bytes)
[2023-09-30T08:19:33] Rate: 890.263 MiB/s (52944066031967 bytes in 56715.151 seconds)
[2023-09-30T08:19:33] Updating timestamps on newly copied files
[2023-09-30T08:19:33] Completed updating timestamps
[2023-09-30T08:19:33] Completed sync
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49249,1],5]
  Exit code:    1
--------------------------------------------------------------------------   

I'll start using the --batch-files option, thanks for the tip.

I'm attempting to recover from this using an rsync command to update the time stamps.

rsync -rv --size-only --times $source $dest >> rsync-ccdev.out.times.$( date +%F )

But its going fairly slow. Woud just repeating the dsync update the timestamps?

dvickernasa avatar Oct 03 '23 20:10 dvickernasa

Actually, the data I copied was done with two separate dsync runs since they were in two different areas. I've attached the output from both runs. The ccdev run showed the error, but the cev-repo run did not. But the destinations for both runs have files with incorrect time stamps. I will attached the full output of both runs.

Any recommendations for fixing the time stamps would be appreciated. The rsync command I posted above should do it, but its slow. I'm thinking dsync won't work since the output contains Updating timestamps on newly copied files. That makes it sound like dsync won't update the timestamps on the files that were previously copied.

dsync-ccdev.out.2023-09-29.txt dsync-cev-repo.out.2023-09-29.txt

dvickernasa avatar Oct 03 '23 20:10 dvickernasa

I've been doing some tests this afternoon using a smaller dataset. Basically did this:

  • dsync /src /dest
  • cd /dest ; lfs find . | xargs touch
  • dsync /src /dest

Even though no actual data was transferred for the 2nd dsync (I think, the 2nd transfer went fast), the time stamps in the destination got updated to their source time stamps. I used rsync after the 2nd dsync to see if it found any files it wanted to transfer again and it did not, indicating that the size and times of all files did indeed get updated.

I've restarted the dsync's of my original, large datasets so we'll see what happens.

dvickernasa avatar Oct 03 '23 23:10 dvickernasa

Ok, thanks for the update. I'm still puzzled as to why the timestamps aren't being set correctly the first time.

How many nodes and processes are you using?

adammoody avatar Oct 04 '23 00:10 adammoody

Does it do a better job at setting timestamps the first time if you add the --direct option?

adammoody avatar Oct 04 '23 01:10 adammoody

@adammoody is there a reason not to set the timestamps on a file immediately after it is finished processing? That would avoid the risk of missing them being set, and also reduces overhead on the filesystem because it doesn't need to do another set of filename lookups on all of the files.

It is understandable that the timestamps on the directories should be deferred until all of the entries in the directory are updated (and also to avoid setting the directory timestamp millions of times for a large directory), but that is not the case for regular files.

At worst, if a large file is being written by multiple MPI ranks concurrently, the timestamp might be set multiple times, but that is fine given it only happens for very large files, and one of the threads will be the last one writing and will reset the timestamp when it is finished.

Separately, is the "--batch-files" option mainly to work around the lack of timestamp updates on files, or are there other reasons to use it? Should this be set to eg. 10000, or 100 x mpi_ranks, or similar by default? That would increase the directory timestamp updates marginally, but is unlikely to hurt overall transfer performance significantly (I guess some amount of straggler/idle thread overhead for each batch).

adilger avatar Oct 04 '23 06:10 adilger

How many nodes and processes are you using?

I'm using 16 cores on a single workstation for these transfers.

Does it do a better job at setting timestamps the first time if you add the --direct option?

In my small tests, I wasn't using --direct and the times were getting set properly. I verified by following up with an rsync, which didn't copy anything new. I can try a big transfer with --direct next time and see if that works better.

The 2nd dsync for the large dataset just went OK. The intent here was just to correct the times. For some reason, this happened:

[2023-10-03T18:42:03] Comparing file sizes and modification times of 3093774 items
[2023-10-03T18:42:03] Started   : Oct-03-2023, 18:42:01
[2023-10-03T18:42:03] Completed : Oct-03-2023, 18:42:03
[2023-10-03T18:42:03] Seconds   : 2.043
[2023-10-03T18:42:03] Items     : 3093774
[2023-10-03T18:42:03] Item Rate : 3093774 items in 2.042868 seconds (1514426.851390 items/sec)
[2023-10-03T18:42:03] Deleting items from destination
[2023-10-03T18:42:03] Removing 1740489 items ...

<cut>

[2023-10-03T18:46:44] Removed 1676960 items (96.35%) in 280.076 secs (5987.513 items/sec) 10 secs remaining ...
[2023-10-03T18:46:54] Removed 1739691 items (99.95%) in 290.075 secs (5997.386 items/sec) 0 secs remaining ...
[2023-10-03T18:46:54] Removed 1740489 items (100.00%) in 290.399 secs (5993.446 items/sec) done
[2023-10-03T18:46:54] Removed 1740489 items in 290.457 seconds (5992.237 items/sec)
[2023-10-03T18:46:54] Copying items to destination
[2023-10-03T18:46:54] Copying to /nobackup2/PROJECTS/orion/cev-repo
[2023-10-03T18:46:54] Items: 1740490
[2023-10-03T18:46:54]   Directories: 0
[2023-10-03T18:46:54]   Files: 1740490
[2023-10-03T18:46:54]   Links: 0
[2023-10-03T18:46:54] Data: 94.113 TiB (56.699 MiB per file)

The source data is almost completely static so I'm not sure why dsync decided it needed to delete and recopy about half of the data. Perhaps the incorrect dates from the first run threw it off? Its hard to tell for sure but it looked like there were far fewer than 50% of the files with wrong dates in the destination from the 1st dsync. Any thoughts on this mass deletion would be welcome.

dvickernasa avatar Oct 04 '23 17:10 dvickernasa

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

From your output logs, I don't see the types of errors that I'd expect to see if the job had been interrupted. It looks to print the message that it has successfully updated all timestamps. It looks buggy to me that it somehow failed.

One guess is that perhaps a trailing write came in to the file after the timestamp update had been processed on the metadata server. That shouldn't be happening, but maybe it is. I don't think that would happen either for a single node run or when using --direct which uses O_DIRECT, which is what looking for with those questions.

I'll also run some tests to see if I can reproduce that locally.

adammoody avatar Oct 04 '23 20:10 adammoody

Thank you for following along and for your suggestions, @adilger .

Yes, we currently set the timestamps once at the end to handle files that are split among multiple ranks. We don't have communication to know when all ranks are done with a given file, so instead we wait until all ranks are done with all files.

The --batch-files option is there to add intermediate global sync points for completing files early. It may provide some secondary benefit to sync procs up more frequently and perhaps improve work balance, but it's mainly there to set timestamps before waiting until the very end.

Picking the best batch size is not well defined. It effectively acts like a checkpoint operation, so it makes sense to pick a batch size based on the expected runtime and failure rate. It also depends on amount of data and the file sizes. I suggested 100k here since there were 4m files in total, so 100k just seemed like a decent checkpoint frequency. Pretty arbitrary...

I like the idea of setting the timestamp multiple times from multiple ranks. That does add extra load on the metadata server, but that's probably not a huge hit.

However, a remaining challenge is to figure out how to handle setting other properties like ownership, permission bits, and extended attributes. We set those at the end, and then we set the timestamps as the very last step. This ensures that the mtime on the destination only matches the source after the file has been fully copied (all data and attributes).

I know some properties can't be set too early, since for example, some of those can disable write access to the destination file.

And I think we'd need to consider the case where dsync might be interrupted and leaves a destination file with the correct size and mtime before all ranks had actually completed their writes. (The file size can be correct since the last bytes in the file might actually be written first.) We could perhaps write to a temporary file and rename to handle that, but then I think we run into a similar problem of knowing when it's time to do the rename.

Have more insight into those?

adammoody avatar Oct 04 '23 21:10 adammoody

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

Given this, I think @adilger's suggestion of writting the times right after the file write makes a lot of sense if its not too much of a performance hit. If dsync's behavior is to delete the file and start over again if the times don't match, it would be very beneficial to get the times updated ASAP so that if the transfer is interrupted, the next dsync can avoid as much re-transfer as possible.

One other possibility is implementing rsync's --partial behavior. If that is turned on, the part of the file that is transferred is kept and subsequent rsync runs pick up where the the previous left off. Of course, this would also require implementing the delta-transfer algorithm and I have no idea how much effort that would entail.

dvickernasa avatar Oct 04 '23 22:10 dvickernasa

One option for detecting partial file transfer would be to check the blocks allocated to the file against the source file. If they are grossly smaller on the target, then it is clearly not finished copying the interior.

Alternately (and this might be better for performance reasons as well) you could make the work distribution function smarter and have only a single rank per OST object and copy the objects linearly to avoid holes. With modern Lustre, large O_DIRECT writes are much faster than buffered writes 20GB/s vs. 2GB/s).

Also, I imagine you only split a file into multiple parts to copy in parallel once it is over 1GiB in allocated space, so the cost of doing an MPI comm to the other ranks is pretty tiny compared to the file data moved, so making the client do multiple locks/opens for these files is probably more work in total.

adilger avatar Oct 05 '23 01:10 adilger

Sorry, I know this isn't directly related to this issue, but I have a question about dsync's behavior. Here is an example of the end of a transfer.

[2023-10-03T19:32:44] Copied 2.673 TiB (66%) in 2738.947 secs (0.999 GiB/s) 1438 secs left ...
[2023-10-03T19:32:51] Copied 2.684 TiB (66%) in 2746.006 secs (1.001 GiB/s) 1424 secs left ...
[2023-10-03T19:33:04] Copied 2.695 TiB (66%) in 2759.073 secs (1.000 GiB/s) 1414 secs left ...
[2023-10-03T19:33:14] Copied 2.707 TiB (66%) in 2769.405 secs (1.001 GiB/s) 1401 secs left ...
[2023-10-03T19:33:24] Copied 2.719 TiB (67%) in 2779.124 secs (1.002 GiB/s) 1388 secs left ...
[2023-10-03T19:33:34] Copied 2.730 TiB (67%) in 2789.049 secs (1.002 GiB/s) 1376 secs left ...
[2023-10-03T19:33:49] Copied 2.738 TiB (67%) in 2804.159 secs (1.000 GiB/s) 1370 secs left ...
[2023-10-03T19:34:04] Copied 2.755 TiB (68%) in 2818.953 secs (1.001 GiB/s) 1353 secs left ...
[2023-10-03T19:34:15] Copied 2.766 TiB (68%) in 2829.957 secs (1.001 GiB/s) 1341 secs left ...
[2023-10-03T19:34:21] Copied 2.777 TiB (68%) in 2836.497 secs (1.002 GiB/s) 1328 secs left ...
[2023-10-03T19:34:39] Copied 2.788 TiB (68%) in 2854.734 secs (1.000 GiB/s) 1320 secs left ...
[2023-10-03T19:34:48] Copied 2.802 TiB (69%) in 2863.235 secs (1.002 GiB/s) 1303 secs left ...
[2023-10-03T19:35:08] Copied 2.813 TiB (69%) in 2883.456 secs (0.999 GiB/s) 1295 secs left ...
[2023-10-03T20:09:39] Copied 2.831 TiB (69%) in 4954.226 secs (599.206 MiB/s) 2180 secs left ...
[2023-10-03T20:09:39] Copied 4.077 TiB (100%) in 4954.227 secs (862.846 MiB/s) done
[2023-10-03T20:09:39] Copy data: 4.077 TiB (4482383360736 bytes)
[2023-10-03T20:09:39] Copy rate: 862.846 MiB/s (4482383360736 bytes in 4954.227 seconds)
[2023-10-03T20:09:39] Syncing data to disk.
[2023-10-03T20:09:45] Sync completed in 6.131 seconds.

Note that the percentage is progressing along fairly linearly, then suddenly jumps a lot and finishes much earlier than predicted. I see this happen a lot. Do you know what's going on there?

My 2nd dsync of the large dataset finished and I have an rsync running to verify it doesn't find anything else to transfer. Its going to take awhile for that to finish but its been running most of the day and it hasn't found anything yet, which is a good sign. My original rsync verification starting finding files it wanted to transfer almost right away.

dvickernasa avatar Oct 05 '23 22:10 dvickernasa

The missing progress messages is a known problem. This is also noted in https://github.com/hpc/mpifileutils/issues/530

These progress messages are implemented using a tree-based communication pattern across ranks using point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its data, and prints the final result. These reductions are asynchronous. Ranks periodically check for and respond to these messages as they come in.

I think what probably happens for these progress messages to get stalled is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread or maybe use async I/O routines so that the process can still make MPI calls while waiting on its current I/O operation to complete.

adammoody avatar Oct 06 '23 18:10 adammoody

Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.

adilger avatar Oct 06 '23 20:10 adilger

Yes, I think timeouts on the reduction operation could also work. I think we'd need to pair that with adding sequence numbers to the messages or internal counters so that a parent can distinguish late-arriving responses from its children.

It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. The reported progress would appear to bounce up and down during the run if such stalls are frequent as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops...

There may be another advantage to moving the I/O to a thread. For algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.

adammoody avatar Oct 06 '23 21:10 adammoody

Oh, wait. I'm describing the point-to-point reduction in libcircle:

https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97

There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that:

https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c

That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.

adammoody avatar Oct 06 '23 21:10 adammoody

Since this is getting a bit involved, let's migrate discussion about missing progress messages to issue https://github.com/hpc/mpifileutils/issues/530.

adammoody avatar Oct 06 '23 23:10 adammoody

I've got a few more datapoints on setting the times.

First of all, the rsync verification of our 141 TB, 2.7M files dataset is still going. Its been running since 10/5. We closed off the permissions of the top level directory to ensure people can't access the data. So far, rsync has found about a couple dozen files that it wanted to re-transfer.

Over the weekend, we had another few sets of data we needed to transfer, each set containing about 250 GB and 1.7M files. Using a single node we can transfer one set in less than an hour using dsync. However, we found more indications of timestamps getting set incorrectly. This time we kept running dsync. After 2 or 3 times of dsync finding new files to transfer, we could get to the point where dsync itself no longer found new files to transfer. This was still not using --direct. The next time I use dsync, I'll try using --direct.

dvickernasa avatar Oct 11 '23 22:10 dvickernasa

To be sure, your latest transfer of 1.7M files saw mismatched timestamps, and this only ran dsync using a single node, right? That info helps to eliminate one theory I had.

And this is a Lustre-to-Lustre transfer, right?

Did you see any errors reported in the output of that first transfer?

adammoody avatar Oct 13 '23 16:10 adammoody

Correct - single node and lustre to lustre.

dvickernasa avatar Oct 16 '23 14:10 dvickernasa

Sorry, to answer your questions about errors, Yes, there were some errors. They did a couple intermediate transfers while the datasets were in use and being modified in an effort to reduce the time it took to transfer the data when the runs were completed.

ERROR: Source file `XXX' shorter than expected 4007 (errno=0 Success)

Although, I found this odd (the source data was smaller not larger), we didn't spend any time digging into this.

One the job were done and the data was static, the transfer did show just a couple of errors.

ERROR: Failed to open output file `XXX' (errno=13 Permission denied)
ERROR: Failed to change ownership on `XXX' lchown() (errno=13 Permission denied)

Which is kind of strange since the source and destination directories were both owned by the person doing the transfer. One more dsync run got all files and finished without errors.

dvickernasa avatar Oct 16 '23 14:10 dvickernasa

FYI, I started another transfer on Thursday. This one was quite large WRT the number of files (44.7M)

[2023-10-12T14:42:45] Copying to /nobackup2/PROJECTS/orion/gnc/em1data-copy
[2023-10-12T14:42:46] Items: 44656137
[2023-10-12T14:42:46]   Directories: 3593494
[2023-10-12T14:42:46]   Files: 39432055
[2023-10-12T14:42:46]   Links: 1630588
[2023-10-12T14:42:46] Data: 49.134 TiB (1.307 MiB per file)

I used --direct from the beginning on this one.

source=/.scratch00/agdl/em1data/
dest=/nobackup2/PROJECTS/orion/gnc/em1data-copy/
dsync_opts='-X non-lustre --batch-files 100000 --direct'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

The scratch filesystem is NFS so this is a an NFS to lustre transfer. Unfortunately, it looks like this one died sometime on Saturday.

[2023-10-14T11:13:45] Copied 262.948 GiB (80%) in 851.497 secs (316.218 MiB/s) 219 secs left ...
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 4679 on node monte exited on signal 9 (Killed).
--------------------------------------------------------------------------

I'm not sure why it died yet - I'll be digging into that some more. However, after poking around in the partially transferred directory structure, it looks like all directories are owned by root. Or at least most of them. Here is an example of the top level directory.

[root@dvicker dsync]# ls -l /.scratch00/agdl/em1data/
total 20381086
drwxrws---  41 root     mpcvdata          57 Mar  6  2023 AR1_MER_GEM
drwxrwxrwx  11 mgualdon cev               11 Mar 21  2023 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwxr-sr-x   2 root     mpcvdata     2206262 Dec  7  2020 dumap_html
-rw-r--r--   1 root     mpcvdata        2782 Dec  7  2020 dumap.html
-rw-r--r--   1 root     mpcvdata  6744041294 Dec  4  2020 du.txt
drwxr-sr-x   8 choltz   mpcvdata           8 Jun 26  2018 EM1_DATA
drwxr-sr-x   5 choltz   mpcvdata           5 Aug 24  2016 EM1_FSW
drwxrws---   5 lmcev    mpcvdata           5 Jul 13  2020 EM1_iVAC
drwxrws---+ 13 choltz   mpcvdata          13 Jan 25  2023 EM1_RUNS
drwxr-sr-x  20 mfrostad mpcvdata          20 Jan 14  2021 EM1_SLS_Traj
drwxrws---   2 idux     mpcvdata           4 Apr 21  2021 EM1_TMT_Database
drwxr-sr-x   6 dheacock mpcvdata           6 Jul 13  2020 EM1_VAC
drwxr-s---   3 jarnold1 em1tppi            3 Aug  8  2019 EM_Flight_Traj_Limited_Rights
drwxr-sr-x  10 choltz   mpcvdata          16 May  3  2016 jfreechart-1.0.19
drwxrwsrwx   4 jmarcy   mpcvdata           4 Aug 10  2022 jmarcy
-rwxrwxrwx   1 jmarcy   mpcvdata      659220 Jan 19  2021 MEP_28ep9_20201215.xml
drwxrwsrwx   2 mstoft   mpcvdata           2 Jan 11  2022 mstoft
drwxrwsr-x+ 10 bawood   mpcvdata          14 May  8 13:58 orion_veras_archive
drwxr-sr-x   3 lgefert  mpcvdata           3 Feb  7  2019 prop_data
drwxrwsrwx  16 jmarcy   mpcvdata          16 Sep 21  2020 ramtares
drwxrwsrwx   6 mstoft   mpcvdata           7 Feb 10  2021 Rel28eP9Eng_Debug
drwxr-sr-x   5 choltz   mpcvdata           6 Nov 27  2017 socgnc_base
drwxr-s---  36 choltz   em1tppi           39 Oct 17  2022 socgnc_releases
drwxr-sr-x   5 choltz   mpcvdata           5 Mar 22  2016 SSW
[root@dvicker dsync]# ls -l /nobackup2/PROJECTS/orion/gnc/em1data-copy/
total 36212336
drwx------  41 root     root            4096 Oct 14 10:16 AR1_MER_GEM
drwx------  11 root     root            4096 Oct 12 14:42 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwx------   2 root     root       180535296 Oct 14 10:08 dumap_html
drwx------   8 root     root            4096 Oct 12 14:42 EM1_DATA
drwx------   5 root     root            4096 Oct 12 14:42 EM1_FSW
drwx------   5 root     root            4096 Oct 12 14:42 EM1_iVAC
drwxrwx---+ 13 root     root            4096 Oct 12 14:42 EM1_RUNS
drwx------  20 root     root            4096 Oct 12 14:42 EM1_SLS_Traj
drwx------   2 root     root            4096 Oct 12 14:42 EM1_TMT_Database
drwx------   6 root     root            4096 Oct 12 14:42 EM1_VAC
drwx------   3 root     root            4096 Oct 12 14:42 EM_Flight_Traj_Limited_Rights
drwx------  10 root     root            4096 Oct 14 10:05 jfreechart-1.0.19
drwx------   4 root     root            4096 Oct 12 14:42 jmarcy
drwx------   2 root     root            4096 Oct 12 14:42 mstoft
drwx------+ 10 root     root            4096 Oct 14 10:05 orion_veras_archive
drwx------   3 root     root            4096 Oct 12 14:42 prop_data
drwx------  16 root     root            4096 Oct 12 14:42 ramtares
drwx------   6 root     root            4096 Oct 12 14:42 Rel28eP9Eng_Debug
drwx------   5 root     root            4096 Oct 12 14:42 socgnc_base
drwx------  36 root     root            4096 Oct 12 18:55 socgnc_releases
drwx------   5 root     root            4096 Oct 12 14:42 SSW
[root@dvicker dsync]# 

Is this expected?

dvickernasa avatar Oct 16 '23 15:10 dvickernasa

A few updates.

The first dsync run of this latest transfer died due to an OOM event on the machine I was running on. I think it was dsync itself that caused the OOM, since this was a pretty isolated machine and nothing else was running there. But I'm not 100% on that. I didn't mention it before, but this is also a single node run.

The sync is restarted and, unfortunately, I'm having similar issues. dsync decided to delete a lot of what it already transferred.

[2023-10-16T11:22:01] Comparing file sizes and modification times of 17155249 items
[2023-10-16T11:22:19] Started   : Oct-16-2023, 11:21:26
[2023-10-16T11:22:19] Completed : Oct-16-2023, 11:22:19
[2023-10-16T11:22:19] Seconds   : 53.312
[2023-10-16T11:22:19] Items     : 17155249
[2023-10-16T11:22:19] Item Rate : 17155249 items in 53.312461 seconds (321786.852653 items/sec)
[2023-10-16T11:22:19] Deleting items from destination
[2023-10-16T11:22:19] Removing 10915151 items

I'm attaching the full output for both the original and the restart for this latest run.

dsync-em1.out.2023-10-16-09-50.txt dsync-em1.out.2023-10-12-12-52.txt

dvickernasa avatar Oct 16 '23 16:10 dvickernasa

Ugh, that's frustrating. I suspect that the timestamps for the files on that first run are still not being set correctly.

It looks like the first run successfully completed a number of 100k file-transfer batches before it died (193 batches to be precise).

[2023-10-14T10:56:04] Copied 19300000 of 44656137 items (43.219%)

So at most it should have had to delete about 100k files, which would be the files just from the batch being worked when the run died. Instead, it thought 11m million files needed to be replaced.

It seems like it's either missing setting the timestamps on some files, or perhaps it's hitting an error and not reporting it. I'll review the code again with an eye for those kinds of problem.

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

Having root ownership left on the directories is expected, because the directory permissions are not updated until all files have been transferred. Since the first run didn't complete, it didn't get a chance to change the ownership on the directories. However, it should have changed ownership and set timestamps on all of those previous files.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

adammoody avatar Oct 16 '23 20:10 adammoody

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

No problem and I will drop this for future runs.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

Yes, I agree. I will see if I can find a small-ish dataset to reproduce this.

dvickernasa avatar Oct 16 '23 21:10 dvickernasa

This will produce some long output, but let's add some debug statements to print info when dsync sets the timestamp on a file.

Can you modify src/common/mfu_io.c to add the printf and preceding formatting lines as shown below and rebuild?

int mfu_utimensat(int dirfd, const char* pathname, const struct timespec times[2], int flags)
{   
    int rc;
    int tries = MFU_IO_TRIES;
retry:
    errno = 0;
    rc = utimensat(dirfd, pathname, times, flags);
    if (rc != 0) {
        if (errno == EINTR || errno == EIO) {
            tries--;
            if (tries > 0) {
                /* sleep a bit before consecutive tries */
                usleep(MFU_IO_USLEEP);
                goto retry;
            }
        }
    }

    //------------------------------
    char access_s[30];
    char modify_s[30]; 
    size_t access_rc = strftime(access_s, sizeof(access_s) - 1, "%FT%T", localtime(&times[0].tv_sec));
    size_t modify_rc = strftime(modify_s, sizeof(modify_s) - 1, "%FT%T", localtime(&times[1].tv_sec));
    printf("%s rc=%d errno=%d atime_sec=%llu atime_ns=%llu mtime_sec=%llu mtime_ns=%llu atime=%s mtime=%s\n",
        pathname, rc, errno, times[0].tv_sec, times[0].tv_nsec, times[1].tv_sec, times[1].tv_nsec, access_s, modify_s);
    //------------------------------

    return rc;
}

Also add an #include <time.h> statement at the top of the file.

Execute a dsync and if you find any files have mismatching timestamps, look for those same paths in the output. Let's verify that a line is actually printed, that the input timestamps are correct, and that the utimensat call returns with rc=0.

BTW, it seems like it's not too hard to find files with mismatched timestamps, but if you need to scan a lot of files manually, we can probably use dcmp to help with that.

No need to copy the actual output here, but let me know what you find.

adammoody avatar Oct 16 '23 23:10 adammoody

Lots of updates....

Code changes

Sure, I can do that. If we keep going down this road, it might be nice to make a branch I can check out and compile against so I'm not making the changes by hand.

Smaller test case

I tried to find a smaller test case. Picked a directory tree with 400k files in 4.8 TB. Did the initial transfer last night in about 2 hours. Repeated the transfer this morning and it only found 1 file to re-transfer. So perhaps this is a better test case to use for testing.

Update on latest large transfer

This is the 40M file case. I had restarted the transfer yesterday but it crashed again just now. If you want me to create a different issue for this, I can. Console output for this crash:

[monte:24626] *** Process received signal ***
[monte:24626] Signal: Segmentation fault (11)
[monte:24626] Signal code: Address not mapped (1)
[monte:24626] Failing at address: 0x80
[monte:24626] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fc689c8c630]
[monte:24626] [ 1] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(+0xb1b7)[0x7fc687c5b1b7]
[monte:24626] [ 2] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_recv_connect_ack+0xbc2)[0x7fc687c5c222]
[monte:24626] [ 3] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/openmpi/mca_oob_tcp.so(+0x69f1)[0x7fc687c569f1]
[monte:24626] [ 4] /software/x86_64/mpi/openmpi-1.8.6-gcc-4.8/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x7a4)[0x7fc68aa16924]
[monte:24626] [ 5] mpiexec[0x404d52]
[monte:24626] [ 6] mpiexec[0x403610]
[monte:24626] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc6898d1555]
[monte:24626] [ 8] mpiexec[0x403529]
[monte:24626] *** End of error message ***
./dsync-em1.sh: line 22: 24626 Segmentation fault      mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

Here are the /var/log/messages for this:

Oct 17 10:58:37 monte kernel: dsync[24639]: segfault at 80 ip 00007f6cd1ec01b7 sp 00007f6cd1cafce0 error 4 in mca_oob_tcp.so[7f6cd1eb5000+13000]
Oct 17 10:58:37 monte kernel: dsync[24637]: segfault at 80 ip 00007ff86669d1b7 sp 00007ff86648cce0 error 4 in mca_oob_tcp.so[7ff866692000+13000]
Oct 17 10:58:37 monte kernel: dsync[24683]: segfault at 80 ip 00007f410599e1b7 sp 00007f410578dce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24680]: segfault at 80 ip 00007ff888bf51b7 sp 00007ff8889e4ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24684]: segfault at 80 ip 00007f3725a891b7 sp 00007f3725878ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24682]: segfault at 80 ip 00007fd6cae651b7 sp 00007fd6cac54ce0 error 4
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7f3725a7e000+13000]
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7ff888bea000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd6cae5a000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7f4105993000+13000]
Oct 17 10:58:37 monte kernel: Pid 24670(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24674(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: dsync[24665]: segfault at 80 ip 00007f5faa6201b7 sp 00007f5faa40fce0 error 4 in mca_oob_tcp.so[7f5faa615000+13000]
Oct 17 10:58:37 monte kernel: dsync[24673]: segfault at 80 ip 00007fd64f05b1b7 sp 00007fd64ee4ace0 error 4
Oct 17 10:58:37 monte kernel: dsync[24669]: segfault at 80 ip 00007fd2215721b7 sp 00007fd221361ce0 error 4
Oct 17 10:58:37 monte kernel: dsync[24647]: segfault at 80 ip 00007fdb7e2401b7 sp 00007fdb7e02fce0 error 4
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd221567000+13000]
Oct 17 10:58:37 monte kernel: 
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fdb7e235000+13000]
Oct 17 10:58:37 monte kernel: in mca_oob_tcp.so[7fd64f050000+13000]
Oct 17 10:58:37 monte kernel: Pid 24634(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24636(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24640(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24651(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Pid 24632(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24644(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24664(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24668(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24660(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24658(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24646(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24666(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24635(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24656(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24654(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24633(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24662(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24631(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24638(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24648(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24652(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24642(dsync) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte kernel: Pid 24626(mpiexec) over core_pipe_limit
Oct 17 10:58:37 monte kernel: Skipping core dump
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24629 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24677 (dsync) of user 0 killed by SIGSEGV - dumping core
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24672 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:37 monte abrt-hook-ccpp: Process 24630 (dsync) of user 0 killed by SIGSEGV - ignoring (repeated crash)
Oct 17 10:58:38 monte kernel: RPC: fragment too large: 612067950
Oct 17 10:59:05 monte abrt-hook-ccpp: Saved core dump of pid 24672 to core.24672 at /home/dvicker/dsync (3906400256 bytes)
Oct 17 10:59:05 monte kernel: RPC: fragment too large: 612067950
Oct 17 10:59:08 monte abrt-hook-ccpp: Saved core dump of pid 24630 to core.24630 at /home/dvicker/dsync (5385498624 bytes)
Oct 17 10:59:11 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Refusing connection from 128.157.254.142: insecure port 40552
Oct 17 10:59:20 monte abrt-hook-ccpp: Saved core dump of pid 24629 to core.24629 at /home/dvicker/dsync (5427765248 bytes)
Oct 17 10:59:21 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Refusing connection from 128.157.254.142: insecure port 34726
Oct 17 10:59:21 monte kernel: LNetError: 3293:0:(acceptor.c:431:lnet_acceptor()) Skipped 9 previous similar messages
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-09-21-11:01:07-18361'
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'vmcore-127.0.0.1-2018-07-12-08:53:30'
Oct 17 10:59:38 monte abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ccpp-2023-07-28-16:59:54-16739'
Oct 17 10:59:38 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:39 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:39 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:40 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:40 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:41 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:41 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:42 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:42 monte abrt-server: Generating core_backtrace
Oct 17 10:59:45 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:25-23722-0' is not a problem directory
Oct 17 10:59:45 monte abrt-server: '/var/spool/abrt/oops-2018-08-06-08:55:53-23986-0' is not a problem directory
Oct 17 10:59:45 monte abrt-server: Email address of sender was not specified. Would you like to do so now? If not, 'user@localhost' is to be used [y/N] 
Oct 17 10:59:45 monte abrt-server: Email address of receiver was not specified. Would you like to do so now? If not, 'root@localhost' is to be used [y/N] 
Oct 17 10:59:45 monte abrt-server: Undefined variable outside of [[ ]] bracket
Oct 17 10:59:45 monte abrt-server: Sending an email...
Oct 17 10:59:45 monte abrt-server: Sending a notification email to: root@localhost
Oct 17 10:59:45 monte abrt-server: Email was sent to: root@localhost

This created 4 core files if you want them. This is different than the last crash, which was an OOM event.

I don't want to restart this sync again until I have confidence that its not going to delete a bunch of files again. So lets do some debugging on some smaller cases first.

dvickernasa avatar Oct 17 '23 16:10 dvickernasa

Good idea regarding the branch. I've got one with the printf added here:

https://github.com/hpc/mpifileutils/tree/debugtimestamps

I don't want to restart this sync again until I have confidence that its not going to delete a bunch of files again. So lets do some debugging on some smaller cases first.

Yes, that makes total sense. I'm hoping the smaller test case works, as well.

If we can get a stack trace from the core files, that would be helpful.

Do you get something with gdb?

adammoody avatar Oct 17 '23 18:10 adammoody