
LTP rwtest04: mmap-write + msync shows partial data visibility (page tear) on immediate verify-read

Open clinx opened this issue 4 weeks ago • 2 comments

Environment

  • Workload: LTP fs/rwtest04
  • Reproducer:
    • Command: /opt/ltp/runltp -Q -b /root/lin/mnt002 -f fs -s rwtest04 -d /root/lin/mnt002
    • File: /root/lin/mnt002/ltp-*/mm-sync-<pid>
  • Client: FUSE-based file I/O (verify uses O_RDONLY)
  • Observation time: 2025-11-12 17:12:45.xxx

Expected behavior

After memcpy to a MAP_SHARED mapping and msync(..., MS_SYNC) returns 0, a verification read of the same range should observe fully updated data.

Actual behavior

Immediate verify-read returns a mix of old and new data for the same range (data comparison error). Logs show:

  • Old 12KB content was persisted just before the mmap memcpy
  • The subsequent msync appears to have flushed only the first 4KB page
  • The verify-read hits cached data prepared earlier by a window-aligned backend read, so its syscall time is very short and “doesn’t align” with the backend request timing

User-space timeline (doio)

  • mmap-write memcpy (offset 4442856, len 9927):
doio_debug_log("mmap-%s memcpy done: fd=%d offset=%d copied=%d memaddr=0x%lx duration_ns=%lld\n",
               (sysc->sy_flags & SY_WRITE) ? "write" : "read",
               fd, req->r_data.io.r_offset, req->r_data.io.r_nbytes,
               (unsigned long)memaddr, ...);
[2025-11-12 17:12:45.977696142] mmap-write memcpy done: fd=3 offset=4442856 copied=9927 memaddr=0x7f6663c3cae8 duration_ns=9430449
  • msync start and completion:
doio_debug_log("msync: fd=%d offset=%d base=0x%lx len=%d flags=MS_SYNC\n",
               fd, req->r_data.io.r_offset, (unsigned long)fdc->c_memaddr,
               (int)sbuf.st_size);
msret = msync(fdc->c_memaddr, (int)sbuf.st_size, MS_SYNC);
doio_debug_log("msync: fd=%d offset=%d ret=%d errno=%d duration_ns=%lld\n", ...);
[2025-11-12 17:12:45.977715758] msync: fd=3 offset=4442856 base=0x7f6663800000 len=12800000 flags=MS_SYNC
[2025-11-12 17:12:45.982667206] msync: fd=3 offset=4442856 ret=0 errno=0 duration_ns=4937949
  • verify-read (independent O_RDONLY fd) right after msync:
doio_debug_log("verify: fd=%d file=%s offset=%d len=%d\n", fd, file, offset, length);
doio_debug_log("verify-lseek: fd=%d offset=%d ret=%d errno=%d duration_ns=%lld\n", ...);
doio_debug_log("verify-read: fd=%d offset=%d buf=0x%lx len=%d ret=%d errno=%d duration_ns=%lld\n", ...);
[2025-11-12 17:12:45.982712405] verify: fd=4 file=/root/lin/mnt002/ltp-np5plAMCXo/mm-sync-107502 offset=4442856 len=9927
[2025-11-12 17:12:45.982726786] verify-lseek: fd=4 offset=4442856 ret=4442856 errno=0 duration_ns=1226
[2025-11-12 17:12:45.982805728] verify-read: fd=4 offset=4442856 buf=0x1a85c70 len=9927 ret=9927 errno=0 duration_ns=65796
  • Data comparison error (mixed old/new):
*** DATA COMPARISON ERROR ***
check_file(..., 4442856, 9927, I:107540:ip-10-52-148-199:doio*, 31, 0) failed

Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 4442856
    1st 32 expected bytes:  I:107540:ip-10-52-148-199:doio*I
    1st 32 actual bytes:    ........................9:doio*I

Server/FUSE-side timeline (backend I/O)

  • Prior to the memcpy, a 12KB write of old data was persisted:
2025/11/12 17:12:45.968617 [DEBUG] file.go:425: TRACE Write enter: ino(8508829) ... offset(4440064) len(12288) flags(WriteCache) ...
2025/11/12 17:12:45.977674 [DEBUG] file.go:532: TRACE Write: ino(8508829) offset(4440064) len(12288) ... (8999822)ns
  • The read path issues window-aligned 128KB requests, which may start before the user-requested offset; hole handling logs:
2025/11/12 17:12:45.968765 [DEBUG] file.go:357: TRACE Read enter: ... offset(4395008) reqsize(131072) ... ffl=OpenReadOnly+OpenSync)
2025/11/12 17:12:45.982158 [DEBUG] stream_reader.go:189: ... req FileOffset(4395008) Size(45056) ExtentKey(<nil>)
2025/11/12 17:12:45.982179 [DEBUG] stream_reader.go:206: Stream read hole: ino(8508829) req(FileOffset(4395008) Size(45056) ExtentKey(<nil>)) total(45056)
  • The 12KB region [4440064, 4452352) is delivered before verify-read:
2025/11/12 17:12:45.982426 [DEBUG] extent_reader.go:114: ExtentReader Read exit: req(FileOffset(4440064) Size(12288) ...) ... readBytes(12288) err(<nil>)
  • Only a single 4KB overwrite (new data) is observed right after:
2025/11/12 17:12:45.982545 [DEBUG] stream_conn.go:228: sendToConn exit: ... Op(OpRandomWrite) ... FileOffset(4440064) Size(4096) ...
2025/11/12 17:12:45.982569 [DEBUG] stream_writer.go:436: Streamer write exit: ... offset(4440064) size(4096) done total(4096) err(<nil>)
  • A separate example where a verify-read at a higher offset still triggers an earlier aligned window:
[2025-11-12 17:12:45.968535747] verify-read: fd=4 offset=5072258 ... len=124935 ret=124935 ... duration_ns=1880012
2025/11/12 17:12:45.968765 [DEBUG] file.go:357: TRACE Read enter: ... offset(4395008) reqsize(131072) ...

Why this looks like a “page tear”

  • Page size is 4KB. The user-space mmap write spans S=4442856, L=9927, so the inclusive end is E=S+L-1=4452782.
  • Spans 4 pages:
    • P0 [4440064, 4444160) → tail 1304B
    • P1 [4444160, 4448256) → full 4096B
    • P2 [4448256, 4452352) → full 4096B
    • P3 [4452352, 4456448) → head 431B
  • Before memcpy, old data for [4440064, 4452352) (P0–P2) was persisted.
  • The msync window that followed only resulted in a single 4KB overwrite (P0), while P1/P2/P3 were not updated in that window.
  • The immediate verify-read then observed “new P0 tail + old P1/P2 + old P3 head”, matching the compare output where a long run of '.' (old) is followed by the new tail “9:doio*I”.

Why verify-read time “doesn’t align” with backend time

  • check_file() uses a separate O_RDONLY fd and does a single lseek+read (not reading the writer’s mmap). Its short syscall time reflects copying data from cache.
  • The backend read had already populated the needed 12KB into cache before verify-read was issued, due to window-aligned prefetch/extent reads.

Impact

  • Immediate data verification after msync(MS_SYNC) can read partially updated content (mixed old/new) for the same range, causing false data corruption failures in mmap workloads.

What might be happening

  • Race between:
    • Background/streamed writeback of old content (already completed for P0–P2 before memcpy)
    • The subsequent msync only resulting in one 4KB flush (observed)
  • Page-granular flushing and request coalescing lead to only a subset of dirty pages being persisted in that specific window, making an immediate read see a torn range.

Requests for maintainers

  • Confirm semantics: After msync(..., MS_SYNC) on MAP_SHARED, should CubeFS guarantee the full dirty range is persisted atomically to the backend before returning?
  • Investigate:
    • Ensuring msync-triggered flush covers the entire dirty range intersected, not just a single page
    • Coordination between FUSE writeback and mmap msync to avoid persisting stale data that races with a newer mmap write
    • Read path’s window alignment is expected, but verify that it cannot cache stale data over freshly dirtied pages post-msync
  • If full atomicity is not guaranteed, document recommended barriers/flags for mmap writers to ensure strong persistence before verification.

Possible workarounds (for the test)

  • Delay verification slightly after msync to allow remaining pages to flush
  • Verify on page-aligned, page-sized boundaries or use fsync on the fd after msync
  • Avoid immediate verification of sub-page tails/heads that straddle pages

clinx, Nov 13 '25 06:11