
fix: prevent response file overwrite when -sd flag is used

Open · jjhwan-h opened this pull request 5 months ago • 2 comments

Summary

This PR addresses issue #2152: response files are overwritten even when the -sd (SkipDedupe) flag is used.

What was the issue?

When the -sd flag is enabled, input targets are processed without deduplication. However, the responses are still written to the same file path based on the SHA1 hash of the URL, leading to overwritten output files.

What’s changed?

  • When SkipDedupe is enabled and a response file with the same name already exists, the response will be written to a new file with a suffix (_1, _2, ...).
  • Repeated input targets are now counted using HybridMap, and the number of processing iterations is determined based on this count.
  • Modified countTargetFromRawTarget to return a known duplicateTargetErr when deduplication is disabled.
  • Refactored response writing logic in analyze and process to ensure unique file writes under concurrency.

Why is this useful?

This change:

  • Accurately tracks and stores multiple response outputs for identical input targets.
  • Prevents unintentional data loss from file overwrites.
  • Honors the intent behind the -sd flag.

Result

# test.txt
localhost:8000
localhost:9000
localhost:8000
localhost:9000
localhost:8000
localhost:9000
localhost:8000
localhost:9000
$ ./httpx/httpx -l test.txt -stream -skip-dedupe -sr
$ tree output/response/
output/response/
├── index.txt
├── localhost_8000
│   ├── 59bd7616010ed02cd66f44e94e9368776966fe3b.txt
│   ├── 59bd7616010ed02cd66f44e94e9368776966fe3b_1.txt
│   ├── 59bd7616010ed02cd66f44e94e9368776966fe3b_2.txt
│   └── 59bd7616010ed02cd66f44e94e9368776966fe3b_3.txt
└── localhost_9000
    ├── 981d6875d791d0a1a28393b5ec62d61dff1e977f.txt
    ├── 981d6875d791d0a1a28393b5ec62d61dff1e977f_1.txt
    ├── 981d6875d791d0a1a28393b5ec62d61dff1e977f_2.txt
    └── 981d6875d791d0a1a28393b5ec62d61dff1e977f_3.txt

2 directories, 9 files
$ ./httpx/httpx -l test.txt -skip-dedupe -sr
$ tree output/response/
output/response/
├── index.txt
├── localhost_8000
│   ├── 59bd7616010ed02cd66f44e94e9368776966fe3b.txt
│   ├── 59bd7616010ed02cd66f44e94e9368776966fe3b_1.txt
│   ├── 59bd7616010ed02cd66f44e94e9368776966fe3b_2.txt
│   └── 59bd7616010ed02cd66f44e94e9368776966fe3b_3.txt
└── localhost_9000
    ├── 981d6875d791d0a1a28393b5ec62d61dff1e977f.txt
    ├── 981d6875d791d0a1a28393b5ec62d61dff1e977f_1.txt
    ├── 981d6875d791d0a1a28393b5ec62d61dff1e977f_2.txt
    └── 981d6875d791d0a1a28393b5ec62d61dff1e977f_3.txt

2 directories, 9 files
$ ./httpx/httpx -l test.txt -sr
$ tree output/
output/
└── response
    ├── index.txt
    ├── localhost_8000
    │   └── 59bd7616010ed02cd66f44e94e9368776966fe3b.txt
    └── localhost_9000
        └── 981d6875d791d0a1a28393b5ec62d61dff1e977f.txt

3 directories, 3 files
$ ./httpx/httpx -l test.txt -stream -sr
$ tree output/
output/
└── response
    ├── index.txt
    ├── localhost_8000
    │   └── 59bd7616010ed02cd66f44e94e9368776966fe3b.txt
    └── localhost_9000
        └── 981d6875d791d0a1a28393b5ec62d61dff1e977f.txt

3 directories, 3 files

Related issue

Closes #2152

Summary by CodeRabbit

  • New Features

    • Duplicate targets are now explicitly detected and — when configured — processed multiple times.
    • Response saving uses exclusive creation and auto-increments filenames on collisions to ensure uniqueness; screenshot saving unchanged.
  • Bug Fixes

    • Improved duplicate-target counting and propagation to ensure accurate processing.
    • Removed on-the-fly response saving from the primary output path to reduce conflicts.
  • Refactor

    • Per-target processing reworked to honor per-target counts and stream mode.
  • Tests

    • Updated tests to verify duplicate detection and handling.

jjhwan-h avatar Jul 31 '25 06:07 jjhwan-h

Walkthrough

Adds explicit duplicate detection and counting for input targets, propagates duplicates when deduplication is disabled, refactors per-target processing into a repeatable runner that executes per-target count times, changes response-saving to exclusive-create with incremental suffixes to avoid overwrites, and updates tests to assert the new duplicate error behavior.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **Duplicate detection & counting**<br>`runner/runner.go` | Introduced `duplicateTargetErr`; `countTargetFromRawTarget` returns `(0, duplicateTargetErr)` for existing targets; `prepareInput` and `loadAndCloseFile` now increment stored counts and total targets when `Options.SkipDedupe` is set. |
| **Per-target processing flow**<br>`runner/runner.go` | Refactored `RunEnumeration` to use a `runProcess(times)` helper; non-stream targets are executed according to stored per-target counts; stream mode runs once. |
| **Response file persistence & naming**<br>`runner/runner.go` | Moved response-saving out of the main overwrite path; response files are created with exclusive-create (`O_EXCL`) and, on name collisions, retried with incrementing suffixes (`_1`, `_2`, ...) before `.txt` to guarantee unique filenames. |
| **Tests & error handling**<br>`runner/runner_test.go` | Added `github.com/pkg/errors` import; test initializes `Options.SkipDedupe` appropriately; duplicate assertions updated to `errors.Is(err, duplicateTargetErr)`. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Runner
    participant FS as FileSystem

    Note over Runner: Input parsing & counting
    User->>Runner: supply targets (may include duplicates) + options
    Runner->>Runner: countTargetFromRawTarget(raw)
    alt target exists
        Runner-->>Runner: returns duplicateTargetErr
        alt SkipDedupe enabled
            Runner->>Runner: increment stored count for target
        else
            Runner->>Runner: ignore duplicate (no count increase)
        end
    else new target
        Runner->>Runner: store target with count=1
    end

    Note over Runner,FS: Processing phase (per-target repeats)
    loop i = 1..count
        Runner->>Runner: runProcess(1) — perform enumeration for target
        Runner->>FS: attempt create response file (O_EXCL)
        alt creation collision
            FS->>FS: compute filename with suffix (_1, _2, ...)
            FS->>Runner: return newly created file
        else success
            FS->>Runner: file created
        end
    end
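
The processing phase in the diagram reduces to a repeat-per-count dispatch. A minimal sketch, where `runProcess` is a stand-in for the PR's helper (the real one performs the HTTP probe and response saving):

```go
package main

import "fmt"

// runProcess stands in for the per-target enumeration helper and
// returns how many runs it performed.
func runProcess(target string, times int) int {
	runs := 0
	for i := 1; i <= times; i++ {
		fmt.Printf("probe %s (run %d/%d)\n", target, i, times)
		runs++
	}
	return runs
}

func main() {
	// Stored per-target counts, as produced by the counting phase above.
	targets := []struct {
		host  string
		count int
	}{
		{"localhost:8000", 4},
		{"localhost:9000", 4},
	}
	for _, t := range targets {
		runProcess(t.host, t.count) // non-stream: honor per-target count
	}
	// Stream mode bypasses the stored counts and would instead call
	// runProcess(host, 1) once per input line.
}
```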

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble duplicates, one by one,
Count each hop till counting's done.
Files get tails when names collide,
No more overwrites — bounce with pride.
A rabbit cheers for safer strides 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "fix: prevent response file overwrite when -sd flag is used" directly and concisely summarizes the main objective of the pull request. The title clearly identifies the problem being addressed (preventing response file overwrites) and the specific context (when the -sd flag, which enables SkipDedupe, is used). This matches the core fix described in the PR objectives and directly relates to the changes in runner.go that implement duplicate target counting and unique filename generation to prevent overwrites. The title is specific, clear, and avoids vague terminology.
Linked Issues Check ✅ Passed The code changes directly implement the requirements from linked issue #2152, which asks that repeated input targets with the -sd (SkipDedupe) flag should produce multiple response files instead of overwriting a single file. The implementation includes a sentinel duplicateTargetErr to detect duplicate targets when SkipDedupe is enabled, counting of repeated targets via a map structure, refactored RunEnumeration logic to process targets multiple times based on their count, and unique filename generation with incrementing suffixes (_1, _2, etc.) to prevent file collisions. These changes collectively ensure that when -sd is used, each processed instance of a repeated input target produces its own response file with a unique name, matching the expected behavior shown in issue #2152.
Out of Scope Changes Check ✅ Passed The code changes appear focused on the scope of issue #2152. While the PR includes refactoring of response-writing logic and removal of on-the-fly HTTP response saving from the primary output path, these changes are explicitly mentioned in the PR objectives as necessary structural changes to ensure thread-safe unique file writing under concurrency. This refactoring appears intentional and directly supports the fix by moving response-writing to a centralized location where duplicate handling and unique naming logic can be properly applied. The PR objectives confirm this removal is part of the broader refactoring to prevent overwrites, and note that screenshot saving behavior is preserved, indicating the changes are deliberate and scoped to solving the core issue.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd4dc827b252e4819a6c7ac456fee90be68ad1ab and a3212c6c0c1fb7967704445eb794804a6f8dda13.

📒 Files selected for processing (1)
  • runner/runner.go (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Functional Test (ubuntu-latest)
  • GitHub Check: Analyze (go)
  • GitHub Check: Lint Test
🔇 Additional comments (4)
runner/runner.go (4)

480-480: LGTM: Sentinel error for duplicate detection.

The sentinel error is well-named and follows standard Go error handling patterns.


674-677: LGTM: Explicit duplicate detection.

Returning duplicateTargetErr when a target already exists in the map enables callers to distinguish duplicates from other errors, which is essential for the SkipDedupe feature.


2235-2258: LGTM: Exclusive file creation with unique suffixes.

The exclusive creation logic using os.O_EXCL correctly prevents race conditions and overwrites. The incrementing suffix approach ensures each duplicate target produces a unique response file, which is the core requirement of this PR.

The infinite loop is acceptable here since:

  • Non-exist errors break immediately (line 2255)
  • The number of duplicates is finite and controlled by user input
  • The atomic O_EXCL operation makes this concurrency-safe

1316-1338: No issues found. The original review comment is incorrect.

The code is safe from the concern raised. In non-stream mode (line 1349), r.hm.Scan() iterates only over keys that exist in the map, so the r.hm.Get(k) lookup at line 1332 is guaranteed to succeed. In stream mode, the map is not consulted—runProcess(1) runs directly. The suggested defensive fallback is unnecessary and would introduce unreachable code.

Likely an incorrect or invalid review comment.



coderabbitai[bot] avatar Jul 31 '25 06:07 coderabbitai[bot]

Should the entries in the index file match the files under the output/response/ directory?

Currently, it looks like a new index file is created for every request. As a result, when requests are repeated, the index file and the files inside the response/ folder become inconsistent.

Round 1

output/response/
├── index.txt
└── localhost_8000
    ├── 59bd7616010ed02cd66f44e94e9368776966fe3b.txt
    └── 59bd7616010ed02cd66f44e94e9368776966fe3b_1.txt

#index.txt
/discovery/httpx/output/response/localhost_8000/59bd7616010ed02cd66f44e94e9368776966fe3b.txt http://localhost:8000 (200 OK)
/discovery/httpx/output/response/localhost_8000/59bd7616010ed02cd66f44e94e9368776966fe3b_1.txt http://localhost:8000 (200 OK)

Round 2

output/response/
├── index.txt
└── localhost_8000
    ├── 59bd7616010ed02cd66f44e94e9368776966fe3b.txt
    ├── 59bd7616010ed02cd66f44e94e9368776966fe3b_1.txt
    ├── 59bd7616010ed02cd66f44e94e9368776966fe3b_2.txt
    └── 59bd7616010ed02cd66f44e94e9368776966fe3b_3.txt

#index.txt
/discovery/httpx/output/response/localhost_8000/59bd7616010ed02cd66f44e94e9368776966fe3b_2.txt http://localhost:8000 (200 OK)
/discovery/httpx/output/response/localhost_8000/59bd7616010ed02cd66f44e94e9368776966fe3b_3.txt http://localhost:8000 (200 OK)

jjhwan-h avatar Oct 24 '25 06:10 jjhwan-h