Add Flag to Automatically Exclude .gitignore

Open ArmanJR opened this issue 8 months ago • 10 comments

This pull request introduces a new CLI flag (--use-gitignore) that enhances Gitingest by automatically loading and applying ignore patterns from all .gitignore files found in the target repository or directory. When enabled, files and directories matching any pattern specified in any .gitignore are excluded from the generated text digest.

Key Changes:

  • CLI Update:

    • Modified src/gitingest/cli.py to add a new option --use-gitignore that accepts a boolean value.
    • Updated the main() function to pass the new flag to the asynchronous ingestion entry point.
  • Ingestion Entry Point:

    • Updated src/gitingest/entrypoint.py to include a new parameter use_gitignore.
    • Integrated a call to the new helper function load_gitignore_patterns() (from src/gitingest/utils/ignore_patterns.py) to update the query’s ignore patterns with all patterns extracted from .gitignore files.
  • Gitignore Loader:

    • Implemented load_gitignore_patterns() in src/gitingest/utils/ignore_patterns.py, which recursively searches for .gitignore files starting from the repository root and aggregates their ignore patterns.
    • Added comprehensive docstrings to the loader function to adhere to coding style guidelines.
  • Testing:

    • Created new tests in tests/test_gitignore_feature.py to verify that:
      • With --use-gitignore enabled, files matching .gitignore patterns are excluded from the digest.
      • Without the flag, all files are included.
    • Fixed linting and formatting issues in tests and source files.

This feature provides a seamless way to respect repository-level ignore rules, ensuring that the generated digest is more relevant for ingestion by large language models. It improves usability by reducing the need for manual pattern exclusions and aligns the tool’s behavior more closely with Git’s own ignore logic.
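For readers who want a feel for the loader described under "Gitignore Loader", here is a minimal sketch. It is not the PR's actual code: the function body, the return type, and the choice to prefix nested patterns with their directory are all assumptions, and negation lines (`!pattern`) are simply skipped.

```python
from pathlib import Path


def load_gitignore_patterns(root: Path) -> set[str]:
    """Collect ignore patterns from every .gitignore under ``root``.

    Hypothetical sketch only; the real helper in the PR may behave differently.
    """
    patterns: set[str] = set()
    for gitignore in root.rglob(".gitignore"):
        if not gitignore.is_file():
            continue
        # Directory holding this .gitignore, relative to root ("." for root).
        rel_dir = gitignore.parent.relative_to(root).as_posix()
        prefix = "" if rel_dir == "." else f"{rel_dir}/"
        for raw in gitignore.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blanks and comments; negations are out of scope here.
            if not line or line.startswith("#") or line.startswith("!"):
                continue
            # Assumption: scope nested patterns by prefixing their directory.
            patterns.add(prefix + line.lstrip("/"))
    return patterns
```

Flattening everything into one set loses Git's per-directory precedence rules, which is one reason a dedicated matcher may be preferable to hand-rolled aggregation.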

ArmanJR avatar Apr 05 '25 17:04 ArmanJR

@ArmanJR Thanks for your contribution. I've looked at the code and it looks OK, though I would still have to run some tests myself.

To merge this, I think we would need to reflect those changes in the front-end so gitingest keeps 1:1 feature parity no matter where it is accessed from.

Here's a rough sketch of how it could look: [image]

Do you think you can handle this or do you want us to help you with that?

cyclotruc avatar Apr 09 '25 13:04 cyclotruc

I'll try :)

ArmanJR avatar Apr 10 '25 16:04 ArmanJR

Very useful feature! I’d suggest enabling it by default and renaming the flag to something like --no-gitignore to ensure a safer default behavior.
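A minimal sketch of what that inverted flag could look like, using the standard library's argparse purely for illustration. The parser name and the `source` argument are assumptions; this is not gitingest's actual cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-alone parser, not the project's real CLI.
    parser = argparse.ArgumentParser(prog="gitingest-sketch")
    parser.add_argument("source", nargs="?", default=".")
    parser.add_argument(
        "--no-gitignore",
        dest="use_gitignore",
        action="store_false",  # store_false makes the default True
        help="Include files that .gitignore would normally exclude.",
    )
    return parser


args = build_parser().parse_args(["--no-gitignore"])
print(args.use_gitignore)  # False
```

With no flag given, `use_gitignore` stays True, so ignored files are skipped unless the user explicitly asks for them.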

neerax avatar Apr 15 '25 09:04 neerax

@cyclotruc I believe a checkbox in the UI would be redundant, since files listed in .gitignore are normally never pushed to a GitHub repo in the first place. The intent of my PR is to make it possible to ignore those files when running gitingest on a local repo via the CLI.

ArmanJR avatar Apr 18 '25 19:04 ArmanJR

@cyclotruc bump!

ArmanJR avatar May 22 '25 00:05 ArmanJR

Hi, sorry for the delay. I've been busy but will come back to this soon. Thanks again for your patience.

cyclotruc avatar May 22 '25 02:05 cyclotruc

This will be a great feature. For me I expected this behaviour by default and was surprised to find some credentials in the digest.txt after running on a local project.

zazencodes avatar May 22 '25 13:05 zazencodes

Also, there is another issue when running gitingest ./ locally. When you run it a second time, it includes the previous digest.txt in the new digest file:

~/code/Go/sandbox  10:39:58 AM
❯ echo "boz hi" > main.go
~/code/Go/sandbox  10:40:18 AM
❯ cat main.go
boz hi
~/code/Go/sandbox  10:40:21 AM
❯ gitingest ./
Analysis complete! Output written to: digest.txt

Summary:
Repository: ./
Files analyzed: 1

Estimated tokens: 29
❯ cat digest.txt
Directory structure:
└── .//
    └── main.go

================================================
File: /main.go
================================================
boz hi

~/code/Go/sandbox  10:40:32 AM
❯ gitingest ./
Analysis complete! Output written to: digest.txt

Summary:
Repository: ./
Files analyzed: 2

Estimated tokens: 73
~/code/Go/sandbox  10:40:37 AM
❯ cat digest.txt
Directory structure:
└── .//
    ├── digest.txt
    └── main.go

================================================
File: /digest.txt
================================================
Directory structure:
└── .//
    └── main.go

================================================
File: /main.go
================================================
boz hi




================================================
File: /main.go
================================================
boz hi

Since users usually don't double-check the contents of digest.txt, I believe it should be ignored by default.
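Roughly what I have in mind, as a sketch only: the file walker below is a stand-in I wrote for illustration (the names `collect_files` and `DEFAULT_OUTPUT_NAME` are made up, not gitingest internals), and it simply adds the output file name to the ignore patterns before scanning.

```python
from fnmatch import fnmatch
from pathlib import Path

DEFAULT_OUTPUT_NAME = "digest.txt"  # output name seen in the transcript


def collect_files(
    root: Path,
    ignore_patterns: set[str],
    output_name: str = DEFAULT_OUTPUT_NAME,
) -> list[str]:
    """Return relative paths under ``root``, always skipping the output file."""
    patterns = set(ignore_patterns) | {output_name}
    kept: list[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Match against both the relative path and the bare file name.
        if any(fnmatch(rel, p) or fnmatch(path.name, p) for p in patterns):
            continue
        kept.append(rel)
    return kept
```

With this in place, a second run over the sandbox directory would again report only main.go, because the earlier digest.txt is filtered out before analysis.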

ArmanJR avatar May 22 '25 14:05 ArmanJR

@ArmanJR The tests are failing. Can you have a look at them?

filipchristiansen avatar Jun 21 '25 23:06 filipchristiansen

@ArmanJR Thank you for the contribution. We're interested in this feature. Do you want to continue working on it, or should we take it from here? Happy to help if you want to finish it.

cyclotruc avatar Jun 22 '25 23:06 cyclotruc

Of course, I'd be happy to help. Is there anything else that should be implemented?

ArmanJR avatar Jun 23 '25 02:06 ArmanJR

Suggestion on flag semantics & naming

  1. Respect .gitignore by default. Most developer-facing CLIs (ripgrep, fd, etc.) do this because it’s safer (no secrets or build artefacts leak).

  2. Invert the flag so users opt in when they genuinely want the extra noise. Two workable spellings:

    • --no-gitignore (mirrors ripgrep --no-ignore)
    • --include-gitignored (reads like “please pull in the files that are normally ignored”)

  3. Whatever name we choose, respecting .gitignore should be the default (True), so existing scripts keep working and only the rare cases need the override:

    # normal – git-ignored files skipped
    gitingest
    
    # exceptional – include everything
    gitingest --no-gitignore
    
  4. Implementation note: using the pathspec library would give us full Git-wildmatch coverage (negations, **, order-aware precedence) practically for free. We might also want to reimplement the _should_include and _should_exclude functions on top of pathspec, since they currently use fnmatch.
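To illustrate what "order-aware precedence" with negations means in point 4, here is a small standard-library sketch. It is deliberately not pathspec and not the project's code: `is_ignored` is a made-up helper, and fnmatch stands in for real Git wildmatch, so `**` and directory anchoring are not modelled. The only thing it shows is the rule that the last matching pattern wins.

```python
from fnmatch import fnmatch


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Return True if ``path`` is ignored; the last matching pattern decides."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatch(path, glob):
            # A later match overrides any earlier decision.
            ignored = not negated
    return ignored


patterns = ["*.log", "!keep.log"]
print(is_ignored("debug.log", patterns))  # True
print(is_ignored("keep.log", patterns))   # False
```

Reversing the list to `["!keep.log", "*.log"]` makes keep.log ignored again, which is why a plain any-pattern-matches check cannot express these rules and a dedicated matcher is attractive.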

filipchristiansen avatar Jun 23 '25 09:06 filipchristiansen

Thanks a lot @ArmanJR!

Just a follow-up question: what was the reason for requiring pathspec>=0.12.1 specifically?

filipchristiansen avatar Jun 25 '25 11:06 filipchristiansen

@filipchristiansen You mean why 0.12.1? I think 0.12.0 had a bug.

ArmanJR avatar Jun 25 '25 20:06 ArmanJR