pr-agent icon indicating copy to clipboard operation
pr-agent copied to clipboard

Improved lookup performance by converting 'bad_extensions' from a list to a set

Open dst03106 opened this issue 11 months ago • 3 comments

User description

  • Modified the data structure for bad_extensions (currently more than 58 items) from List to Set
  • Implemented caching for the bad_extensions Set
  • Improved time complexity for file extension lookup from O(n) to O(1)
def is_valid_file(filename:str, bad_extensions=None) -> bool:
    ...
    return filename.split('.')[-1] not in bad_extensions # <-- lookup 
image

PR Type

Enhancement


Description

  • Converted bad_extensions from list to cached set for O(1) lookups

  • Added a helper function to manage and cache the set

  • Updated file filtering and validation logic to use the cached set


Changes walkthrough 📝

Relevant files
Enhancement
language_handler.py
Refactor bad extension logic to use cached set                     

pr_agent/algo/language_handler.py

  • Introduced _get_bad_extensions_set to cache and manage bad extensions
    as a set
  • Replaced list-based extension checks with set-based checks for
    efficiency
  • Updated filter_bad_extensions and is_valid_file to use the cached set
  • Improved performance of file extension lookups from O(n) to O(1)
  • +14/-7   

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • dst03106 avatar May 03 '25 06:05 dst03106

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Missing Type Hint

    The filter_bad_extensions function is missing type hints for its parameter and return value, which would improve code consistency since other functions have type annotations.

    def filter_bad_extensions(files):
        # Bad Extensions, source: https://github.com/EleutherAI/github-downloader/blob/345e7c4cbb9e0dc8a0615fd995a08bf9d73b3fe6/download_repo_text.py  # noqa: E501
        bad_extensions = _get_bad_extensions_set()
        return [f for f in files if f.filename is not None and is_valid_file(f.filename, bad_extensions)]
    
    Potential Race Condition

    The global cache implementation might cause issues in multi-threaded environments as there's no synchronization mechanism when checking and updating the cache.

    def _get_bad_extensions_set() -> set[str]:
        global _cached_bad_extensions_set
    
        if _cached_bad_extensions_set is None:
            _cached_bad_extensions_set = set(get_settings().bad_extensions.default)
            if get_settings().config.use_extra_bad_extensions:
                _cached_bad_extensions_set.update(set(get_settings().bad_extensions.extra))
        return _cached_bad_extensions_set
    

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Cache constant data

    The auto_generated_files list is recreated on every function call, which is
    inefficient. Consider moving this list outside the function as a constant or
    using a cached set similar to the bad extensions approach.

    pr_agent/algo/language_handler.py [25-32]

    +_AUTO_GENERATED_FILES = ['package-lock.json', 'yarn.lock', 'composer.lock', 'Gemfile.lock', 'poetry.lock']
    +
     def is_valid_file(filename:str, bad_extensions=None) -> bool:
         if not filename:
             return False
         if not bad_extensions:
             bad_extensions = _get_bad_extensions_set()
    -    auto_generated_files = ['package-lock.json', 'yarn.lock', 'composer.lock', 'Gemfile.lock', 'poetry.lock']
    -    for forbidden_file in auto_generated_files:
    +    for forbidden_file in _AUTO_GENERATED_FILES:
             if filename.endswith(forbidden_file):
    
    • [ ] Apply this suggestion <!-- /improve --apply_suggestion=0 -->
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion correctly points out that the auto_generated_files list is recreated on each call to is_valid_file. Moving it outside the function as a constant improves efficiency and aligns with good coding practices.

    Low
    Learned
    best practice
    Add null safety checks

    Add null safety checks for get_settings().bad_extensions.default and
    get_settings().bad_extensions.extra to prevent potential runtime errors if these
    values are None or not properly initialized.

    pr_agent/algo/language_handler.py [9-16]

     def _get_bad_extensions_set() -> set[str]:
         global _cached_bad_extensions_set
     
         if _cached_bad_extensions_set is None:
    -        _cached_bad_extensions_set = set(get_settings().bad_extensions.default)
    -        if get_settings().config.use_extra_bad_extensions:
    -            _cached_bad_extensions_set.update(set(get_settings().bad_extensions.extra))
    +        settings = get_settings()
    +        default_extensions = settings.bad_extensions.default or []
    +        _cached_bad_extensions_set = set(default_extensions)
    +        if settings.config.use_extra_bad_extensions:
    +            extra_extensions = settings.bad_extensions.extra or []
    +            _cached_bad_extensions_set.update(set(extra_extensions))
         return _cached_bad_extensions_set
    
    • [ ] Apply this suggestion <!-- /improve --apply_suggestion=1 -->
    Suggestion importance[1-10]: 6
    Low
    • [ ] More <!-- /improve --more_suggestions=true -->
    • [ ] Author self-review: I have reviewed the PR code suggestions, and addressed the relevant ones.

    Hi @dst03106

    Global variable is a bad practice. It's ill-suited for a multi-thread environment, and can create multiple problems.

    The optimization presented in this PR is minor, and doesn't justify a global variable.

    mrT23 avatar May 04 '25 13:05 mrT23

    @mrT23 Thanks for the feedback!

    From what I understood, it seemed like the context was using asyncio, so I didn’t consider multi-threaded environments when writing this. Based on that, I assumed pr-agent wasn’t running in a multi-threaded context.

    That said, you’re right — it’s better to write code that’s safe even in those environments. But since the performance gain is quite minor, I’ll go ahead and close this PR.

    Thanks again for pointing it out!

    dst03106 avatar May 07 '25 07:05 dst03106