
Enhancements and Refactoring of Python Code Extraction Methods

Open eli64s opened this issue 2 years ago • 1 comments

PR Title: Enhancements and Refactoring of Python Code Extraction Methods

PR Description: This pull request refactors and extends the Code_search.ipynb notebook, which extracts Python functions and generates their text embeddings. The proposed changes make the function extraction more robust and the code easier to use and maintain.

Included updates:

  1. Normalization of File Paths: Update code to use the relative_to() method of pathlib.Path. The previous str.replace() call can produce incorrect results, because it removes every occurrence of the root directory string, including occurrences that are not the leading root prefix. Consider the example below:

    import pandas as pd
    from pathlib import Path
    
    data = {'file_path': [
        'repo/main/src/file1_copy/other/repo/main/src/file1',
        'repo/main/src/file1_copy/file1',
    ]}
    df = pd.DataFrame(data)
    
    # Approach 1: Path.relative_to()
    root_dir = Path('repo/main/src')
    df['Path.relative_to()'] = df['file_path'].map(lambda x: Path(x).relative_to(root_dir))
    
    # Approach 2: str.replace()
    root_dir = 'repo/main/src'
    df['str.replace()'] = df['file_path'].map(lambda x: x.replace(root_dir, ''))

    # Resulting DataFrame:
       file_path                                           Path.relative_to()                    str.replace()
    0  repo/main/src/file1_copy/other/repo/main/src/file1  file1_copy/other/repo/main/src/file1  /file1_copy/other//file1
    1  repo/main/src/file1_copy/file1                      file1_copy/file1                      /file1_copy/file1

    As seen above, Path.relative_to() removes only the leading root prefix, so the relative path stays correct even when the root directory string appears again later in the file path. str.replace() strips every occurrence and also leaves a leading slash.

  2. Capture async def: Function extraction now captures both def and async def definitions.

  3. Refactor get_functions: The function now opens files with a context manager (a with statement), so file handles are closed reliably even if reading fails.

  4. Refactor get_until_no_space: Update logic to prevent potential index out of range errors.

  5. Improve Directory Search: Update code to use pathlib.Path.glob() to search for files instead of the original combination of os.walk() and glob(). os.walk() traverses the directory tree recursively and yields a (dirpath, dirnames, filenames) tuple for each directory it visits. Path.glob() with a recursive pattern yields the matching paths lazily from a single call, which simplifies the code and may reduce overhead.

  6. Implement extract_functions_from_repo: Wraps the code file search, the function extraction, and the printing of results in a single function.
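
To illustrate items 2 to 4, here is a minimal sketch of how the extraction could look. It is not the code from this pull request: the function names get_functions and get_until_no_space come from the description above, but their bodies, the DEF_PREFIXES constant, and the keys of the returned dictionaries are assumptions made for illustration. Like the approach described, it only looks at top-level definitions that start in column 0.

```python
from pathlib import Path

# Hypothetical constant: prefixes that start a top-level function (item 2).
DEF_PREFIXES = ("def ", "async def ")


def get_until_no_space(lines, start):
    """Collect lines from `start` until the next line that begins in column 0."""
    collected = [lines[start]]
    # Iterating over a slice never indexes past the end of the list (item 4).
    for line in lines[start + 1:]:
        # A non-empty line that is not indented (and is not a closing
        # parenthesis of a multi-line signature) ends the function body.
        if line and line[0] not in (" ", "\t", ")"):
            break
        collected.append(line)
    return "\n".join(collected).rstrip()


def get_functions(filepath):
    """Return one dict per top-level function found in `filepath`."""
    # The context manager closes the file even if reading fails (item 3).
    with open(filepath, encoding="utf-8") as f:
        lines = f.read().splitlines()

    functions = []
    for i, line in enumerate(lines):
        # str.startswith accepts a tuple, so both prefixes are checked at once.
        if line.startswith(DEF_PREFIXES):
            # Text after "def " and before the opening parenthesis is the name.
            name = line.split("def ", 1)[1].split("(", 1)[0].strip()
            functions.append({
                "function_name": name,
                "code": get_until_no_space(lines, i),
                "filepath": str(filepath),
            })
    return functions
```

Calling get_functions(Path("some_module.py")) on a file of your own (the file name here is a placeholder) returns a list of dictionaries, one per top-level def or async def.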

These changes collectively enhance the functionality and maintainability of the script, providing better support for future development and analysis tasks involving the openai-cookbook repository.
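
For the directory search in item 5, a minimal sketch of the Path.glob() approach might look like the following. The helper name find_python_files is hypothetical and does not appear in the notebook; the sketch only shows how a recursive glob pattern combines with Path.relative_to() from item 1.

```python
from pathlib import Path


def find_python_files(root):
    """Yield (path, path relative to root) for every .py file under root."""
    root = Path(root)
    # "**/*.py" matches .py files in root itself and in all subdirectories.
    # Sorting makes the order deterministic across platforms.
    for path in sorted(root.glob("**/*.py")):
        yield path, path.relative_to(root)
```

Sorting consumes the whole generator up front; dropping sorted() keeps the search lazy if a stable order is not needed.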

Best Regards, Eli

eli64s, May 28 '23 05:05

I will try to review this week. Thanks for the detailed and high-quality contribution!

ted-at-openai, May 30 '23 20:05

By the way, really appreciate you taking the time to describe and document your improvements. Always love to see it. :)

ted-at-openai, Jul 12 '23 00:07