tools-python icon indicating copy to clipboard operation
tools-python copied to clipboard

`create_list_without_duplicates` Function Can be Sped Up By Using Set

Open jspeed-meyers opened this issue 1 year ago • 0 comments

The current implementation of create_list_without_duplicates includes a relatively expensive check of if an element is in a list.

https://github.com/spdx/tools-python/blob/8050fd9c41a92c75ec2ba9eb10ed9a919c375fa9/src/spdx_tools/spdx/document_utils.py#L51-L57

Using a set to keep track of unique elements will speed up this function.

I did a time comparison between this current function and a function that uses a set.

from typing import List, Any
import timeit

# Original implementation
def create_list_without_duplicates_original(list_with_potential_duplicates: List[Any]) -> List[Any]:
    list_without_duplicates = []
    for element in list_with_potential_duplicates:
        if element not in list_without_duplicates:
            list_without_duplicates.append(element)
    return list_without_duplicates

# Optimized implementation
def create_list_without_duplicates_optimized(list_with_potential_duplicates: List[Any]) -> List[Any]:
    seen_elements = set()
    list_without_duplicates = []
    for element in list_with_potential_duplicates:
        if element not in seen_elements:
            seen_elements.add(element)
            list_without_duplicates.append(element)
    return list_without_duplicates

# Test data
test_data = list(range(1000)) * 10

# Measure time for the original implementation
original_time = timeit.timeit(lambda: create_list_without_duplicates_original(test_data), number=1000)

# Measure time for the optimized implementation
optimized_time = timeit.timeit(lambda: create_list_without_duplicates_optimized(test_data), number=1000)

print(f"Original implementation time: {original_time:.1f} seconds")
print(f"Optimized implementation time: {optimized_time:.1f} seconds")

The results are:

Original implementation time: 35.6seconds
Optimized implementation time: 0.2seconds

That original function is about 178x slower. I'll put in a PR for discussion.

jspeed-meyers avatar Feb 04 '24 01:02 jspeed-meyers