tools-python
`create_list_without_duplicates` Function Can Be Sped Up by Using a Set
The current implementation of `create_list_without_duplicates` performs a relatively expensive membership check against a list (`element not in list_without_duplicates`), which is O(n) per lookup:
https://github.com/spdx/tools-python/blob/8050fd9c41a92c75ec2ba9eb10ed9a919c375fa9/src/spdx_tools/spdx/document_utils.py#L51-L57
Using a set to keep track of unique elements will speed up this function, since set membership checks are O(1) on average. I did a time comparison between the current function and a set-based version:
```python
from typing import List, Any
import timeit


# Original implementation
def create_list_without_duplicates_original(list_with_potential_duplicates: List[Any]) -> List[Any]:
    list_without_duplicates = []
    for element in list_with_potential_duplicates:
        if element not in list_without_duplicates:
            list_without_duplicates.append(element)

    return list_without_duplicates


# Optimized implementation
def create_list_without_duplicates_optimized(list_with_potential_duplicates: List[Any]) -> List[Any]:
    seen_elements = set()
    list_without_duplicates = []
    for element in list_with_potential_duplicates:
        if element not in seen_elements:
            seen_elements.add(element)
            list_without_duplicates.append(element)

    return list_without_duplicates


# Test data
test_data = list(range(1000)) * 10

# Measure time for the original implementation
original_time = timeit.timeit(lambda: create_list_without_duplicates_original(test_data), number=1000)

# Measure time for the optimized implementation
optimized_time = timeit.timeit(lambda: create_list_without_duplicates_optimized(test_data), number=1000)

print(f"Original implementation time: {original_time:.1f} seconds")
print(f"Optimized implementation time: {optimized_time:.1f} seconds")
```
The results are:

```
Original implementation time: 35.6 seconds
Optimized implementation time: 0.2 seconds
```
The original function is about 178x slower on this input. I'll put in a PR for discussion.
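As a side note, for hashable elements (which the set-based version already assumes), the same order-preserving deduplication can be expressed with `dict.fromkeys`, since dicts preserve insertion order in Python 3.7+. A minimal sketch, not part of the proposed PR:

```python
from typing import Any, List


def create_list_without_duplicates_fromkeys(list_with_potential_duplicates: List[Any]) -> List[Any]:
    # dict keys are unique and preserve insertion order (Python 3.7+),
    # so this keeps the first occurrence of each element, like the loop versions
    return list(dict.fromkeys(list_with_potential_duplicates))


print(create_list_without_duplicates_fromkeys([3, 1, 3, 2, 1]))  # → [3, 1, 2]
```

Both this and the set-based loop require the elements to be hashable, so whichever variant is chosen, it's worth checking that the SPDX element types involved satisfy that.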