`Label.from_category` is O(num_classes) complexity, for every single sample
This function is called when loading every single sample, yet its complexity is O(num_classes) because of the `.index()` call, which is probably overkill. A simple dict lookup would be sufficient and much faster here.
https://github.com/pytorch/vision/blob/96aa3d928c6faaf39defd846966421679244412d/torchvision/prototype/features/_label.py#L35-L43
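For concreteness, a minimal sketch of the dict-based alternative (the helper name `label_from_category` is illustrative, not the actual prototype API):

```python
# Sketch of the proposed O(1) lookup, assuming the mapping can be built
# once per dataset instead of scanning the list for every sample.
categories = ["cat", "dog", "bird"]

# Built once, e.g. when the dataset/info object is constructed ...
category_to_index = {cat: idx for idx, cat in enumerate(categories)}

# ... so that each sample load is a single dict lookup instead of a scan.
def label_from_category(category: str) -> int:
    return category_to_index[category]  # O(1) vs O(num_classes) for .index()

assert label_from_category("dog") == categories.index("dog") == 1
```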
CC @pmeier
I agree that there is a theoretical benefit, but does this have an impact in a practical case? Benchmarking it on my machine gives me roughly 3 ns per category in categories. Meaning even for ImageNet (the dataset with the most categories we currently have on record) it takes roughly 3 µs for this call:
```python
import random
import timeit

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = list(range(num_categories))
total = timeit.timeit(
    lambda: categories.index(random.randint(0, num_categories - 1)),
    number=num_runs,
)
print(
    f"Looking up the category by index with {num_categories} total categories "
    f"took {total / num_runs * 1e6:.1f} µs"
)
```
Your proposal means we should switch the `categories` attribute to a dictionary with the categories as keys and the indices as values, correct? Do you want to keep the functionality to create a label from a categories sequence?
Categories are strings, not integers. String comparison will take much longer. I did observe a significant difference the last time I benchmarked this.
Makes no significant difference for me:
```python
import random
import string
from time import perf_counter_ns

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = ["".join(random.choices(string.ascii_lowercase, k=20)) for _ in range(num_categories)]

timings = []
for _ in range(num_runs):
    category = random.choice(categories)
    tic = perf_counter_ns()
    categories.index(category)
    tac = perf_counter_ns()
    timings.append((tac - tic) * 1e-9)

print(
    f"Looking up the category by index with {num_categories} total categories took "
    f"{sum(timings) / num_runs * 1e6:.1f} µs"
)
```
With strings of 20 characters, the lookup still scales linearly at roughly 4 ns per category, i.e. about 4 µs for ImageNet.
> I did observe a significant difference last time I benchmarked this
Could you share your benchmark?
```python
import random
import string

str_len = 20
num_categories = 1000
batch_size = 512

categories = ["".join(random.choices(string.ascii_lowercase, k=str_len)) for _ in range(num_categories)]
mapping = {cat: idx for idx, cat in enumerate(categories)}
batch_targets = [random.choice(categories) for _ in range(batch_size)]
```

```python
%%timeit
[categories.index(cat) for cat in batch_targets]
```

```
2.59 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%%timeit
[mapping[cat] for cat in batch_targets]
```

```
14 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
A few ms saved per batch is always worth taking, especially considering how simple the solution is, and given that we're trying to showcase the benefits of using datapipes over our current datasets. Data-loading speed will be the main factor for adoption.
> Do you want to keep the functionality to create a label with a categories sequence?
Sorry, not sure what you mean here?
Right now we create the `Label` with a sequence of categories:
https://github.com/pytorch/vision/blob/2e70ee1af3bb3c5ba0f9a8bcf16cd1611831690f/torchvision/prototype/features/_label.py#L18
https://github.com/pytorch/vision/blob/2e70ee1af3bb3c5ba0f9a8bcf16cd1611831690f/torchvision/prototype/features/_label.py#L40
Do we want to keep this and only use dictionaries internally, or should we switch to dictionaries everywhere? The latter would also mean touching the info dictionary of each dataset that provides categories:
https://github.com/pytorch/vision/blob/2e70ee1af3bb3c5ba0f9a8bcf16cd1611831690f/torchvision/prototype/datasets/_builtin/fer2013.py#L12-L17
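One possible middle ground (purely illustrative, not a settled design, and the class name `CategoryIndex` is made up for this sketch): keep accepting a categories sequence in the public API and derive the dictionary internally, so the per-dataset info files stay unchanged:

```python
from typing import Sequence

class CategoryIndex:
    """Illustrative helper: wraps a categories sequence and caches the
    category -> index mapping so lookups are O(1)."""

    def __init__(self, categories: Sequence[str]):
        self.categories = tuple(categories)  # public API stays sequence-based
        self._index = {cat: idx for idx, cat in enumerate(self.categories)}

    def __call__(self, category: str) -> int:
        return self._index[category]

# Usage with an illustrative categories tuple:
to_index = CategoryIndex(("angry", "happy", "neutral"))
assert to_index("happy") == 1
```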
I don't know yet, honestly. But I'm open to removing this categories logic altogether. We're still in prototype stage and our main focus is loading speed right now.