`Label.from_category` is O(num_classes) complexity, for every single sample
This function is called when loading every single sample, yet its complexity is O(num_classes) because of the `.index()` call, which is probably overkill. A simple dict lookup would be sufficient and much faster here.
https://github.com/pytorch/vision/blob/96aa3d928c6faaf39defd846966421679244412d/torchvision/prototype/features/_label.py#L35-L43
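For concreteness, a minimal sketch of the dict-based alternative (the helper name `label_from_category` is illustrative, not the actual prototype API):

```python
# Sketch of the proposed O(1) lookup, assuming the mapping can be built
# once per dataset instead of scanning the list for every sample.
categories = ["cat", "dog", "bird"]

# Built once, e.g. when the dataset/info object is constructed ...
category_to_index = {cat: idx for idx, cat in enumerate(categories)}

# ... so that each sample load is a single dict lookup instead of a scan.
def label_from_category(category: str) -> int:
    return category_to_index[category]  # O(1) vs O(num_classes) for .index()

assert label_from_category("dog") == categories.index("dog") == 1
```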
CC @pmeier
I agree that there is a theoretical benefit, but does this have an impact in a practical case? Benchmarking it on my machine gives me roughly 3 ns per category in categories. Meaning even for ImageNet (the dataset with the most categories we currently have on record) it takes roughly 3 µs for this call:
```python
import random
import timeit

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = list(range(num_categories))
total = timeit.timeit(
    lambda: categories.index(random.randint(0, num_categories - 1)),
    number=num_runs,
)
print(
    f"Looking up the category by index with {num_categories} total categories "
    f"took {total / num_runs * 1e6:.1f} µs"
)
```
Your proposal means we should switch the `categories` attribute to a dictionary with the categories as keys and the indices as values, correct? Do you want to keep the functionality to create a label from a categories sequence?
Categories are strings, not integers. String comparison will take much longer. I did observe a significant difference the last time I benchmarked this.
Makes no significant difference for me:
```python
import random
import string
from time import perf_counter_ns

num_categories = 1_000  # maximum that we currently have
num_runs = 1_000_000

categories = ["".join(random.choices(string.ascii_lowercase, k=20)) for _ in range(num_categories)]

timings = []
for _ in range(num_runs):
    category = random.choice(categories)
    tic = perf_counter_ns()
    categories.index(category)
    tac = perf_counter_ns()
    timings.append((tac - tic) * 1e-9)

print(
    f"Looking up the category by index with {num_categories} total categories took "
    f"{sum(timings) / num_runs * 1e6:.1f} µs"
)
```
With strings of 20 characters, the lookup still scales linearly at roughly 4 ns per category, i.e. about 4 µs for ImageNet.
> I did observe a significant difference last time I benchmarked this
Could you share your benchmark?
```python
import random
import string

str_len = 20
num_categories = 1000
batch_size = 512

categories = ["".join(random.choices(string.ascii_lowercase, k=str_len)) for _ in range(num_categories)]
mapping = {cat: idx for idx, cat in enumerate(categories)}
batch_targets = [random.choice(categories) for _ in range(batch_size)]
```

```python
%%timeit
[categories.index(cat) for cat in batch_targets]
```

```
2.59 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%%timeit
[mapping[cat] for cat in batch_targets]
```

```
14 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
A few ms saved per batch is always worth taking, especially considering how simple the solution is, and given that we're trying to showcase the benefits of using datapipes over our current datasets. Data-loading speed will be the main factor for adoption.
> Do you want to keep the functionality to create a label with a categories sequence?
Sorry, not sure what you mean here?
Right now we create the `Label` with a sequence of categories:
https://github.com/pytorch/vision/blob/2e70ee1af3bb3c5ba0f9a8bcf16cd1611831690f/torchvision/prototype/features/_label.py#L18
https://github.com/pytorch/vision/blob/2e70ee1af3bb3c5ba0f9a8bcf16cd1611831690f/torchvision/prototype/features/_label.py#L40
Do we want to keep this and only use dictionaries internally, or should we switch to dictionaries everywhere? The latter would also mean touching the info dictionary of each dataset that provides categories:
https://github.com/pytorch/vision/blob/2e70ee1af3bb3c5ba0f9a8bcf16cd1611831690f/torchvision/prototype/datasets/_builtin/fer2013.py#L12-L17
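One possible middle ground (purely illustrative, not a settled design, and the class name `CategoryIndex` is made up for this sketch): keep accepting a categories sequence in the public API and derive the dictionary internally, so the per-dataset info files stay unchanged:

```python
from typing import Sequence

class CategoryIndex:
    """Illustrative helper: wraps a categories sequence and caches the
    category -> index mapping so lookups are O(1)."""

    def __init__(self, categories: Sequence[str]):
        self.categories = tuple(categories)  # public API stays sequence-based
        self._index = {cat: idx for idx, cat in enumerate(self.categories)}

    def __call__(self, category: str) -> int:
        return self._index[category]

# Usage with an illustrative categories tuple:
to_index = CategoryIndex(("angry", "happy", "neutral"))
assert to_index("happy") == 1
```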
I don't know yet, honestly. But I'm open to removing this categories logic altogether. We're still in prototype stage and our main focus is loading speed right now.