annoy Parallelize build?

Conceptually, if the trees are independent, shouldn't we be able to build the trees in parallel across multiple cores? Is this something that could be implemented easily?

I haven't had a chance to dive deeply into the code/algorithm yet so I may be misunderstanding something.

May 29 '15 18:05 ravimody

It wouldn't be too hard – you would need a mutex in a couple of places and also watch out for reallocations, but other than that it should be easy

May 29 '15 18:05 erikbern

Cool thanks, I'll look into it (although my C++ is a little rusty :))

Jun 01 '15 21:06 ravimody

Has there been any further work on this?

Nov 17 '17 20:11 thomas4g

not afaik :(

Nov 17 '17 21:11 erikbern

No work on it from my end.

Nov 17 '17 21:11 ravimody

Bummer. I'd love to help out, but I don't think I'm familiar enough with annoy yet (or frankly competent enough with C++) to spearhead anything. If anyone starts working on this and wants a hand, I'm happy to help out.

Nov 17 '17 23:11 thomas4g

Actually, I take it back. It looks like a pretty simple change. If I understand it properly, it's line 496 of annoylib.h:

_roots.push_back(_make_tree(indices));

That can get parallelized, but with a mutex around _roots to avoid simultaneous push_back calls, right?

Nov 17 '17 23:11 thomas4g

roughly, but it's a bit more complicated._make_tree also modifies the datastructure. You also need a mutex around line 643-644 and line 702-703. Maybe that mutex could just be folded into _allocate_size and we could rename the function to something like _add_item ... probably makes more sense

then you obviously need to create pthreads and collect them afterwards... honestly I haven't done that in like 10 years, but iirc it's not too hard

Nov 18 '17 03:11 erikbern

@erikbern Actually I was working on a branch to add this just this weekend. It doesn't quite work yet though (I can only get it to run the precision_test.cpp example, only a debug binary works and it crashes sometimes even then); should I submit a PR?

I didn't touch anything towards the bottom of annoylib.h (near those lines that you mentioned), that might be the problem...

Nov 20 '17 11:11 tjrileywisc

sure, feel free to submit a PR, just make sure to put "WIP" in the subject or something

you definitely need a mutex around those lines, that's probably the issue :)

Nov 20 '17 13:11 erikbern

@tjrileywisc what a coincidence! I also started working on this over the weekend. Mine also doesn't quite work... but I've pushed a copy to my fork: https://github.com/thomas4g/annoy/tree/parallelize_build

I hope to keep working on it tonight, but if you're further along let me know if you'd like any help! Feel free to ping me here or shoot me an email [email protected]

Nov 20 '17 16:11 thomas4g

@thomas4g Just took a peak at your fork - I'm actually using std::thread instead of pthread . I think Windows doesn't have support for pthread built in. std::thread should be available for gcc and MSVC.

Nov 20 '17 16:11 tjrileywisc

@tjrileywisc ahh, I wanted to use that but got thrown off by the build not supporting my #include <mutex> by default. I didn't want to fiddle with build settings.

Nov 20 '17 16:11 thomas4g

Just submitted a PR for my build_trees_threaded branch.

https://github.com/spotify/annoy/pull/246

Nov 21 '17 01:11 tjrileywisc

guys, is there any news on the multicore index building?

Jul 01 '19 21:07 denkuzin

I thought I might have a go at this. The actual threading machinery is all straightforward but I'm having some trouble seeing exactly which bits of _make_tree() modify common data and need to be guarded with a mutex. As per previous comments I wrapped some functionality into a thread safe _add_item() function but I'm still getting segfaults indicating data is getting modified out from under my threads somewhere else.

S _add_item() {
    std::lock_guard<std::mutex> guard(_mutex);
    _allocate_size(_n_nodes + 1);
    S item = _n_nodes++;
    return item;
}

Any help with this is appreciated.

Nov 25 '19 17:11 os-gabe

interesting – you're probably right that it should be fairly straightforward, but i can't think of other critical sections off the top of my head

Nov 25 '19 18:11 erikbern

Unless I'm misunderstanding the code this may be more complicated than I first thought. It appears that there are many places within _make_tree that touch the underlying data. For example _get would need a mutex since it could be attempting to access _nodes while another thread is calling _add_item. But worse, since it returns a pointer, anywhere that pointer is used would also need a mutex. I believe this is looking like you would wind up needing mutexes around quite a large portion of _make_tree which would mostly undo the benefits of parallelizing it. It's possible I'm still not understanding the code fully though.

Nov 25 '19 19:11 os-gabe

From what you say I think the main issue is that the underlying memory can be reallocated at any point in time, and that invalidates any pointers held by any other thread.

But those reallocations are actually pretty rare so there should be some way to fix.

I haven't dealt with concurrency code in C++ since maybe 2007 so my knowledge is a bit rusty but couldn't you use a shared lock for this? Almost all access will be nonexclusive (so near-zero overhead), but the few times when you need to reallocate the underlying storage, you would have to acquire an exclusive lock. Does that make sense?

Nov 25 '19 23:11 erikbern

I meant a shared mutex. This looks like the right concurrency primitive: https://en.cppreference.com/w/cpp/thread/shared_mutex

So basically acquire a shared lock when writing individual vectors, acquire an exclusive lock when you have to resize the underlying data storage.

But I'm mostly speculating, could be wrong :)

Nov 26 '19 05:11 erikbern

I think what you are saying makes sense. Thanks for the tip on shared mutex - I hadn't seen that before. It was introduced in c++17 though which may be it's own problem.

I'll take another look and see what I can figure out.

Nov 26 '19 16:11 os-gabe

@os-gabe Hey, Is it solved? Any insights on how to parallelize build?

Jan 27 '20 12:01 chikubee

no, this would have to be implemented by someone

Jan 27 '20 13:01 erikbern

@erikbern That's sad. I am trying to build an index for over 1M vectors and it is crashing even with on_disk_build. The process took up more than 30 GB memory and crashed.

Jan 27 '20 14:01 chikubee

i'm not sure if parallelization would have helped, though. do you know what's causing it to crash?

Jan 27 '20 15:01 erikbern

Yeah agreed that parallelization probably would not help.

1M vectors isn't that much unless they are extremely high dimensional vectors. Did you try with a smaller number of vectors to see where it starts failing?

Jan 27 '20 15:01 ravimody

good point – annoy isn't meant for super high dimensionality, so if that's what you're facing then you should probably run dimensionality reduction outside of annoy first!

Jan 27 '20 15:01 erikbern

@os-gabe Hey, Is it solved? Any insights on how to parallelize build?

Unfortunately I had to move on to other things and did not get parallel build working

Jan 27 '20 18:01 os-gabe

annoy annoy copied to clipboard

Parallelize build?

annoy
annoy copied to clipboard