Pushing multithreading/standalone workers upstream
I've been working on a couple of major changes/additions to the project, and I have a couple of questions for the core maintainers:
Question 1
I'm implementing multithreading in the classifier train() method using the Node cluster API, but this isn't supported in Node <0.8, and since this project seems to support >=0.4 (according to the package.json), that would obviously be a problem for pushing my multithreading changes upstream. Is there a plan to deprecate old Node support? Maybe I could push these changes into a new major release version along with that deprecation?
I realise dropping this support might be out of the question, but I think pushing this upstream could be really beneficial; the gains for such a resource-heavy operation are obvious given Node's inherently single-threaded model.
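To make this a bit less abstract before the code lands, here's a rough sketch of the direction I'm taking: fork a cluster worker, run the CPU-bound train() there, and ship the trained classifier back over IPC. The serialize/restore calls are assumptions about the classifier API (JSON.stringify plus BayesClassifier.restore) and may need adjusting.

```js
// Sketch only: offloading classifier training to a forked process via the
// cluster API so the main event loop is not blocked. Assumes the classifier
// can be serialized with JSON.stringify and revived with
// natural.BayesClassifier.restore(); adjust if your classifier differs.
var cluster = require('cluster');
var natural = require('natural');

if (cluster.isMaster) {
  var worker = cluster.fork();

  worker.on('message', function (msg) {
    // Revive the trained classifier sent back from the worker process.
    var classifier = natural.BayesClassifier.restore(JSON.parse(msg.classifier));
    console.log(classifier.classify('did the tests pass'));
    worker.disconnect();
  });

  worker.send({
    documents: [
      { text: 'unit tests passed', label: 'build' },
      { text: 'deployed to production', label: 'ops' }
    ]
  });
} else {
  process.on('message', function (msg) {
    var classifier = new natural.BayesClassifier();
    msg.documents.forEach(function (doc) {
      classifier.addDocument(doc.text, doc.label);
    });
    classifier.train(); // the expensive, CPU-bound step runs off the main process
    process.send({ classifier: JSON.stringify(classifier) });
  });
}
```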
Question 2
I'm also creating a separate project for standalone 'classification workers', whereby an app running the core natural project could add train() jobs to a queue (any *MQ), and a worker in a 'classification farm' could then pick them up and run them in the new multithreaded environment. All of this would persist state via a central data store (a Redis cluster in my use case, but it could really be anything that can store JSON, e.g. CouchDB or Mongo).
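A minimal sketch of what one of these workers might look like, assuming Redis as both the queue and the store; the list name, job payload shape, and result key are all placeholders I've made up for illustration:

```js
// Sketch only: a standalone 'classification worker' that pops train() jobs
// from a Redis list and persists the resulting classifier as JSON. The queue
// name, payload shape, and result key are hypothetical; any *MQ or
// JSON-capable store could stand in for Redis here.
var redis = require('redis');
var natural = require('natural');

var client = redis.createClient();

function processNextJob() {
  // BLPOP blocks until a job arrives on the 'natural:train-jobs' list.
  client.blpop('natural:train-jobs', 0, function (err, reply) {
    if (err) return console.error(err);

    var job = JSON.parse(reply[1]); // e.g. { id: '...', documents: [{ text, label }, ...] }
    var classifier = new natural.BayesClassifier();

    job.documents.forEach(function (doc) {
      classifier.addDocument(doc.text, doc.label);
    });
    classifier.train();

    // Persist the trained model so the app that enqueued the job can restore
    // it later, e.g. with natural.BayesClassifier.restore(JSON.parse(...)).
    client.set('natural:classifier:' + job.id, JSON.stringify(classifier), function () {
      processNextJob();
    });
  });
}

processNextJob();
```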
This is obviously going to need changes to the core project to allow these queues to be set up and used. My question is: is this something you guys are interested in seeing pushed back upstream? I'm fairly sure I can abstract it to such a degree that it won't break the API at all and will fall back to the normal single-threaded model as it currently stands.
Also, the workers are going to be a separate project that I'm going to maintain, but I've come across projects that don't like upstream changes being made to support downstream/userland add-ons like this, and prefer those add-ons to monkey-patch code at runtime (or use some other method). Is that a philosophical problem you guys would have?
I realise all of this is completely academic without code to look at (the multithreading code is coming tomorrow or Friday, and the workers next week), but I want to know where I stand with possibly having to maintain my own 'multithreaded/worker' fork if you guys don't want these changes!
If no one objects, I don't mind bumping the minimum version of Node to 0.8 in the next release (v0.8 is almost 2 years old already).
With regard to the second issue (and somewhat the first), it would be nice to have these performance features, but it sounds like they would require some additional effort from developers and add some complexity to the classifier API.
I think these sound like great features, but it would be nice if there were a way to still access the existing API as well. It would help if you could give some idea of how extensive the modifications to the "core" project would be.
Thanks, -Ken
1. Yes, let's go to 0.8 for 0.1.28.
2. Yeah, that sounds VERY interesting. Honestly, if it's possible I'd prefer you not monkey-patch, but that's totally dependent on the scope of your changes. We wouldn't want to break you in the future.
It'd be cool to get a clearer vision of what you're doing.
Is it already possible to use multicore training? I have a Mac Pro, but training only uses one CPU.