
unable to open .../WNdb/dict/index.adv

Open tommedema opened this issue 11 years ago • 16 comments

When I perform the following a couple of million times I get the error unable to open .../WNdb/dict/index.adv:

// assuming the usual natural setup (not shown in the original report):
// var natural = require('natural');
// var wordnet = new natural.WordNet();
function isWord(text, cb) {
    wordnet.lookup(text, function(results) {
        cb(Array.isArray(results) && results.length > 0);
    });
}

Is there anything I can do to resolve this?

tommedema avatar Dec 23 '13 19:12 tommedema

@tommedema I'm having a hard time pinpointing this (mainly because these wordnet lookups are really slow). It did crash for me but I didn't find the same error.

In the meantime, is it essential to use wordnet for this? I hate to be pushing this all the time to people, but the Trie is basically purpose-built for the 'isWord' test. I've built my trie using this code and you can do something like this:

trie.contains(word);

to get a synchronous and lightning-fast answer.
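For anyone curious why the Trie is so fast for this, the idea can be sketched in a few lines of plain JavaScript. This is a minimal illustration of the data structure, not natural's actual implementation — use natural's Trie in practice:

```javascript
// Minimal trie sketch: insert() adds a word, contains() tests membership.
// Each node is a plain object keyed by character; '$' marks a complete word.
function Trie() {
  this.root = {};
}

Trie.prototype.insert = function (word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    var ch = word[i];
    node = node[ch] || (node[ch] = {});
  }
  node.$ = true; // mark end of a complete word
};

Trie.prototype.contains = function (word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    node = node[word[i]];
    if (!node) return false;
  }
  return node.$ === true;
};

// usage:
var trie = new Trie();
['cat', 'car', 'dog'].forEach(function (w) { trie.insert(w); });
console.log(trie.contains('car')); // true
console.log(trie.contains('ca'));  // false: a prefix, not a stored word
```

Every lookup walks at most `word.length` nodes and touches no files, which is why it can be both synchronous and fast.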

Here are some times (not rock solid benchmarks, but you get the idea) for comparison:

Using WordNet: crashed after about 4k lookups, after a minute or two
Using Trie: 23,588,700 lookups in ~39 seconds

Threw the code into a gist if you want to check it out https://gist.github.com/kkoch986/9899177

In the meantime, I'll tag this as a bug.

Thanks, -Ken

kkoch986 avatar Mar 31 '14 18:03 kkoch986

Thanks, I am no longer working on this issue.

tommedema avatar Mar 31 '14 18:03 tommedema

Ok, no problem. Just curious, did you manage to resolve the error?

kkoch986 avatar Mar 31 '14 20:03 kkoch986

I didn't :)

tommedema avatar Mar 31 '14 20:03 tommedema

As already mentioned, the WordNet module is bare bones and notoriously underperforming except for simple lookups. You may want to look at https://github.com/moos/wordpos, built on top of natural's WordNet, which optimizes performance using additional fast-index files and cached disk reads.

Although for simple isWord operations, I agree @kkoch986's suggestion might be better.

moos avatar May 03 '14 13:05 moos

Actually going to close this unless someone else runs into a similar problem. I think it's safe to say the WNdb code is not best used directly this way. @moos I haven't had a chance to try the new wordpos module but it looks pretty cool, thanks for the tip!

kkoch986 avatar May 05 '14 13:05 kkoch986

I encounter the same error using the wordnet.lookup(word, cb) API. If I wait a few seconds I get the same error for data.adv. Both index.adv and data.adv exist on disk at the reported location and are readable under the current user.

Edit: some more debugging: this appears to be a problem with too many open file handles:

{ [Error: EMFILE, open '/home/aaron/blah/blah/node_modules/WNdb/dict/index.adv']
  errno: 20,
  code: 'EMFILE',
  path: '/home/aaron/blah/blah/node_modules/WNdb/dict/index.adv' }

http://stackoverflow.com/questions/8965606/node-and-error-emfile-too-many-open-files

It looks like index_file.js and data_file.js may not be appropriately calling back the file close callback in their, um, callback...

Ouch, this is thorny: even if the file handles are in theory (eventually) being closed, because the API is async we can have a potentially unlimited number of pending calls, and therefore open file handles. Given I'm looking up hundreds, possibly thousands, of synonyms, this is likely the case :(

ahamid avatar Oct 14 '14 15:10 ahamid

natural opens the index file on each lookup, and if you've got thousands of simultaneous lookups, that's how many open files you'll have. wordpos is optimized for multiple async reads and is much faster. You could combine wordpos' getPOS() or isX() method with its lookupX() for better performance than natural's lookup().

moos avatar Oct 14 '14 18:10 moos

Going to reopen this. It seems like the issue is around the files not being closed correctly; I'll try to dig in further and see if I can find anything.

kkoch986 avatar Oct 15 '14 13:10 kkoch986

It turns out the Bluebird promise library's map function supports a concurrency option that can limit pending promises, so I used that to work around this problem:

Promise.map words, ((word) => @lookupWordNetInfo word), concurrency: 10
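The same idea can be done in plain JavaScript without Bluebird. This is a rough sketch of a concurrency limiter — the names here (`mapLimit`, `lookupWordNetInfo`) are illustrative, not from any of the modules discussed:

```javascript
// Run an async mapper over items with at most `limit` calls in flight at
// once, which keeps the number of simultaneously open file handles bounded.
function mapLimit(items, limit, mapper) {
  var results = new Array(items.length);
  var next = 0;

  // Each worker pulls the next unclaimed index until the list is exhausted.
  function worker() {
    if (next >= items.length) return Promise.resolve();
    var i = next++;
    return Promise.resolve(mapper(items[i])).then(function (res) {
      results[i] = res;
      return worker();
    });
  }

  var workers = [];
  for (var k = 0; k < Math.min(limit, items.length); k++) {
    workers.push(worker());
  }
  return Promise.all(workers).then(function () { return results; });
}

// usage, mirroring the CoffeeScript above:
// mapLimit(words, 10, lookupWordNetInfo).then(function (infos) { /* ... */ });
```

With the limit set to 10, at most 10 lookups (and hence at most 10 of natural's per-lookup file opens) are outstanding at any moment, which is why this sidesteps EMFILE.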

ahamid avatar Oct 15 '14 16:10 ahamid

@ahamid so do you think the problem is too many concurrent calls? That could potentially explain why too many files are open at once.

kkoch986 avatar Oct 17 '14 20:10 kkoch986

@kkoch986 Yeah, I'm pretty sure that's the case (well, I haven't proved the opposite — that files aren't eventually getting closed — but the code looked fine on casual inspection). It's just the tradeoff of using an async-only API. I did not get around to using wordpos since the map trick did the job; it has become my go-to hammer for this sort of thing.

ahamid avatar Oct 23 '14 02:10 ahamid

Yea, I think it's worth a closer look, maybe an option to just load the thing into memory. No reason to keep reading it from files every time anyway, especially if you're doing a large number of lookups.
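Loading the index into memory could be as simple as parsing each line of a WNdb index file once and keeping the lemmas in a Set. A rough sketch follows — it assumes WordNet's documented index.* layout, where the lemma is the first space-separated field on each data line and the license header lines begin with whitespace:

```javascript
// Sketch: build an in-memory word set from WordNet-style index file contents.
// Data lines start with the lemma; header/license lines begin with spaces.
function buildIndex(fileContents) {
  var words = new Set();
  fileContents.split('\n').forEach(function (line) {
    if (!line || /^\s/.test(line)) return; // skip blanks and the header
    words.add(line.split(' ')[0]);
  });
  return words;
}

// usage: read each dict/index.* file once at startup, then every lookup
// is a synchronous Set.has() with no file handles involved:
// var fs = require('fs');
// var index = buildIndex(fs.readFileSync(indexPath, 'utf8'));
// index.has('dog');
```

The one-time read costs a few MB of memory per part of speech, but it removes file I/O from the lookup path entirely, so EMFILE can no longer occur.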

kkoch986 avatar Oct 23 '14 13:10 kkoch986

So just to give everyone the latest news on this, I am kicking off a rewrite of the natural wordnet layer which should result in cleaner code and better performance. Hopefully in the next few weeks I'll have something to show for this and we can finally close this issue.

kkoch986 avatar Jan 22 '15 01:01 kkoch986

I think wordpos already solves this problem — not only that, its 'fastIndex' provides a 30x performance boost over natural's WordNet methods. I'm happy to contribute any or all parts of wordpos's code to this effort, either as a rewrite, a sub-module, or a drop-in plugin. If you go the wholesale-rewrite route, I'm afraid it'll break wordpos since it was built on top of the WordNet module's API.

moos avatar Feb 08 '15 18:02 moos

@moos see #211 and #170 — the plan is to reimplement for performance/stability while maintaining the base API.

There's a good chance we will build more functionality on top of the basic API, but the main plan is to at least stabilize the code using the same API and move the wordnet downloading to an in-library corpus manager. Would love to have your input on this whole effort as well; I'm just getting into the actual wordnet files and coming up with a plan for indexing them more efficiently.

kkoch986 avatar Feb 09 '15 05:02 kkoch986