open database files
Loading a package opens its raw documentation database file and leaves it open for future access. But the default limit on open files for a process can be as small as 256, and we have something close to 170 packages now. So loading them all just to see what's in them, as is done to generate a list of all the packages with their headlines for the documentation, can use up a lot of file descriptors.
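For context, the per-process limit mentioned above is the soft file-descriptor limit, which can be inspected and (up to the hard limit) raised with the shell's `ulimit` builtin:

```shell
# Show the current soft limit on open file descriptors for this process
ulimit -n

# Raise the soft limit for this shell session (cannot exceed the hard limit,
# which is shown by `ulimit -Hn`)
ulimit -n 1024
```

Raising the limit is only a workaround for users, of course; it doesn't fix the underlying resource usage.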
It might be better to wait until the documentation for the loaded package is needed and then load the entire database into memory, if the number of nodes is small enough. (Macaulay2Doc has more than 5000 nodes, so we don't want to load it into memory.)
Maybe the thing to do is to invent a datatype that implements a FIFO queue, to contain the open database files. Each time one is used, remove it and add it to the queue again. Each time the queue gets to a size of 200 or so, remove the first one and close it. Each time a database is encountered that is closed, reopen it and add it to the queue.
An LRU cache might be a better choice. Where should this go?
Yes, that's exactly the term for what I described.
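The LRU idea can be sketched as follows. This is only an illustration in Python of the intended behavior; the actual cache would live in Macaulay2's own code and hold database handles rather than plain files, and the class and method names here are hypothetical:

```python
from collections import OrderedDict

class LRUFileCache:
    """Keep at most `capacity` files open; close the least recently used.

    A sketch of the idea only: each access moves the entry to the
    most-recently-used end, and opening a new file past capacity evicts
    and closes the oldest entry.
    """

    def __init__(self, capacity=200):
        self.capacity = capacity
        self.open_files = OrderedDict()  # path -> open file handle

    def get(self, path):
        if path in self.open_files:
            # Already open: mark it as most recently used.
            self.open_files.move_to_end(path)
        else:
            # Closed (or never opened): reopen it, then evict the least
            # recently used entry if we are over capacity.
            self.open_files[path] = open(path, "rb")
            if len(self.open_files) > self.capacity:
                _, oldest_handle = self.open_files.popitem(last=False)
                oldest_handle.close()
        return self.open_files[path]
```

With a capacity of 200 or so, the process would never hold more open database files than that, regardless of how many packages are loaded.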
The database file is stored in an object of class Package under the key "raw documentation database", so searching for that string in the files in the directory M2/Macaulay2/m2 will locate all the uses of those databases.
We have 193 packages now.
This might be a silly question, but why do we store documentation in databases? Why not just text files?
Speed.
gdbm:
i4 : time help Macaulay2Doc
-- used 0.0410353 seconds
...
vs. man:
[mahrud@noether ~]$ time man bash
...
sys 0m0.124s
Is 0.06s worth the effort?
I don't understand the point of your timing comparison -- we don't have man pages for the Macaulay2 documentation.
The point is that simply reading a file for each documentation node is just as fast.
To be clear, I'm not suggesting this is what we should do. I can't even run proper experiments to compare, or just go in and fix this now, because I can't make sense of the code and all the places that databases pop up. This is just in response to your question about what experiments I did that tell me the speedup is not significant.
Okay -- so the speedup might be significant after all. I'll do a proper comparison. It would be great if the speedup were insignificant now.
I just wanted to chime in and point out that (1) the speedup is not insignificant: the 0.083s absolute difference translates to a 3x slowdown; (2) but on that scale and in that context it's probably irrelevant; (3) however, the original problem remains, and it seems much easier to just insert an LRU cache than to rework how the documentation is organized.
The size of man bash was about 100x that of help "Macaulay2Doc", which is actually nonexistent ... so those were bad examples.
Is there a disadvantage to having a single database instead of one per package?
That's a great idea!
To distinguish the items from various packages, one should prepend the name of the package to the documentation key -- that will be straightforward.
Then one has to decide what to do with the packages installed by the user in the user's application directory. Have a database for all of them, I guess. Same for any other directory where the user installs packages, and for any directory on the prefixPath:
i2 : stack prefixPath
o2 = /Users/dan/Library/Application Support/Macaulay2/local/
/Users/dan/src/M2/M2.git/M2/BUILD/dan/builds.tmp/einsteinium-development/usr-dist/
The routine uninstallPackage could remove the appropriate entries, too, since the name of the package is a prefix to the key.
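The key scheme and the prefix-based removal can be sketched like this. The separator and helper names below are assumptions for illustration, not the actual Macaulay2 implementation:

```python
# Sketch of one shared documentation database with package-prefixed keys.
# The " :: " separator is an assumption, chosen to echo how Macaulay2
# already writes qualified documentation keys.

SEPARATOR = " :: "

def db_key(package, node):
    """Build the key under which a node's documentation would be stored."""
    return package + SEPARATOR + node

def uninstall_package(db, package):
    """Remove every entry belonging to `package`, as uninstallPackage might.

    Because the package name is a prefix of the key, the entries to delete
    can be found by a simple prefix scan.
    """
    prefix = package + SEPARATOR
    for key in [k for k in db if k.startswith(prefix)]:
        del db[key]
```

A real gdbm file does not support efficient prefix queries, so this scan would visit every key in the shared database, but that cost is paid only when a package is uninstalled.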
The prefixPath is short, so it doesn't matter if those database files stay open all the time.