replace gdbm by cdb
cdb is faster than gdbm, which we currently use, so we should switch to it. See the benchmarks at https://gist.github.com/epitron/1625d93d0b82c32e7395
Other good contenders:
- https://capnproto.org
- https://msgpack.org
- protobuf
They all integrate with almost any language, which would allow using them for other purposes, like pickling math objects and unpickling them in the browser or in other math software.
It would also be nice if the database files were architecture-independent. Then, in the Debian package, we could move them over from the architecture-dependent macaulay2 package to the architecture-independent macaulay2-common. We could just build the latter on amd64, and not have to worry about generating all of the examples on every single architecture.
I don't think they're really architecture dependent in that sense. Only the endianness is different. But yes, we might as well move the common package to Macaulay2-doc
Currently I'm weighing pros and cons of the following two:
- https://dbmx.net/tkrzw/
- https://github.com/google/leveldb
Both are modern database managers with multithreaded support and process-locking, plus many more niceties.
The three suggestions I gave earlier in this thread are still good add-ons to M2 for serializing the hypertext objects before storing them (rather than using toExternalString), so I'm wondering which combination is easiest to implement.
> I don't think they're really architecture dependent in that sense. Only the endianness is different. But yes, we might as well move the common package to Macaulay2-doc
If the endianness causes the files to differ, then they are architecture dependent and can't be moved over.
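To make the endianness point concrete, here is a minimal Python sketch (not tied to gdbm or any of the libraries above; the value 53 is just an arbitrary example). It shows that the same integer serializes to different bytes depending on byte order, which is why a format that writes native-order integers produces files that differ between architectures, while a format that fixes the byte order does not.

```python
import struct

value = 53  # arbitrary example integer

# The same 32-bit integer written with explicit byte orders.
little = struct.pack("<I", value)  # little-endian layout (e.g. amd64)
big = struct.pack(">I", value)     # big-endian layout (e.g. s390x)

# The two encodings are byte-for-byte reversals of each other.
same_bytes = little == big

# Reading little-endian bytes as if they were big-endian gives a wrong value.
misread = struct.unpack(">I", little)[0]

# A portable format picks one byte order and uses it on every machine.
roundtrip = struct.unpack("<I", little)[0]

print(same_bytes, misread, roundtrip)
```

A database format that always stores integers in one fixed order can be copied between machines regardless of their native endianness.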
Looks like both Tkrzw and LevelDB are architecture independent: I asked their authors specifically about word size and endian-ness in https://github.com/estraier/tkrzw/issues/3#issuecomment-739469365 and https://github.com/google/leveldb/issues/856#issuecomment-739710458
leveldb is already in Debian (https://tracker.debian.org/pkg/leveldb), so that would be my preference. :) I also did a couple quick experiments and confirmed that if I create a leveldb database on my amd64 machine, I can read it in an i386 chroot without problems.
Excellent! LevelDB is definitely more established, but tkrzw is the successor (released Oct 2020!) to Tokyo Cabinet and Kyoto Cabinet, both of which are in Debian, so I presume it'll land there soon. Compiling it is really quick, so it wouldn't slow things down, and it has the fewest prerequisites. Also, having both a HashDBM and TreeDBM seems like an attractive feature for our purposes.
That said, I like both of their APIs, both based on C++17 and POSIX.
FYI, I've been working on getting Macaulay2 to work with leveldb and can confirm that endianness doesn't appear to matter. (I was a little concerned about the "pretty sure" in https://github.com/google/leveldb/issues/856#issuecomment-739710458 lol!) After copying the raw documentation database for FirstPackage from my amd64 machine to a Debian s390x porterbox:
i1 : x = openDatabase "~/rawdocumentation-leveldb.db"
o1 = /home/dtorrance/rawdocumentation-leveldb.db
o1 : Database
i2 : x#"FirstPackage"
o2 = new HashTable from {Headline => "an example Macaulay2 package",
"linenum" => 53, "filename" =>
"/home/profzoom/src/macaulay2/M2/M2/Macaulay2/packages/FirstPackage.m2",
Description => 1:(DIV{PARA{TEX{"",EM{"FirstPackage"}," is a basic
package to be used as an example."}}}), Key => FirstPackage, symbol
DocumentTag => new DocumentTag from
{"FirstPackage","FirstPackage","FirstPackage"}, Caveat =>
DIV{HEADER2{"Caveat"},DIV{PARA{TEX{"Still trying to figure this
out."}}}}, Subnodes => MENU{TO{new DocumentTag from
{"firstFunction","firstFunction","FirstPackage"}}}}
i3 : x#"firstFunction"
o3 = new HashTable from {"linenum" => 53, symbol DocumentTag => new
DocumentTag from {"firstFunction","firstFunction","FirstPackage"},
PrimaryTag => new DocumentTag from
{(firstFunction,ZZ),"firstFunction(ZZ)","FirstPackage"}, "filename" =>
"/home/profzoom/src/macaulay2/M2/M2/Macaulay2/packages/FirstPackage.m2"}
i4 : version#"machine"
o4 = s390x-Linux-Debian-11
i5 : version#"endianness"
o5 = abcd
Good news.
D'oh, I should have done a little more research before I started coding -- leveldb doesn't support multiple processes. My current draft works just fine as long as only one M2 process is running, but as soon as you fire up a second one (e.g., when generating examples...):
Macaulay2, version 1.18.0.1
../../Macaulay2/m2/packages.m2:348:40:(1):[36]: error: IO error: lock /home/profzoom/src/macaulay2/M2/M2/BUILD/doug/usr-dist/common/share/doc/Macaulay2/Parsing/cache/rawdocumentation-leveldb/LOCK: Resource temporarily unavailable : /home/profzoom/src/macaulay2/M2/M2/BUILD/doug/usr-dist/common/share/doc/Macaulay2/Parsing/cache/rawdocumentation-leveldb
I don't think there's any way around this -- from their documentation:
A database may only be opened by one process at a time.
Oops, too bad.
Doesn't leveldb support waiting for the database to be unlocked? Tkrzw allows multiple readers:
Tokyo Cabinet provides two modes to connect to a database: "reader" and "writer". A reader can perform retrieving but neither storing nor deleting. A writer can perform all access methods. Exclusion control between processes is performed when connecting to a database by file locking. While a writer is connected to a database, neither readers nor writers can be connected. While a reader is connected to a database, other readers can be connect, but writers can not. According to this mechanism, data consistency is guaranteed with simultaneous connections in multitasking environment.
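For illustration, the reader/writer exclusion described in that quote can be sketched with advisory file locks. This is a minimal Python sketch using `flock` on Unix; it is an assumption for demonstration only, since the actual libraries may use a different locking mechanism, and `try_lock` and the `LOCK` file name are hypothetical.

```python
import fcntl
import os
import tempfile

def try_lock(path, exclusive):
    """Open path and try a non-blocking flock; return the fd, or None if refused."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

lock_path = os.path.join(tempfile.mkdtemp(), "LOCK")

# Two "readers" can hold shared locks at the same time.
reader1 = try_lock(lock_path, exclusive=False)
reader2 = try_lock(lock_path, exclusive=False)

# A "writer" is refused while any reader is connected.
writer = try_lock(lock_path, exclusive=True)

# Once the readers disconnect, the writer can get in.
os.close(reader1)
os.close(reader2)
writer_after = try_lock(lock_path, exclusive=True)
```

With a non-blocking lock the refused caller gets an error right away (similar in spirit to the "Resource temporarily unavailable" message above); without `LOCK_NB` it would wait until the lock is released.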
> Doesn't leveldb support waiting for the database to be unlocked?
The current behavior is to open the raw documentation database when a package is loaded, so the databases for each of the packages that are loaded by default would always be locked by the first M2 process.
> Tkrzw allows multiple readers:
It looks like that behavior was changed when Tokyo Cabinet became Tkrzw. From https://dbmx.net/tkrzw/#tips_multitasking:
However, they are not designed to be shared among multiple processes. When one process opens a database file as a writer, an exclusive lock is applied to the file so that other processes trying to open the same database file are blocked until the database is closed.
That page does suggest a workaround:
A workaround way to share the same database among multiple processes is that each operation to access the database is done between opening the database and closing the database.
So I suppose one solution would be to only open the raw documentation database when a user actually runs help, and then close it right afterward.
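As a rough sketch of that open-per-operation pattern, here it is in Python with the stdlib `dbm.dumb` module standing in for the real backend. That substitution is an assumption for illustration: `dbm.dumb` does no file locking itself, and `store`, `lookup`, and the `rawdoc` file name are hypothetical helpers, not Macaulay2 or Tkrzw API.

```python
import dbm.dumb
import os
import tempfile

def store(path, key, value):
    """Open the database, write one entry, and close it again right away."""
    with dbm.dumb.open(path, "c") as db:
        db[key] = value

def lookup(path, key):
    """Open the database read-only, fetch one entry, and close it right away."""
    with dbm.dumb.open(path, "r") as db:
        return db.get(key)  # None if the key is absent

path = os.path.join(tempfile.mkdtemp(), "rawdoc")
store(path, "FirstPackage", "an example Macaulay2 package")
headline = lookup(path, "FirstPackage")
missing = lookup(path, "noSuchKey")
```

Because the file is only held open for the duration of a single call, another process gets a chance to open it between calls. The cost is one open and close per lookup.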
By the way, I also played around with using TinyCDB. Its file format is architecture-independent and it supports multiple processes, but using it would change the behavior of how Macaulay2 Database objects work. In particular, reading from a database while you're currently creating it isn't possible. So for example, this code from examples Database would fail:
filename = temporaryFileName () | ".dbm"
x = openDatabaseOut filename
x#"first" = "hi there"
x#"first"
> It looks like that behavior was changed when Tokyo Cabinet became Tkrzw.
Alright, I believe Tokyo Cabinet is already packaged for debian and brew.
> The current behavior is to open the raw documentation database when a package is loaded, so the databases for each of the packages that are loaded by default would always be locked by the first M2 process.
Tokyo Cabinet uses mmap for loading the buckets into memory, so it should be pretty cheap to open, read, and close a database whenever needed.
Also, we should use a single database file in each application directory, see #1643.
Another contender is LMDB. Reading the website, it seems to check all the boxes.