
replace gdbm by cdb

Open DanGrayson opened this issue 8 years ago • 16 comments

cdb is a faster database library than gdbm, which we currently use, so we should replace gdbm with cdb. See the benchmarks at https://gist.github.com/epitron/1625d93d0b82c32e7395

DanGrayson avatar Apr 24 '17 23:04 DanGrayson

Other good contenders:

  • https://capnproto.org
  • https://msgpack.org
  • Protocol Buffers (protobuf)

They all integrate with almost any language, which would allow using them for other purposes, like pickling math objects and unpickling them in the browser or in other math software.

mahrud avatar May 30 '20 23:05 mahrud

It would also be nice if the database files were architecture-independent. Then, in the Debian package, we could move them over from the architecture-dependent macaulay2 package to the architecture-independent macaulay2-common. We could just build the latter on amd64, and not have to worry about generating all of the examples on every single architecture.

d-torrance avatar Dec 03 '20 20:12 d-torrance

> It would also be nice if the database files were architecture-independent. Then, in the Debian package, we could move them over from the architecture-dependent macaulay2 package to the architecture-independent macaulay2-common. We could just build the latter on amd64, and not have to worry about generating all of the examples on every single architecture.

I don't think they're really architecture dependent in that sense. Only the endianness is different. But yes, we might as well move the common package to Macaulay2-doc.

mahrud avatar Dec 03 '20 20:12 mahrud

Currently I'm weighing pros and cons of the following two:

  • https://dbmx.net/tkrzw/
  • https://github.com/google/leveldb

Both are modern database managers with multithreaded support and process-locking, plus many more niceties.

The three suggestions I gave earlier in this thread are still good add-ons to M2 for serializing the hypertext objects before storing them (rather than using toExternalString), so I'm wondering which combination is easiest to implement.

mahrud avatar Dec 05 '20 22:12 mahrud

> > It would also be nice if the database files were architecture-independent. Then, in the Debian package, we could move them over from the architecture-dependent macaulay2 package to the architecture-independent macaulay2-common. We could just build the latter on amd64, and not have to worry about generating all of the examples on every single architecture.
>
> I don't think they're really architecture dependent in that sense. Only the endianness is different. But yes, we might as well move the common package to Macaulay2-doc.

If the endianness causes the files to differ, then they are architecture dependent and can't be moved over.
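
To make the endianness point concrete, here is a minimal Python sketch (not Macaulay2 or gdbm code). It shows the same 32-bit integer producing different bytes under little-endian and big-endian layouts, and how a format that fixes one byte order stays readable everywhere. The helper names `encode_portable` and `decode_portable` are hypothetical.

```python
import struct

# A 32-bit value whose bytes spell "abcd" when written most-significant first.
value = 0x61626364

# The same integer laid out in the two byte orders.
little_endian_bytes = struct.pack("<I", value)  # layout used by e.g. amd64
big_endian_bytes = struct.pack(">I", value)     # layout used by e.g. s390x


def encode_portable(n: int) -> bytes:
    """Write a 32-bit unsigned integer in a fixed (little-endian) order."""
    return struct.pack("<I", n)


def decode_portable(raw: bytes) -> int:
    """Read a 32-bit unsigned integer written by encode_portable."""
    return struct.unpack("<I", raw)[0]


print(little_endian_bytes, big_endian_bytes)
print(decode_portable(encode_portable(value)) == value)
```

A file format that dumps integers in the host's native order yields files that differ between the two kinds of machine; a format that always uses the explicit `<` (or `>`) order does not.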

DanGrayson avatar Dec 05 '20 23:12 DanGrayson

Looks like both Tkrzw and LevelDB are architecture independent: I asked their authors specifically about word size and endian-ness in https://github.com/estraier/tkrzw/issues/3#issuecomment-739469365 and https://github.com/google/leveldb/issues/856#issuecomment-739710458

mahrud avatar Dec 07 '20 07:12 mahrud

leveldb is already in Debian (https://tracker.debian.org/pkg/leveldb), so that would be my preference. :) I also did a couple quick experiments and confirmed that if I create a leveldb database on my amd64 machine, I can read it in an i386 chroot without problems.

d-torrance avatar Dec 08 '20 01:12 d-torrance

Excellent! LevelDB is definitely more established, but tkrzw is the successor (released Oct 2020!) to Tokyo Cabinet and Kyoto Cabinet, both of which are in Debian, so I presume it'll land there soon. Compiling it is really quick, so it wouldn't slow things down, and it has the fewest prerequisites. Also, having both a HashDBM and TreeDBM seems like an attractive feature for our purposes.

That said, I like both of their APIs, both based on C++17 and POSIX.

mahrud avatar Dec 08 '20 01:12 mahrud

FYI, I've been working on getting Macaulay2 to work with leveldb and can confirm that endianness doesn't appear to matter. (I was a little concerned about the "pretty sure" in https://github.com/google/leveldb/issues/856#issuecomment-739710458 lol!) After copying the raw documentation database for FirstPackage from my amd64 machine to a Debian s390x porterbox:

i1 : x = openDatabase "~/rawdocumentation-leveldb.db"

o1 = /home/dtorrance/rawdocumentation-leveldb.db

o1 : Database

i2 : x#"FirstPackage"

o2 = new HashTable from {Headline => "an example Macaulay2 package",
     "linenum" => 53, "filename" =>
     "/home/profzoom/src/macaulay2/M2/M2/Macaulay2/packages/FirstPackage.m2",
     Description => 1:(DIV{PARA{TEX{"",EM{"FirstPackage"}," is a basic
     package to be used as an example."}}}), Key => FirstPackage, symbol
     DocumentTag => new DocumentTag from
     {"FirstPackage","FirstPackage","FirstPackage"}, Caveat =>
     DIV{HEADER2{"Caveat"},DIV{PARA{TEX{"Still trying to figure this
     out."}}}}, Subnodes => MENU{TO{new DocumentTag from
     {"firstFunction","firstFunction","FirstPackage"}}}}

i3 : x#"firstFunction"

o3 = new HashTable from {"linenum" => 53, symbol DocumentTag => new
     DocumentTag from {"firstFunction","firstFunction","FirstPackage"},
     PrimaryTag => new DocumentTag from
     {(firstFunction,ZZ),"firstFunction(ZZ)","FirstPackage"}, "filename" =>
     "/home/profzoom/src/macaulay2/M2/M2/Macaulay2/packages/FirstPackage.m2"}

i4 : version#"machine"

o4 = s390x-Linux-Debian-11

i5 : version#"endianness"

o5 = abcd

d-torrance avatar Jul 14 '21 13:07 d-torrance

Good news.

DanGrayson avatar Jul 14 '21 14:07 DanGrayson

D'oh, I should have done a little more research before I started coding -- leveldb doesn't support multiple processes. My current draft works just fine as long as only one M2 process is running, but as soon as you fire up a second one (e.g., when generating examples...):

Macaulay2, version 1.18.0.1
../../Macaulay2/m2/packages.m2:348:40:(1):[36]: error: IO error: lock /home/profzoom/src/macaulay2/M2/M2/BUILD/doug/usr-dist/common/share/doc/Macaulay2/Parsing/cache/rawdocumentation-leveldb/LOCK: Resource temporarily unavailable : /home/profzoom/src/macaulay2/M2/M2/BUILD/doug/usr-dist/common/share/doc/Macaulay2/Parsing/cache/rawdocumentation-leveldb

I don't think there's any way around this -- from their documentation:

> A database may only be opened by one process at a time.

d-torrance avatar Jul 15 '21 03:07 d-torrance

Oops, too bad.

DanGrayson avatar Jul 15 '21 11:07 DanGrayson

Doesn't leveldb support waiting for the database to be unlocked? Tkrzw allows multiple readers:

> Tokyo Cabinet provides two modes to connect to a database: "reader" and "writer". A reader can perform retrieving but neither storing nor deleting. A writer can perform all access methods. Exclusion control between processes is performed when connecting to a database by file locking. While a writer is connected to a database, neither readers nor writers can be connected. While a reader is connected to a database, other readers can be connect, but writers can not. According to this mechanism, data consistency is guaranteed with simultaneous connections in multitasking environment.
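
The reader/writer exclusion described in that passage can be sketched with advisory file locks. This is a Python illustration that uses `fcntl.flock` on Linux as a stand-in; it is an assumption that this resembles what the libraries do, and the real locking inside Tokyo Cabinet, Tkrzw or LevelDB may differ in detail. The `connect` helper is hypothetical.

```python
import fcntl
import os
import tempfile


def connect(path, writer=False):
    """Open path and take a non-blocking advisory lock.

    Readers take a shared lock, writers an exclusive one. Returns the file
    descriptor on success, or None if a conflicting lock is already held.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    mode = fcntl.LOCK_EX if writer else fcntl.LOCK_SH
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: "Resource temporarily unavailable"
        os.close(fd)
        return None
    return fd


db_file = os.path.join(tempfile.mkdtemp(), "example.db")

# Two readers can hold the shared lock at the same time ...
reader_a = connect(db_file)
reader_b = connect(db_file)
# ... but a writer is refused while any reader is connected.
blocked_writer = connect(db_file, writer=True)

# Closing a descriptor releases its lock.
os.close(reader_a)
os.close(reader_b)

# With no readers left, the writer gets in and now excludes readers.
writer = connect(db_file, writer=True)
blocked_reader = connect(db_file)
os.close(writer)
```

A library that only ever takes the exclusive lock, even for reading, behaves like the `writer=True` case for every process, which matches the LevelDB failure reported above.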

mahrud avatar Jul 20 '21 17:07 mahrud

> Doesn't leveldb support waiting for the database to be unlocked?

The current behavior is to open the raw documentation database when a package is loaded, so the databases for each of the packages that are loaded by default would always be locked by the first M2 process.

> Tkrzw allows multiple readers:
>
> > Tokyo Cabinet provides two modes to connect to a database: "reader" and "writer". A reader can perform retrieving but neither storing nor deleting. A writer can perform all access methods. Exclusion control between processes is performed when connecting to a database by file locking. While a writer is connected to a database, neither readers nor writers can be connected. While a reader is connected to a database, other readers can be connect, but writers can not. According to this mechanism, data consistency is guaranteed with simultaneous connections in multitasking environment.

It looks like that behavior was changed when Tokyo Cabinet became Tkrzw. From https://dbmx.net/tkrzw/#tips_multitasking:

> However, they are not designed to be shared among multiple processes. When one process opens a database file as a writer, an exclusive lock is applied to the file so that other processes trying to open the same database file are blocked until the database is closed.

That page does suggest a workaround:

> A workaround way to share the same database among multiple processes is that each operation to access the database is done between opening the database and closing the database.

So I suppose one solution would be to only open the raw documentation database when a user actually runs help, and then close it right afterward.
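
A rough sketch of that open-per-operation pattern, written in Python with the standard-library `dbm.dumb` module as a stand-in for the real backend. `dbm.dumb` does no locking of its own; the point here is only the shape of the access pattern. The `store` and `lookup` helpers and the file name are hypothetical.

```python
import dbm.dumb
import os
import tempfile


def store(path, key, value):
    """Open the database, write one entry, and close it again."""
    with dbm.dumb.open(path, "c") as db:
        db[key] = value


def lookup(path, key):
    """Open the database read-only, fetch one entry, and close it again.

    Returns the stored bytes, or None if the key is absent.
    """
    with dbm.dumb.open(path, "r") as db:
        return db.get(key)


db_path = os.path.join(tempfile.mkdtemp(), "rawdocumentation")

store(db_path, "first", "hi there")
result = lookup(db_path, "first")
missing = lookup(db_path, "no such key")
print(result, missing)
```

Because the file is held open only for the duration of each call, a lock taken at open time would be released almost immediately, at the cost of reopening the database on every lookup.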


By the way, I also played around with using TinyCDB. Its file format is architecture-independent and it supports multiple processes, but using it would change the behavior of how Macaulay2 Database objects work. In particular, reading from a database while you're currently creating it isn't possible. So for example, this code from examples Database would fail:

     filename = temporaryFileName () | ".dbm"
     x = openDatabaseOut filename
     x#"first" = "hi there"
     x#"first"

d-torrance avatar Jul 21 '21 12:07 d-torrance

> It looks like that behavior was changed when Tokyo Cabinet became Tkrzw.

Alright, I believe Tokyo Cabinet is already packaged for Debian and Homebrew.

> The current behavior is to open the raw documentation database when a package is loaded, so the databases for each of the packages that are loaded by default would always be locked by the first M2 process.

Tokyo Cabinet uses mmap for loading the buckets into memory, so it should be pretty cheap to open, read, and close a database whenever needed.

Also, we should use a single database file in each application directory; see #1643.

mahrud avatar Jul 21 '21 14:07 mahrud

Another contender is LMDB. Judging from its website, it seems to check all the boxes.

mahrud avatar Feb 22 '22 01:02 mahrud