semidbm
Support for Google App Engine in Read Only Mode
The Problem
There is often a need on Google App Engine for a simple read-only database or key-value store. This comes up when one wants to store a large dictionary of information that does not need to be updated very often. Examples include timezone data, a list of state or country capitals, annual financial information, or any dataset that only needs to be updated once every few years.
Storing this information in a Python dictionary is an option, but would be memory-inefficient, especially if it is used infrequently. Storing it in the GAE datastore can be complex to maintain. Storing it in the GAE memcache is not acceptable because memcache is transient.
The Solution
A simpler solution is to build the database offline, upload it to GAE, and then access it on GAE in read-only mode. This commit allows a user of semidbm to do just that.
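The intended workflow can be sketched with the stdlib dbm.dumb module, whose open/read interface semidbm mirrors; the filename and sample keys here are purely illustrative:

```python
import dbm.dumb
import os
import tempfile

# Step 1: build the database "offline" (e.g. on a dev machine before deploying).
path = os.path.join(tempfile.mkdtemp(), 'capitals')
db = dbm.dumb.open(path, 'c')   # 'c': create if needed, open read/write
db[b'France'] = b'Paris'
db[b'Japan'] = b'Tokyo'
db.close()

# Step 2: upload the database files, then open them strictly read-only
# (the mode that works inside the GAE sandbox).
ro = dbm.dumb.open(path, 'r')
capital = ro[b'Japan']
ro.close()
```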
How
Google App Engine runs Python in a sandbox and does not support the os-level file descriptor access that semidbm uses. It also does not support opening files for writing, but it does support read-only access. This commit changes those calls to the standard Python open, read, and write functions, which are supported in the GAE sandbox. Further, if the semidbm database is opened in read-only mode, the appropriate flags are used so that it works on GAE.
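The substitution is essentially mechanical: each descriptor-level call becomes its file-object equivalent. A minimal sketch of the two styles (the file contents are arbitrary):

```python
import os
import tempfile

# Create a small file to read back.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'hello world')

# Descriptor-level access (what semidbm used; blocked in the GAE sandbox):
fd = os.open(path, os.O_RDONLY)
low_level = os.read(fd, 5)
os.close(fd)

# File-object access (what this commit switches to; allowed on GAE for reads):
with open(path, 'rb') as f:
    high_level = f.read(5)
```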
I have run the tests on this PR, and they all pass. It looks like I need to do some fixes for Python 3; please let me know if this PR is generally acceptable and I'll clean up the Python 3 issues.
I believe the main reason the lower-level os functions were used was performance. There's a scripts/benchmark script that measures performance for various operations. More info here: http://semidbm.readthedocs.org/en/latest/benchmarks.html
I would not be opposed to this change provided the benchmark numbers from that script don't change significantly.
Also, I'm not really familiar with Google App Engine. You're saying that open(...) works but import os; os.open(...) is not allowed? Could you point me to any documentation about this (I couldn't find any)? I'd like to get a little more background on the scope of GAE's restrictions so I understand the problem better. Thanks.
Yes, I can confirm that open works and os.open does not. I don't know why this is, but there is probably a good reason for it (perhaps it may be possible to escape their sandbox with the os functions). You can see a confirmation of this in the following "app engine playground" example:
The two main changes I made were to remove the os calls and to open the file in read-only mode when that is what the user specified (previously, it would always open in read/write).
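The mode selection can be sketched as follows; open_db_file is a hypothetical helper for illustration, not semidbm's actual code:

```python
import tempfile

def open_db_file(path, read_only):
    """Hypothetical helper showing the mode selection.

    'rb' is accepted by the GAE sandbox; 'r+b' needs write access and is not.
    """
    return open(path, 'rb' if read_only else 'r+b')

# Demonstration on a throwaway file.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'data')

ro = open_db_file(path, read_only=True)
contents = ro.read()
ro.close()
```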
Below is the benchmarking data; semidbm still has good numbers. I did not include dumbdbm because it seems to take forever.
Michael@speedplane-xps ~/Sites/DjangoAppEngineRepoGitHub/semidbm-speedplane/scripts
$ ./benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '../semidbm/__init__.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 10.763, micros/ops: 10.763, ops/s: 92911.184, MB/s: 10.278
fill_sequential : time: 10.722, micros/ops: 10.722, ops/s: 93269.966, MB/s: 10.318
read_cold : time: 1.976, micros/ops: 1.976, ops/s: 505970.477, MB/s: 55.974
read_sequential : time: 2.451, micros/ops: 2.451, ops/s: 407937.178, MB/s: 45.129
read_hot : time: 24.607, micros/ops: 24.607, ops/s: 40639.596, MB/s: 4.496
read_random : time: 24.588, micros/ops: 24.588, ops/s: 40670.486, MB/s: 4.499
delete_sequential : time: 2.778, micros/ops: 2.778, ops/s: 359975.234, MB/s: 39.823
Michael@speedplane-xps ~/Sites/DjangoAppEngineRepoGitHub/semidbm-speedplane/scripts
$ ./benchmark
Generating random data.
('Benchmarking:', <module 'dbhash' from '/usr/lib/python2.7/dbhash.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 36.349, micros/ops: 36.349, ops/s: 27510.787, MB/s: 3.043
fill_sequential : time: 35.391, micros/ops: 35.391, ops/s: 28255.564, MB/s: 3.126
read_cold : time: 18.575, micros/ops: 18.575, ops/s: 53835.801, MB/s: 5.956
read_sequential : time: 18.569, micros/ops: 18.569, ops/s: 53852.752, MB/s: 5.958
read_hot : time: 20.623, micros/ops: 20.623, ops/s: 48488.556, MB/s: 5.364
read_random : time: 20.712, micros/ops: 20.712, ops/s: 48282.082, MB/s: 5.341
delete_sequential : time: 19.752, micros/ops: 19.752, ops/s: 50627.869, MB/s: 5.601
('Benchmarking:', <module 'dbm' from '/usr/lib/python2.7/lib-dynload/dbm.dll'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 39.113, micros/ops: 39.113, ops/s: 25567.095, MB/s: 2.828
fill_sequential : time: 47.933, micros/ops: 47.933, ops/s: 20862.484, MB/s: 2.308
read_cold : time: 34.930, micros/ops: 34.930, ops/s: 28629.016, MB/s: 3.167
read_sequential : time: 34.920, micros/ops: 34.920, ops/s: 28636.910, MB/s: 3.168
read_hot : time: 32.996, micros/ops: 32.996, ops/s: 30306.920, MB/s: 3.353
read_random : time: 32.705, micros/ops: 32.705, ops/s: 30576.786, MB/s: 3.383
delete_sequential : time: 51.866, micros/ops: 51.866, ops/s: 19280.575, MB/s: 2.133
('Benchmarking:', <module 'gdbm' from '/usr/lib/python2.7/lib-dynload/gdbm.dll'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 39.101, micros/ops: 39.101, ops/s: 25575.005, MB/s: 2.829
fill_sequential : time: 47.383, micros/ops: 47.383, ops/s: 21104.561, MB/s: 2.335
read_cold : time: 34.742, micros/ops: 34.742, ops/s: 28783.801, MB/s: 3.184
read_sequential : time: 36.269, micros/ops: 36.269, ops/s: 27572.091, MB/s: 3.050
read_hot : time: 34.382, micros/ops: 34.382, ops/s: 29085.194, MB/s: 3.218
read_random : time: 33.368, micros/ops: 33.368, ops/s: 29968.835, MB/s: 3.315
delete_sequential : time: 52.046, micros/ops: 52.046, ops/s: 19213.650, MB/s: 2.126
Could you post the benchmark for semidbm on your machine against the upstream master branch? It would be helpful to see what the numbers are on your machine before/after the change, but I'm only seeing the semidbm benchmark numbers once (I'm assuming those are the numbers with your changes included?).
Good point. They're below, and it looks like a mixed bag: read_hot and read_random take a major hit, but the other reads and deletes are improved. This could be due to different caching mechanisms behind open and os.open, but I'm not quite sure.
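One way to probe that hypothesis: a buffered file object does read-ahead that random seeks keep discarding, while os.read is a bare syscall with no Python-level buffer to invalidate. A sketch of the two access paths (the record layout is invented; wrapping each function in timeit would show whether buffering is really the culprit):

```python
import os
import tempfile

# A file of 10,000 fixed-size 100-byte records, mimicking the benchmark's value_size.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'x' * (100 * 10000))

buffered = open(path, 'rb')       # file object: reads go through a read-ahead buffer
fd = os.open(path, os.O_RDONLY)   # raw descriptor: each read is a direct syscall

def read_record_buffered(i):
    # A random seek can discard the file object's read-ahead buffer,
    # which may explain the read_hot/read_random regression.
    buffered.seek(i * 100)
    return buffered.read(100)

def read_record_raw(i):
    os.lseek(fd, i * 100, os.SEEK_SET)
    return os.read(fd, 100)

same = read_record_buffered(42) == read_record_raw(42)
```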
With Changes
$ ./benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '../semidbm/__init__.py'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 11.242, micros/ops: 11.242, ops/s: 88954.788, MB/s: 9.841
fill_sequential : time: 11.277, micros/ops: 11.277, ops/s: 88673.637, MB/s: 9.810
read_cold : time: 1.918, micros/ops: 1.918, ops/s: 521279.371, MB/s: 57.667
read_sequential : time: 1.962, micros/ops: 1.962, ops/s: 509579.312, MB/s: 56.373
read_hot : time: 24.310, micros/ops: 24.310, ops/s: 41135.242, MB/s: 4.551
read_random : time: 25.250, micros/ops: 25.250, ops/s: 39604.638, MB/s: 4.381
delete_sequential : time: 2.885, micros/ops: 2.885, ops/s: 346618.647, MB/s: 38.345
Without changes
Generating random data.
('Benchmarking:', <module 'semidbm' from '../semidbm/__init__.py'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 7.539, micros/ops: 7.539, ops/s: 132641.912, MB/s: 14.674
fill_sequential : time: 7.258, micros/ops: 7.258, ops/s: 137788.247, MB/s: 15.243
read_cold : time: 3.571, micros/ops: 3.571, ops/s: 280070.545, MB/s: 30.983
read_sequential : time: 3.655, micros/ops: 3.655, ops/s: 273628.889, MB/s: 30.271
read_hot : time: 6.448, micros/ops: 6.448, ops/s: 155096.972, MB/s: 17.158
read_random : time: 6.647, micros/ops: 6.647, ops/s: 150450.532, MB/s: 16.644
delete_sequential : time: 5.536, micros/ops: 5.536, ops/s: 180638.452, MB/s: 19.983
Also... I'm running on Windows 10. I would not be surprised if these numbers were different on other systems.
Here's what I get on a macbook:
Before
$ python scripts/benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '/Users/jsaryer/Source/github/semidbm/semidbm/__init__.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 10.070, micros/ops: 10.070, ops/s: 99305.111, MB/s: 10.986
fill_sequential : time: 10.278, micros/ops: 10.278, ops/s: 97292.893, MB/s: 10.763
read_cold : time: 4.332, micros/ops: 4.332, ops/s: 230843.345, MB/s: 25.537
read_sequential : time: 4.249, micros/ops: 4.249, ops/s: 235355.697, MB/s: 26.037
read_hot : time: 4.811, micros/ops: 4.811, ops/s: 207877.737, MB/s: 22.997
read_random : time: 5.043, micros/ops: 5.043, ops/s: 198304.023, MB/s: 21.938
delete_sequential : time: 8.475, micros/ops: 8.475, ops/s: 117994.168, MB/s: 13.053
After
$ python scripts/benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '/Users/jsaryer/Source/github/semidbm/semidbm/__init__.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 17.293, micros/ops: 17.293, ops/s: 57828.084, MB/s: 6.397
fill_sequential : time: 17.148, micros/ops: 17.148, ops/s: 58316.043, MB/s: 6.451
read_cold : time: 3.926, micros/ops: 3.926, ops/s: 254683.879, MB/s: 28.175
read_sequential : time: 3.813, micros/ops: 3.813, ops/s: 262273.826, MB/s: 29.014
read_hot : time: 8.494, micros/ops: 8.494, ops/s: 117735.527, MB/s: 13.025
read_random : time: 7.479, micros/ops: 7.479, ops/s: 133714.742, MB/s: 14.792
delete_sequential : time: 4.208, micros/ops: 4.208, ops/s: 237644.783, MB/s: 26.290
Here are the percentage changes in time per op with this change on my machine (lower is better; negative means the benchmark got faster, not slower):
fill_random 71.6981%
fill_sequential 66.8939%
read_cold -9.4688%
read_sequential -10.1415%
read_hot 76.5073%
read_random 48.2143%
delete_sequential -50.4132%
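For reference, these percentages follow from the micros/ops columns of the two runs; a one-liner reproduces them to within rounding (the published table was presumably computed from unrounded timings, so the last decimals differ slightly):

```python
def pct_change(before, after):
    # Percent change in micros/op; positive means slower after the change.
    return (after - before) / before * 100.0

# Using the macbook read_hot micros/ops from the runs above:
read_hot_change = round(pct_change(4.811, 8.494), 2)  # roughly 76.55%
```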
Given that this significantly regresses performance on 4 of the 7 benchmarks, I don't think this change can be merged as is. If you can get the performance difference down to single-digit percentages, I'd be OK with merging.