semidbm
Support for Google App Engine in Read Only Mode
The Problem
There is often a need on Google App Engine for a simple read-only database or key-value store. This comes up when one wants to store a large dictionary of information that does not need to be updated very often. Examples include timezone data, a list of state or country capitals, annual financial information, or any dataset that only needs to be updated once every few years.
Storing this information in a Python dictionary is an option, but would be memory-inefficient, especially if it is used infrequently. Storing it in the GAE datastore can be complex to maintain. Storing it in the GAE memcache is not acceptable because memcache is transient.
The Solution
A simpler solution is to build the database offline, upload it to GAE, and then access it on GAE in read-only mode. This commit allows a user of semidbm to do just that.
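The intended workflow can be sketched with the stdlib dbm.dumb module, whose open/read interface semidbm mirrors; the filename and sample keys here are purely illustrative:

```python
import dbm.dumb
import os
import tempfile

# Step 1: build the database "offline" (e.g. on a dev machine before deploying).
path = os.path.join(tempfile.mkdtemp(), 'capitals')
db = dbm.dumb.open(path, 'c')   # 'c': create if needed, open read/write
db[b'France'] = b'Paris'
db[b'Japan'] = b'Tokyo'
db.close()

# Step 2: upload the database files, then open them strictly read-only
# (the mode that works inside the GAE sandbox).
ro = dbm.dumb.open(path, 'r')
capital = ro[b'Japan']
ro.close()
```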
How
Google App Engine runs Python in a sandbox and does not support the os-level file descriptor access that semidbm uses. It also does not support opening files for writing, but it does support read-only access. This commit changes those calls to the standard Python open, read, and write functions, which are supported in the GAE sandbox. Further, if the semidbm database is opened in read-only mode, the appropriate flags are used so that it works on GAE.
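The substitution is essentially mechanical: each descriptor-level call becomes its file-object equivalent. A minimal sketch of the two styles (the file contents are arbitrary):

```python
import os
import tempfile

# Create a small file to read back.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'hello world')

# Descriptor-level access (what semidbm used; blocked in the GAE sandbox):
fd = os.open(path, os.O_RDONLY)
low_level = os.read(fd, 5)
os.close(fd)

# File-object access (what this commit switches to; allowed on GAE for reads):
with open(path, 'rb') as f:
    high_level = f.read(5)
```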
I have run the tests on this PR, and they all pass. It looks like I need to do some fixes for Python 3; please let me know if this PR is generally acceptable and I'll clean up the Python 3 issues.
I believe the main reason the lower-level os functions were used was performance. There's a scripts/benchmark script that measures performance for various operations. More info here: http://semidbm.readthedocs.org/en/latest/benchmarks.html
I would not be opposed to this change provided the benchmark numbers from that script don't change significantly.
Also, I'm not really familiar with Google App Engine. You're saying that open(...) works but import os; os.open(...) is not allowed? Could you point me to any documentation about this (I couldn't find any)? I'd like to get a little more background on the scope of GAE's restrictions so I understand the problem better. Thanks.
Yes, I can confirm that open works and os.open does not. I don't know why this is, but there is probably a good reason for it (perhaps it may be possible to escape their sandbox with the os functions). You can see a confirmation of this in the following "app engine playground" example:
The two main changes I made were to remove the os calls and to open the file in read-only mode when that is what the user specified (previously, it would always open in read/write).
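The mode selection can be sketched as follows; open_db_file is a hypothetical helper for illustration, not semidbm's actual code:

```python
import tempfile

def open_db_file(path, read_only):
    """Hypothetical helper showing the mode selection.

    'rb' is accepted by the GAE sandbox; 'r+b' needs write access and is not.
    """
    return open(path, 'rb' if read_only else 'r+b')

# Demonstration on a throwaway file.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'data')

ro = open_db_file(path, read_only=True)
contents = ro.read()
ro.close()
```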
Below is the benchmarking data; semidbm still has good numbers. I did not include dumbdbm because it seems to take forever.
Michael@speedplane-xps ~/Sites/DjangoAppEngineRepoGitHub/semidbm-speedplane/scripts
$ ./benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '../semidbm/__init__.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 10.763, micros/ops: 10.763, ops/s: 92911.184, MB/s: 10.278
fill_sequential : time: 10.722, micros/ops: 10.722, ops/s: 93269.966, MB/s: 10.318
read_cold : time: 1.976, micros/ops: 1.976, ops/s: 505970.477, MB/s: 55.974
read_sequential : time: 2.451, micros/ops: 2.451, ops/s: 407937.178, MB/s: 45.129
read_hot : time: 24.607, micros/ops: 24.607, ops/s: 40639.596, MB/s: 4.496
read_random : time: 24.588, micros/ops: 24.588, ops/s: 40670.486, MB/s: 4.499
delete_sequential : time: 2.778, micros/ops: 2.778, ops/s: 359975.234, MB/s: 39.823
Michael@speedplane-xps ~/Sites/DjangoAppEngineRepoGitHub/semidbm-speedplane/scripts
$ ./benchmark
Generating random data.
('Benchmarking:', <module 'dbhash' from '/usr/lib/python2.7/dbhash.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 36.349, micros/ops: 36.349, ops/s: 27510.787, MB/s: 3.043
fill_sequential : time: 35.391, micros/ops: 35.391, ops/s: 28255.564, MB/s: 3.126
read_cold : time: 18.575, micros/ops: 18.575, ops/s: 53835.801, MB/s: 5.956
read_sequential : time: 18.569, micros/ops: 18.569, ops/s: 53852.752, MB/s: 5.958
read_hot : time: 20.623, micros/ops: 20.623, ops/s: 48488.556, MB/s: 5.364
read_random : time: 20.712, micros/ops: 20.712, ops/s: 48282.082, MB/s: 5.341
delete_sequential : time: 19.752, micros/ops: 19.752, ops/s: 50627.869, MB/s: 5.601
('Benchmarking:', <module 'dbm' from '/usr/lib/python2.7/lib-dynload/dbm.dll'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 39.113, micros/ops: 39.113, ops/s: 25567.095, MB/s: 2.828
fill_sequential : time: 47.933, micros/ops: 47.933, ops/s: 20862.484, MB/s: 2.308
read_cold : time: 34.930, micros/ops: 34.930, ops/s: 28629.016, MB/s: 3.167
read_sequential : time: 34.920, micros/ops: 34.920, ops/s: 28636.910, MB/s: 3.168
read_hot : time: 32.996, micros/ops: 32.996, ops/s: 30306.920, MB/s: 3.353
read_random : time: 32.705, micros/ops: 32.705, ops/s: 30576.786, MB/s: 3.383
delete_sequential : time: 51.866, micros/ops: 51.866, ops/s: 19280.575, MB/s: 2.133
('Benchmarking:', <module 'gdbm' from '/usr/lib/python2.7/lib-dynload/gdbm.dll'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 39.101, micros/ops: 39.101, ops/s: 25575.005, MB/s: 2.829
fill_sequential : time: 47.383, micros/ops: 47.383, ops/s: 21104.561, MB/s: 2.335
read_cold : time: 34.742, micros/ops: 34.742, ops/s: 28783.801, MB/s: 3.184
read_sequential : time: 36.269, micros/ops: 36.269, ops/s: 27572.091, MB/s: 3.050
read_hot : time: 34.382, micros/ops: 34.382, ops/s: 29085.194, MB/s: 3.218
read_random : time: 33.368, micros/ops: 33.368, ops/s: 29968.835, MB/s: 3.315
delete_sequential : time: 52.046, micros/ops: 52.046, ops/s: 19213.650, MB/s: 2.126
Could you post the benchmark for semidbm on your machine against the upstream master branch? It would be helpful to see what the numbers are on your machine before/after the change, but I'm only seeing the semidbm benchmark numbers once (I'm assuming those are the numbers with your changes included?).
Good point. They're below, and it looks like a mixed bag: read_hot and read_random take a major hit, but the other reads and deletes are improved. This could be due to different caching mechanisms behind open and os.open, but I'm not quite sure.
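One way to probe that hypothesis: a buffered file object does read-ahead that random seeks keep discarding, while os.read is a bare syscall with no Python-level buffer to invalidate. A sketch of the two access paths (the record layout is invented; wrapping each function in timeit would show whether buffering is really the culprit):

```python
import os
import tempfile

# A file of 10,000 fixed-size 100-byte records, mimicking the benchmark's value_size.
path = tempfile.mkstemp()[1]
with open(path, 'wb') as f:
    f.write(b'x' * (100 * 10000))

buffered = open(path, 'rb')       # file object: reads go through a read-ahead buffer
fd = os.open(path, os.O_RDONLY)   # raw descriptor: each read is a direct syscall

def read_record_buffered(i):
    # A random seek can discard the file object's read-ahead buffer,
    # which may explain the read_hot/read_random regression.
    buffered.seek(i * 100)
    return buffered.read(100)

def read_record_raw(i):
    os.lseek(fd, i * 100, os.SEEK_SET)
    return os.read(fd, 100)

same = read_record_buffered(42) == read_record_raw(42)
```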
With Changes
$ ./benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '../semidbm/__init__.py'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 11.242, micros/ops: 11.242, ops/s: 88954.788, MB/s: 9.841
fill_sequential : time: 11.277, micros/ops: 11.277, ops/s: 88673.637, MB/s: 9.810
read_cold : time: 1.918, micros/ops: 1.918, ops/s: 521279.371, MB/s: 57.667
read_sequential : time: 1.962, micros/ops: 1.962, ops/s: 509579.312, MB/s: 56.373
read_hot : time: 24.310, micros/ops: 24.310, ops/s: 41135.242, MB/s: 4.551
read_random : time: 25.250, micros/ops: 25.250, ops/s: 39604.638, MB/s: 4.381
delete_sequential : time: 2.885, micros/ops: 2.885, ops/s: 346618.647, MB/s: 38.345
Without changes
Generating random data.
('Benchmarking:', <module 'semidbm' from '../semidbm/__init__.py'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 7.539, micros/ops: 7.539, ops/s: 132641.912, MB/s: 14.674
fill_sequential : time: 7.258, micros/ops: 7.258, ops/s: 137788.247, MB/s: 15.243
read_cold : time: 3.571, micros/ops: 3.571, ops/s: 280070.545, MB/s: 30.983
read_sequential : time: 3.655, micros/ops: 3.655, ops/s: 273628.889, MB/s: 30.271
read_hot : time: 6.448, micros/ops: 6.448, ops/s: 155096.972, MB/s: 17.158
read_random : time: 6.647, micros/ops: 6.647, ops/s: 150450.532, MB/s: 16.644
delete_sequential : time: 5.536, micros/ops: 5.536, ops/s: 180638.452, MB/s: 19.983
Also... I'm running on Windows 10. I would not be surprised if these numbers were different on other systems.
Here's what I get on a macbook:
Before
$ python scripts/benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '/Users/jsaryer/Source/github/semidbm/semidbm/__init__.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 10.070, micros/ops: 10.070, ops/s: 99305.111, MB/s: 10.986
fill_sequential : time: 10.278, micros/ops: 10.278, ops/s: 97292.893, MB/s: 10.763
read_cold : time: 4.332, micros/ops: 4.332, ops/s: 230843.345, MB/s: 25.537
read_sequential : time: 4.249, micros/ops: 4.249, ops/s: 235355.697, MB/s: 26.037
read_hot : time: 4.811, micros/ops: 4.811, ops/s: 207877.737, MB/s: 22.997
read_random : time: 5.043, micros/ops: 5.043, ops/s: 198304.023, MB/s: 21.938
delete_sequential : time: 8.475, micros/ops: 8.475, ops/s: 117994.168, MB/s: 13.053
After
$ python scripts/benchmark -d semidbm
Generating random data.
('Benchmarking:', <module 'semidbm' from '/Users/jsaryer/Source/github/semidbm/semidbm/__init__.pyc'>)
num_keys : 1000000
key_size : 16
value_size: 100
fill_random : time: 17.293, micros/ops: 17.293, ops/s: 57828.084, MB/s: 6.397
fill_sequential : time: 17.148, micros/ops: 17.148, ops/s: 58316.043, MB/s: 6.451
read_cold : time: 3.926, micros/ops: 3.926, ops/s: 254683.879, MB/s: 28.175
read_sequential : time: 3.813, micros/ops: 3.813, ops/s: 262273.826, MB/s: 29.014
read_hot : time: 8.494, micros/ops: 8.494, ops/s: 117735.527, MB/s: 13.025
read_random : time: 7.479, micros/ops: 7.479, ops/s: 133714.742, MB/s: 14.792
delete_sequential : time: 4.208, micros/ops: 4.208, ops/s: 237644.783, MB/s: 26.290
Here are the percentage changes in time per op with this change on my machine (lower is better; negative means the benchmark got faster, not slower):
fill_random 71.6981%
fill_sequential 66.8939%
read_cold -9.4688%
read_sequential -10.1415%
read_hot 76.5073%
read_random 48.2143%
delete_sequential -50.4132%
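For reference, these percentages follow from the micros/ops columns of the two runs; a one-liner reproduces them to within rounding (the published table was presumably computed from unrounded timings, so the last decimals differ slightly):

```python
def pct_change(before, after):
    # Percent change in micros/op; positive means slower after the change.
    return (after - before) / before * 100.0

# Using the macbook read_hot micros/ops from the runs above:
read_hot_change = round(pct_change(4.811, 8.494), 2)  # roughly 76.55%
```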
Given that this significantly regresses performance on 4 of the 7 benchmarks, I don't think this change can be merged as is. If you can get the performance difference down to single-digit percentages, I'd be OK with merging.