tiedot
Minimal setup takes ridiculous amounts of disk space
I created the most minimal database to test tiedot a bit.
I created one collection, test, and inserted one document into it:
{"test": "yep"}
I checked the filesystem, and there's 768MB of files created by tiedot.
I can understand some preallocation, but having the minimal setup take 3/4 of a GIGABYTE of disk space is quite excessive.
I've pasted some text from a wiki page I wrote when I was evaluating tiedot for a project. The line numbers are no longer correct, but it looks like all the settings are still the same. Hope that helps!
Tiedot configuration
Tiedot pre-allocates files for all of the data structures it uses, and grows them when necessary. The default config creates a 32MB file for the data, and then one 32MB file per index (there's an id index by default, plus whatever indices we would create). That's a lot of wasted space for our constrained devices. Fortunately, the author has defined constants for all of the settings, and by choosing different initial values it's easy to start with a small disk footprint that can grow as needed. The settings below assume that there's not a lot of flash space available for storing reports, and as such no more than a few thousand reports will be stored at any given time.
There are two settings in data/collection.go that control the size of the data file and maximum document size (lines 16-17):
COL_FILE_GROWTH = 32 * 1048576 // Collection file initial size & size growth (32 MBytes)
DOC_MAX_ROOM = 2 * 1048576 // Max document size (2 MBytes)
I changed these values to 4MB and 1MB respectively, which causes tiedot to pre-allocate a 4MB data file initially, and then grow that in 4MB increments.
There are several settings in data/hashtable.go that control the behavior of the on-disk hashtable implementation (lines 16-22):
HT_FILE_GROWTH = 32 * 1048576 // Hash table file initial size & file growth
ENTRY_SIZE = 1 + 8 + 8 // Hash entry size: validity (single byte), key (uint64), value (uint64)
BUCKET_HEADER = 8 // Bucket header size: next chained bucket number (int 10 bytes)
PER_BUCKET = 16 // Entries per bucket
HASH_BITS = 16 // Number of hash key bits
BUCKET_SIZE = BUCKET_HEADER + PER_BUCKET*ENTRY_SIZE // Size of a bucket
INITIAL_BUCKETS = uint64(65536) // Initial number of buckets == 2 ^ HASH_BITS
I changed the following settings:
HT_FILE_GROWTH = 1 * 1048576 // Hash table file initial size & file growth
HASH_BITS = 11 // Number of hash key bits
INITIAL_BUCKETS = uint64(2048) // Initial number of buckets == 2 ^ HASH_BITS
With those initial settings, tiedot initially allocates a 1MB file per index.
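A quick sanity check on why HASH_BITS and INITIAL_BUCKETS were lowered together with HT_FILE_GROWTH. This is only a sketch using the constants quoted above; it assumes that the initial buckets have to fit inside the first pre-allocated hash table file, which the excerpt does not state explicitly. The lower-case names are my own and not taken from the tiedot source.

```go
package main

import "fmt"

// Values copied from the data/hashtable.go excerpt quoted above.
const (
	entrySize    = 1 + 8 + 8 // validity byte + key + value
	bucketHeader = 8
	perBucket    = 16
	bucketSize   = bucketHeader + perBucket*entrySize
)

// initialTableBytes returns the space taken by 2^hashBits buckets.
func initialTableBytes(hashBits uint) int {
	return (int(1) << hashBits) * bucketSize
}

func main() {
	const mb = 1048576
	fmt.Println(bucketSize)                   // 280 bytes per bucket
	fmt.Println(initialTableBytes(16), 32*mb) // 18350080 33554432
	fmt.Println(initialTableBytes(11), 1*mb)  // 573440 1048576
}
```

With 16 hash bits the initial buckets need about 17.5MB, so they would not fit into a 1MB file. Dropping to 11 bits brings that down to 560KB, which does fit.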
Hyvää huomenta (good morning)!
I entirely agree with you, the initial data file size can be quite large, and you probably have a 6-core Xeon E3 or Intel i7 extreme edition.
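One possible reading of the hardware remark, sketched as arithmetic. It assumes tiedot splits each collection into one partition per logical CPU by default, and that every partition pre-allocates its own data file and index files. Neither point is stated in this thread, so treat both as assumptions. The function name and parameters are made up for illustration.

```go
package main

import "fmt"

// footprintMB estimates the pre-allocated size of one collection in MB.
// Assumption: each partition holds one data file plus one file per index.
func footprintMB(partitions, dataMB, indexMB, indexes int) int {
	return partitions * (dataMB + indexes*indexMB)
}

func main() {
	// 12 partitions (e.g. 6 cores with hyper-threading), default 32MB files, id index only.
	fmt.Println(footprintMB(12, 32, 32, 1)) // 768
	// Same machine with the 4MB data / 1MB index settings from the wiki excerpt.
	fmt.Println(footprintMB(12, 4, 1, 1)) // 60
	// Two partitions with the same reduced settings.
	fmt.Println(footprintMB(2, 4, 1, 1)) // 10
}
```

Under those assumptions the reported 768MB matches a machine with 12 hardware threads, and both the file growth constants and the partition count are levers for reducing it.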
Those numbers were written down as constants because altering them after a collection has been created is quite a challenging task. If tiedot used a tree or skip list data structure for indexes, those huge numbers could be avoided. And that reminds me of the famous quote "today's constant is tomorrow's variable".
tiedot has seen very infrequent updates in recent months, so it may be viable to maintain a fork with tweaked constants. I hope that helps.
Is there any reason these couldn't be configured using environment variables? If you wanted me to make a pull request I'm sure I could get around to it in the next couple of days.
It was an incorrect decision to make them constants in the beginning.
How about this: write the collection and hashtable parameters into a JSON or text file underneath the database directory. If the file exists, the parameters from the file will be used to operate on collections; if it does not exist, the default values (the current constants) will be used instead, and the file will be created with the default values written into it.
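A rough sketch of what that load-or-create step could look like, just to make the proposal concrete. The file name params.json, the Params struct, and its field names are all hypothetical and not part of tiedot. The default values are the constants quoted earlier in this thread.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// Params mirrors a few of the current constants. Field names are hypothetical.
type Params struct {
	ColFileGrowth int `json:"col_file_growth"`
	DocMaxRoom    int `json:"doc_max_room"`
	HTFileGrowth  int `json:"ht_file_growth"`
	HashBits      int `json:"hash_bits"`
}

// defaultParams returns the values of the constants quoted in this thread.
func defaultParams() Params {
	return Params{
		ColFileGrowth: 32 * 1048576,
		DocMaxRoom:    2 * 1048576,
		HTFileGrowth:  32 * 1048576,
		HashBits:      16,
	}
}

// loadOrCreateParams reads <dbDir>/params.json if it exists. Otherwise it
// writes the defaults to that file and returns them.
func loadOrCreateParams(dbDir string) (Params, error) {
	path := filepath.Join(dbDir, "params.json") // hypothetical file name
	raw, err := os.ReadFile(path)
	if err == nil {
		var p Params
		if err := json.Unmarshal(raw, &p); err != nil {
			return Params{}, fmt.Errorf("parse %s: %w", path, err)
		}
		return p, nil
	}
	if !errors.Is(err, os.ErrNotExist) {
		return Params{}, err
	}
	// File is missing: fall back to the defaults and persist them.
	p := defaultParams()
	raw, err = json.MarshalIndent(p, "", "  ")
	if err != nil {
		return Params{}, err
	}
	if err := os.MkdirAll(dbDir, 0700); err != nil {
		return Params{}, err
	}
	return p, os.WriteFile(path, raw, 0600)
}

func main() {
	dir, err := os.MkdirTemp("", "tiedot-params")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	first, err := loadOrCreateParams(dir) // file absent: defaults are written
	if err != nil {
		panic(err)
	}
	second, err := loadOrCreateParams(dir) // file present: values are read back
	if err != nil {
		panic(err)
	}
	fmt.Println(first == second)
}
```

Since the parameters are hard to change once a collection exists, persisting them on first use means later runs keep operating with the values the files were created with, even if the built-in defaults change.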
Sounds like a plan, I will start working on a pr
Wonderful to hear! Many thanks for your help.
#157 Added this, let me know what you think.
import (
	"os"
	"path/filepath"
	"strconv"

	"github.com/HouzuoGuo/tiedot/db"
)

const (
	NUM_PARTS = 2 // Partition count to write before the database is opened
)

// writeDBConfig pre-writes the partition count file (db.PART_NUM_FILE)
// to prevent excessive hard disk pre-allocation.
// The database directory must already exist.
func writeDBConfig(dbname string) error {
	numFile := filepath.Join("./db_base_path", dbname, db.PART_NUM_FILE)
	// Ignore if the file already exists
	if _, err := os.Stat(numFile); err == nil {
		return nil
	}
	return os.WriteFile(numFile, []byte(strconv.Itoa(NUM_PARTS)), 0600)
}