tiedot Minimal setup takes ridiculous amounts of disk space

I created the most minimal database to test tiedot a bit.

I created one collection, test, and inserted one document into it:

{"test": "yep"}

I checked the filesystem, and there's 768MB of files created by tiedot.

I can understand some preallocation, but having the minimal setup take 3/4 of a GIGABYTE of disk space is quite excessive.

Feb 05 '17 21:02 lietu

I've pasted some text from a wiki page I wrote when I was evaluating tiedot for a project. The line numbers are no longer correct, but it looks like all the settings are still the same. Hope that helps!

Tiedot configuration

Tiedot pre-allocates files for all of the data structures it uses, and grows them when necessary. The default config creates a 32MB file for the data, and then one 32MB file per index (there's an id index by default, plus whatever indices we would create). That's a lot of wasted space for our constrained devices. Fortunately, the author has defined constants for all of the settings, and by choosing different initial values it's easy to start with a small disk footprint that can grow as needed. The settings below assume that there's not a lot of flash space available for storing reports, and as such no more than a few thousand reports will be stored at any given time.

There are two settings in data/collection.go that control the size of the data file and maximum document size (lines 16-17):
	COL_FILE_GROWTH = 32 * 1048576 // Collection file initial size & size growth (32 MBytes)
	DOC_MAX_ROOM    = 2 * 1048576  // Max document size (2 MBytes)
I changed these values to 4MB and 1MB respectively, which causes tiedot to pre-allocate a 4MB data file initially, and then grow that in 4MB increments.

There are several settings in data/hashtable.go that control the behavior of the on-disk hashtable implementation (lines 16-22):
	HT_FILE_GROWTH  = 32 * 1048576                          // Hash table file initial size & file growth
	ENTRY_SIZE      = 1 + 8 + 8                             // Hash entry size: validity (single byte), key (uint64), value (uint64)
	BUCKET_HEADER   = 8                                     // Bucket header size: next chained bucket number (uint64)
	PER_BUCKET      = 16                                    // Entries per bucket
	HASH_BITS       = 16                                    // Number of hash key bits
	BUCKET_SIZE     = BUCKET_HEADER + PER_BUCKET*ENTRY_SIZE // Size of a bucket
	INITIAL_BUCKETS = uint64(65536)                         // Initial number of buckets == 2 ^ HASH_BITS
I changed the following settings:
	HT_FILE_GROWTH  = 1 * 1048576  // Hash table file initial size & file growth
	HASH_BITS       = 11           // Number of hash key bits
	INITIAL_BUCKETS = uint64(2048) // Initial number of buckets == 2 ^ HASH_BITS
With those initial settings, tiedot initially allocates a 1MB file per index.
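A quick check of why HASH_BITS shrinks together with HT_FILE_GROWTH. This is a standalone sketch that copies the constants quoted above rather than importing tiedot; the helper name initialBucketBytes is made up for illustration, and it assumes the initial buckets are meant to fit inside the first file allocation.

```go
package main

import "fmt"

const (
	entrySize    = 1 + 8 + 8                          // validity byte + uint64 key + uint64 value
	bucketHeader = 8                                  // next chained bucket number
	perBucket    = 16                                 // entries per bucket
	bucketSize   = bucketHeader + perBucket*entrySize // 280 bytes
)

// initialBucketBytes returns the space taken by 2^hashBits buckets.
// Hypothetical helper, not part of tiedot.
func initialBucketBytes(hashBits uint) uint64 {
	return (uint64(1) << hashBits) * bucketSize
}

func main() {
	const mb = 1048576
	cases := []struct {
		hashBits   uint
		fileGrowth uint64
	}{
		{16, 32 * mb}, // defaults
		{16, 1 * mb},  // smaller file, hash bits left alone
		{11, 1 * mb},  // both settings changed
	}
	for _, c := range cases {
		need := initialBucketBytes(c.hashBits)
		fmt.Printf("HASH_BITS=%d needs %d bytes, file has %d bytes, fits=%v\n",
			c.hashBits, need, c.fileGrowth, need <= c.fileGrowth)
	}
}
```

With 16 hash bits the initial buckets need about 17.5MB, so they would not fit in a 1MB file; with 11 bits they need 573,440 bytes, which does.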

Feb 05 '17 21:02 mmindenhall

Good morning!

I entirely agree with you, the initial data file size can be quite large, and you probably have a 6-core Xeon E3 or Intel i7 extreme edition.
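The arithmetic behind that guess, as a rough sketch. It assumes tiedot creates one partition per logical CPU (12 on a 6-core CPU with hyper-threading) and that each partition of a collection holds one data file plus one file per index, using the sizes quoted earlier in the thread. footprintMB is an illustrative helper, not a tiedot function.

```go
package main

import "fmt"

// footprintMB estimates the pre-allocated size of one collection in MB.
// Assumption: every partition gets one data file plus one file per index.
func footprintMB(partitions, indexes, dataFileMB, indexFileMB int) int {
	return partitions * (dataFileMB + indexes*indexFileMB)
}

func main() {
	// Defaults: 32MB data file + 32MB id index, 12 partitions.
	fmt.Println(footprintMB(12, 1, 32, 32))
	// The tweaked constants from the wiki excerpt: 4MB data file + 1MB id index.
	fmt.Println(footprintMB(12, 1, 4, 1))
}
```

Under those assumptions the defaults come to 768MB for a single collection, matching the report, and the tweaked constants would bring the same setup down to 60MB.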

Those numbers were written down as constants because altering them after a collection has been created is quite a challenging task. If tiedot used a tree or skip list data structure for indexes, those huge numbers could be avoided. And that reminds me of the famous quote "today's constant is tomorrow's variable".

tiedot has seen very infrequent updates in recent months, so it may be viable to maintain a fork with tweaked constants. I hope that helps.

Feb 07 '17 09:02 HouzuoGuo

Is there any reason these couldn't be configured using environment variables? If you'd like me to make a pull request, I'm sure I could get around to it in the next couple of days.

Nov 14 '17 20:11 d1ngd0

It was an incorrect decision to make them constants in the beginning.

How about this: write the collection and hashtable parameters into a JSON or text file underneath the database directory. If the file exists, the parameters from the file will be used to operate on collections; if it does not exist, the default values (the current constants) will be used instead, and the file will be created with those default values written to it.

Nov 15 '17 08:11 HouzuoGuo

Sounds like a plan. I will start working on a PR.

Nov 15 '17 19:11 d1ngd0

Wonderful to hear! Many thanks for your help.

Nov 15 '17 20:11 HouzuoGuo

#157 Added this, let me know what you think.

Nov 24 '17 19:11 d1ngd0

import (
	"os"
	"path/filepath"
	"strconv"

	"github.com/HouzuoGuo/tiedot/db"
)

const (
	NUM_PARTS = 2 // Number of partitions to pre-configure
)

// writeDBConfig pre-writes the partition count file so that tiedot
// does not pre-allocate disk space for more partitions than needed.
func writeDBConfig(dbname string) error {
	num := []byte(strconv.Itoa(NUM_PARTS))
	numFile := filepath.Join("./db_base_path", dbname, db.PART_NUM_FILE)

	// Leave the file alone if it already exists
	if _, err := os.Stat(numFile); err == nil {
		return nil
	}
	// os.WriteFile replaces the deprecated ioutil.WriteFile (Go 1.16+)
	return os.WriteFile(numFile, num, 0600)
}

Sep 05 '18 03:09 yanmingsohu
