
High load distributed version

Open justanthonylee opened this issue 4 years ago • 12 comments

I love this project; I have been exploring it all day, and worst case it will serve as a base for what I need.

I am really looking for ideas or suggestions on how best to use it. I am working on a large indexing system to store and index pages as text. It will store longer strings in a "database" with a key for easy lookup and fetching, and page content can be split down to sentences and stored in multiple "stores". The only downside I am running into is access across multiple servers talking to the data at once; as far as I can see, since it is not an externally accessible database, this might be hard.

Any ideas, or is this something on the roadmap? I am looking at this route because search volume will be really low and highly cached (woot, already a check here), and most data will be cached once generated, but I am looking at 10-60k inserts a second.

Other than that this is amazing, and I will be messing with it for the next week.

justanthonylee avatar Jan 06 '20 06:01 justanthonylee

Hi @anthonyrossbach, I really appreciate that you are willing to invest your time in exploring the internals of SleekDB!

When I started working on SleekDB, my primary intention was to make it suitable for heavy read operations, as if it were fetching data from a static file (the cached JSON files)! So I stated on the website that:

SleekDB works great as the database engine for low to medium traffic websites.

You might have already realised that for each insert operation it creates a new JSON file behind the scenes. So, in your use case there would be 10k files created every second, and this might cause an inode issue.

For example, an inode is allocated to every file, so if you have a huge number of files, say 1 byte each, you will run out of inodes long before you run out of disk. I haven't faced it yet while using SleekDB, but this is what I keep thinking might go wrong when a large data set is being inserted. So I would recommend you run a similar experiment if possible :)
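If it helps with that experiment, here is a rough monitoring sketch of my own (not part of SleekDB): it counts the document files in a store directory and compares that with the free inodes reported by `df -i`. The store path is a made-up example, and the one-JSON-file-per-document layout is the assumption described above.

// Rough sketch: count document files of one store and show free inodes.
// $storeDir is hypothetical; adjust to wherever your store writes its files.
$storeDir = '/path/to/db/my_store/data';

$files = new FilesystemIterator($storeDir, FilesystemIterator::SKIP_DOTS);
$documentCount = iterator_count($files);

// Free inodes on the filesystem that holds the store (Linux `df` specific).
$dfOutput = shell_exec('df -i ' . escapeshellarg($storeDir));

echo "Documents in store: {$documentCount}\n";
echo $dfOutput;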

Besides that, I have also been thinking about a solution to this problem! As a result I came up with these very simple and basic ideas: https://github.com/rakibtg/SleekDB/projects/1#card-25190411 Here is the list (a rough code sketch of the idea follows right after it):

Single file based store system

  • [ ] Each store should have at least one file, split based on a file size limit, e.g. 200MB. So if we have 1GB of data then the store will have 5 files.
  • [ ] Each line should contain one document.
  • [ ] Cache files could also be combined into a single file based on a size limit, indexed by the hash token of a particular query.
  • [ ] Instead of traversing files we will traverse lines, and the data file will be consumed line by line instead of holding the full buffer in memory.
  • [ ] The same logic applies to update/create/delete operations.
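To make the idea concrete, here is a minimal sketch of such a line-delimited store. The class and method names are made up for illustration only (this is not SleekDB API), and the 200MB split size is just the example limit from the list above.

// Minimal sketch of a line-delimited store split into size-limited chunks.
class LineStore
{
    private $dir;
    private $maxBytes;

    public function __construct(string $dir, int $maxBytes = 200 * 1024 * 1024)
    {
        $this->dir = rtrim($dir, '/');
        $this->maxBytes = $maxBytes;
        if (!is_dir($this->dir)) {
            mkdir($this->dir, 0777, true);
        }
    }

    // Append one document as a single JSON line, rolling over to a new
    // chunk file once the current one exceeds the size limit.
    public function insert(array $document): void
    {
        $chunks  = glob($this->dir . '/chunk_*.jsonl');
        $current = $chunks ? end($chunks) : $this->dir . '/chunk_000001.jsonl';
        if (is_file($current) && filesize($current) >= $this->maxBytes) {
            $current = sprintf('%s/chunk_%06d.jsonl', $this->dir, count($chunks) + 1);
        }
        file_put_contents($current, json_encode($document) . "\n", FILE_APPEND | LOCK_EX);
    }

    // Stream documents line by line instead of loading whole files into memory.
    public function all(): \Generator
    {
        foreach (glob($this->dir . '/chunk_*.jsonl') as $chunk) {
            $fp = fopen($chunk, 'r');
            while (($line = fgets($fp)) !== false) {
                yield json_decode($line, true);
            }
            fclose($fp);
        }
    }
}

Reads then stay cheap even for large stores, because only one line is held in memory at a time.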

Or,

There could also be a better alternative solution to the inode issue that I am not aware of!

However, implementing these features will not be that easy, I think! If you have any ideas feel free to share them here. It would be really exciting if we could get SleekDB to handle this amount of requests.

Oh, and for a distributed system I think we could add some API to let the nodes communicate! What do you think?
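Just to illustrate the direction (nothing is decided, and none of these endpoints exist), each node could expose a tiny HTTP layer in front of its local store. `LocalStore` here is a hypothetical stand-in for whatever storage backend the node uses:

// Hypothetical per-node HTTP layer, e.g. run with: php -S 0.0.0.0:8080 api.php
require __DIR__ . '/LocalStore.php';   // assumed helper class, not a real file

$store = new LocalStore(__DIR__ . '/data/pages');
$path  = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

header('Content-Type: application/json');

if ($_SERVER['REQUEST_METHOD'] === 'POST' && $path === '/documents') {
    // Another node pushes a document to this node.
    $document = json_decode(file_get_contents('php://input'), true);
    $store->insert($document);
    echo json_encode(['status' => 'ok']);
} elseif ($_SERVER['REQUEST_METHOD'] === 'GET' && $path === '/documents') {
    // Another node reads this node's documents.
    echo json_encode($store->findAll());
} else {
    http_response_code(404);
    echo json_encode(['error' => 'not found']);
}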

Email me if required: [email protected]

rakibtg avatar Jan 06 '20 13:01 rakibtg

I love this idea and I am willing to make modifications and push changes like this. I used static file engines a lot early on, and I know from larger projects that I don't want to waste overhead on MySQL or similar engines when most data will only be read once the original data is written.

I will do some digging. I can think of a few ways to add the features you suggested, and maybe even a CRON queue where non-important inserts can be queued for insertion later when the store is free. This would stop file locks from being a problem with large inserts.
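As a rough illustration of that cron-queue idea (my own sketch; the queue file paths are made up, and the `SleekDB::store()` / `insert()` calls are my recollection of the v1 README, so adjust as needed): non-urgent writes are appended to a queue file, and a cron job drains it into the store.

// enqueue.php - called from the web request: just append the payload.
$payload = json_encode(['title' => 'Example page', 'body' => '...']) . "\n";
file_put_contents('/var/queue/inserts.jsonl', $payload, FILE_APPEND | LOCK_EX);

// drain.php - run from cron, e.g. * * * * * php /path/to/drain.php
// Swap the queue file for an empty one, then insert the drained lines.
$queue = '/var/queue/inserts.jsonl';
$batch = '/var/queue/inserts.' . time() . '.jsonl';

if (is_file($queue) && rename($queue, $batch)) {
    $store = \SleekDB\SleekDB::store('pages', '/var/db'); // API as of v1.x, adjust as needed
    foreach (file($batch, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $store->insert(json_decode($line, true));
    }
    unlink($batch);
}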

I will see what I can come up with, and I might just add the features if that's OK with you :).

justanthonylee avatar Jan 06 '20 17:01 justanthonylee

That would be great! 💪

Please send a PR, and if you need to discuss any part of the code or the idea feel free to ping me on Twitter or another platform of your choice.

Cheers

rakibtg avatar Jan 07 '20 12:01 rakibtg

Hi @anthonyrossbach, do you have any updates?

rakibtg avatar Feb 20 '20 05:02 rakibtg

I am still playing around with the idea; I got distracted by a lot of other projects. I may soon have a good use case beyond the original one.

justanthonylee avatar Feb 20 '20 05:02 justanthonylee

Great! I'm going to keep this issue Open.

rakibtg avatar Feb 20 '20 05:02 rakibtg

A distributed system is possible: for example a master node, a second node and other nodes; we could cluster horizontally and get effectively unlimited storage. As for inodes, that is a separate issue for web hosting; I think there should be two options, combining storage into a single file or keeping the default of one file per document.

derit avatar May 09 '20 14:05 derit

It is not a good idea to store medium / large texts or images in this type of database.

Extreme example: 500 objects like this:

$obj->id      = token(32);
$obj->content = $content; // a 5,000-character string

When saving and loading, more than 95% of the data that gets encoded, decoded, read and written is unnecessary. Better:

$text = new SideLoadObject('text', $content); // 5,000 characters
$obj->id      = token(32);
$obj->content = $text;

Internally, $obj->content then only holds a small reference:

$obj->content = [
    'sl_type' => 'text', // text / image / json (big object)
    'path'    => 'path/to/storage/',
    'file'    => md5($id . $content) . '.txt',
];

This keeps the size of the important files small! Furthermore, there is no reason to run JSON encoding or decoding on such a long text if it is stored in a .txt file anyway.

The main document must contain everything that serves as an index (filter criterion). A simple example:

$obj = new Document();
$obj->__id      = null; // autogenerated
$obj->username  = 'foobar Master';
$obj->firstname = 'Foo';
$obj->lastname  = 'Master';
$obj->ip        = '1.1.1.1';
$obj->log       = new SideLoadDocument('json', []);
$obj->tokens    = ['abc', 'def'];

$log = [
    'ip'         => '1.1.1.1',
    'token'      => 'abc',
    'date'       => time(),
    'user-agent' => 'foobar v42',
];

$obj->log->insert($log);
$obj->save();

You can also apply a filter to SideLoad objects. However, ALWAYS reduce the number of documents via the indexes first, and only then run the "subquery" on the side-loaded JSON:

$table->filter()
      ->where('ip')->equal('1.1.1.1')
      ->and('log')->child('date')->between(time(), time() - 500);

This allows you to pack gigabytes of data into small and performant indexes.

Steinweber avatar Jun 04 '20 11:06 Steinweber

I wanted to know how to make non-blocking file writes in PHP and found a nice article about the performance of non-blocking writes to a file with PHP:

https://grobmeier.solutions/performance-ofnonblocking-write-to-files-via-php-21082009.html

The author used a script that wrote 100 characters 10,000 times to a freshly created log file.

The two best solutions he came up with are:

// $file: target log file, $text: the 100-character payload,
// $loop starts at 0 and $count is the number of writes (10000 in the test)
$fp = fopen($file, 'a+');

while ($count > $loop) {
  if (flock($fp, LOCK_EX)) {   // blocking exclusive lock per write
    fwrite($fp, $text);
  }
  flock($fp, LOCK_UN);
  $loop++;
}
fclose($fp);

0.148998975754 seconds

$fp = fopen($file, 'a+');
stream_set_blocking($fp, 0);   // switch the stream to non-blocking mode

while ($count > $loop) {
  if (flock($fp, LOCK_EX)) {
    fwrite($fp, $text);
  }
  flock($fp, LOCK_UN);
  $loop++;
}
fclose($fp);

0.149605989456 seconds

I don't know if this helps with implementing the feature, but it sounds really nice and I hope it will be finished soon.

If you need any help feel free to contact me.

Timu57 avatar Sep 15 '20 09:09 Timu57

@Timu57 I think writing is not a big issue, but deleting and updating data is the most difficult part to handle if we target a single-file-based system.
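One common way to handle that in an append-only, line-delimited file (just a sketch of the general technique, not anything SleekDB does today) is to append update and tombstone records and rewrite the file during a periodic compaction. The sketch assumes each document line carries an `_id` key:

// Sketch: updates and deletes are appended as new lines; a periodic
// compaction rewrites the chunk keeping only the latest state per _id.
function compactChunk(string $chunkFile): void
{
    $latest = [];
    foreach (file($chunkFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $doc = json_decode($line, true);
        if (!empty($doc['_deleted'])) {
            unset($latest[$doc['_id']]);   // tombstone: drop the document
        } else {
            $latest[$doc['_id']] = $doc;   // later lines win (updates)
        }
    }

    $tmp = $chunkFile . '.tmp';
    $fp  = fopen($tmp, 'w');
    foreach ($latest as $doc) {
        fwrite($fp, json_encode($doc) . "\n");
    }
    fclose($fp);
    rename($tmp, $chunkFile);              // atomic swap on the same filesystem
}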

rakibtg avatar Sep 15 '20 15:09 rakibtg

Are distributed projects supported or are there plans to support them?

BinZhiZhu avatar Feb 04 '21 08:02 BinZhiZhu

So I searched for "HTTP API" and found this issue. I assume a PHP process that serves an HTTP API would address distribution?

rennokki avatar Feb 06 '21 07:02 rennokki