Processing big data
Hi,
I have a few tables in my database; the total size is 16 GB, about 23,000,000 rows.
I deployed it on AWS EC2 with 1 GB of RAM.
Does anyone have an idea how to index all of it as fast as possible?
Thank you
I have the same situation... calling `Model::addAllToIndex` causes a 502 gateway error; it basically times out.
The timeout error is caused by the server; it can't query all the data at once. There must be a proper way to index big data.
I don't know a better solution than using chunks for big data. I've got my own question with an example - you may take the example from there.
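A minimal sketch of the idea (assuming an Elasticquent-style `addToIndex` on the chunked collection; the model and chunk size here are illustrative):

```php
// index in chunks instead of loading every row into memory at once
\App\Models\User::chunk(1000, function ($users) {
    // the collection's addToIndex() sends the whole chunk in one request
    $users->addToIndex();
});
```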
Hi @Dmitri10, thanks for your reply. I have 23 million rows of data; when I try to add them to the index, one hour only indexes around 10k records. Can you imagine how many days it would take for 23 million?
@alongmuaz 10 thousand in one hour sounds too slow. I managed to sort my issue out, and this was the solution: 446 thousand records took about 3 minutes. This was in a command I could execute on the server, by the way.
```php
/*******************************ORGS*************************************/
$orgsApi = new OrgsAPI();
$this->info('Starting elastic indexing....');

$orgsApi->esCreateIndex();

$orgCount = DB::table('orgs')->count();
$this->info("Number of orgs: " . $orgCount);
$this->info('Adding org documents...');

$bar = $this->output->createProgressBar(1);
$orgsAmountPerChunk = 1000;
$orgsApi->esLoadData($orgsAmountPerChunk);
$this->comment("HP: Orgs Indexed");
$bar->advance();
$bar->finish();
$this->info(' HP Orgs Indexing Complete!');

$check = new DailyChecks();
$check->date = date('Y-m-d');
$check->process = "hp index - orgs";
$check->save();
```
Hi @scripta55, do you run this code using `php artisan tinker`? Thank you.
@alongmuaz yes, check this out: https://laravel.com/docs/5.3/artisan#writing-commands
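A minimal command skeleton (the class name and signature below are illustrative) might look like this; the indexing snippet above goes in `handle()`:

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

class IndexOrgs extends Command
{
    // run on the server with: php artisan elastic:index-orgs
    protected $signature = 'elastic:index-orgs';

    protected $description = 'Chunk-index all orgs into Elasticsearch';

    public function handle()
    {
        // the indexing code from the snippet above goes here
    }
}
```

In Laravel 5.3 you also register the class in the `$commands` array of `app/Console/Kernel.php`.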
@alongmuaz, @scripta55 is right - your operation is taking too long for 10k records. For example, here's mine; I've even got some extra logic:
```php
Route::get('fullRefreshIndex', function () {
    set_time_limit(0);
    ini_set('max_execution_time', 0);

    try {
        \App\Models\User::deleteIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('no such index');
    }

    try {
        \App\Models\User::createIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('cant create index');
    }

    $cards = \App\Models\CardType::whereStatus('active')->get();
    $cardsArray = [];
    $chunkSize = config('app.chunk_user_model');
    foreach ($cards as $card) {
        $cardsArray[] = $card->id;
    }

    $users = \App\Models\User::whereStatus('active')
        ->whereIn('card_type_id', $cardsArray);

    $users->chunk($chunkSize, function ($users) {
        print_r('new chunk');
        echo '<br/>';
        // $users->each(function ($user) {
        //     $user->addToIndex();
        // });
        /* just saw that this is a little faster than looping */
        $users->addToIndex();
    });
});
```
and for 160k records it took 4 minutes...
@scripta55, could you show us your `esLoadData` function, please?
`$orgsApi->esLoadData($orgsAmountPerChunk);`
@Dmitri10 could this be slowing it down?

```php
$users->each(function ($user) {
    $user->addToIndex();
});
```
You can use the bulk function, which basically takes the whole chunk and inserts it into Elasticsearch in one request. I also noticed that the bigger the chunk, the more time the worker takes to load it into memory before inserting, so the lower you can go on the chunk size, the better.
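For reference, here is a minimal sketch of a raw bulk insert with the official elasticsearch-php client, bypassing the model layer entirely (the default host, the index/type names, and the fields are assumptions):

```php
use Elasticsearch\ClientBuilder;
use Illuminate\Support\Facades\DB;

$client = ClientBuilder::create()->build();

// send each chunk as a single bulk request instead of one request per row
DB::table('orgs')->orderBy('id')->chunk(1000, function ($orgs) use ($client) {
    $params = ['body' => []];

    foreach ($orgs as $org) {
        // action metadata line, then the document source
        $params['body'][] = ['index' => ['_index' => 'orgs', '_type' => 'org', '_id' => $org->id]];
        $params['body'][] = ['name' => $org->name];
    }

    $client->bulk($params);
});
```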
As for the function in `esLoadData`:

```php
ReviewsElasticSearch::chunk($chunkamount, function ($flights) {
    $flights->addToIndex();
});
```
@scripta55 Thanks. Yes, I updated my code before your answer (in the comments) and showed another example using `addToIndex` on the builder 4 days ago, but small chunks don't work any faster for me. So if you're not doing anything new... then it seems your server just has more RAM 👍
@Dmitri10 possibly so; I ran this test on a VirtualBox VM with 2 GB of RAM.
You can also index directly from SQL into Elasticsearch using JDBC - could that work for you? It's what I intend to do at a later stage.
@scripta55 Yes, I've got only 1 GB of RAM) Thanks a lot for the JDBC tip - I didn't know anything about it!
Enjoy! It would be a great solution without the overhead of Laravel.
@scripta55, do you mean Logstash with JDBC?
This one: https://github.com/jprante/elasticsearch-jdbc
There are cool examples with it, and it's standalone; it runs on your server/box independently.
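For reference, an importer run is a JSON definition piped into the standalone tool, roughly along the lines of the project's README example (the JDBC URL, credentials, SQL, and index names below are placeholders, and the exact invocation varies by release):

```bash
echo '{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:mysql://localhost:3306/mydb",
        "user" : "dbuser",
        "password" : "secret",
        "sql" : "select * from orgs",
        "index" : "orgs",
        "type" : "org"
    }
}' | java -cp "lib/*" org.xbib.tools.Runner org.xbib.tools.JDBCImporter
```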
Cool library, except for one thing - there are too many open issues right now, and you may hit one of them and have to wait for a fix...
@Dmitri10 True, however most of those issues are just a lack of understanding on the users' part. It's a really straightforward solution to execute :) When I tried it out, I had lots of questions simply because my mindset wasn't on track with how it works and what it's supposed to do. Give it a go!
Thanks once more anyway! If I have free time, I'll try it.
@scripta55, what is `OrgsAPI()` referring to?
@alongmuaz, it's his own class where he added his functions `esCreateIndex()`, `esLoadData()`, etc.
And he already told you about the main function for indexing data, using the model and chunks:

```php
ReviewsElasticSearch::chunk($chunkamount, function ($flights) {
    $flights->addToIndex();
});
```
Hi @scripta55, I got this error when using your code:

`[Elasticsearch\Common\Exceptions\ServerErrorResponseException]`
Hi @alongmuaz, are you able to curl the Elasticsearch server from outside your Homestead box, or do an HTTP GET against it?
Your result should look like this: http://d.pr/i/1hExn
Hi @scripta55, yes. Actually, according to AWS ES there is data being inserted into the nodes; the error appeared suddenly after running your code for about 10 minutes.
Try `http://192.168.10.10:9200/YOUR-INDEX-NAME/_count`
Please share the results you get from there. Also, how many records are you intending to index into Elasticsearch?
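For reference, `_count` returns a small JSON document, so the result should have roughly this shape (the numbers below are illustrative):

```
$ curl -XGET 'http://192.168.10.10:9200/YOUR-INDEX-NAME/_count?pretty'
{
  "count" : 12345,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
```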
http://d.pr/i/1cTMd
Did you load 2,775,000 on the first try? What is your total? And the server's memory/disk space isn't all used up?
Yes, the total is 3,718,988, and the server has 4 GB of RAM.
Hmm... could you provide your stack trace, please?