
Processing a big data

Open alongmuaz opened this issue 9 years ago • 42 comments

Hi,

I have a few tables in the database; the total size is 16 GB, 23,000,000 rows.

It is deployed on AWS EC2 with 1 GB of RAM.

Does anyone have an idea how to index all of it as fast as possible?

Thank you.

alongmuaz avatar Aug 23 '16 08:08 alongmuaz

I have the same situation... model::addAllToIndex causes a 502 gateway error; basically it times out.

scripta55 avatar Aug 25 '16 16:08 scripta55

The timeout error is caused by the server: it can't query all the data at once. There must be a proper way to index big data.

alongmuaz avatar Aug 26 '16 06:08 alongmuaz

I don't know a better solution than using chunks for big data. I've got my own question with an example; you may take the example from there.

Dmitri10 avatar Sep 05 '16 09:09 Dmitri10

Hi @Dmitri10, thanks for your reply. I have 23 million rows of data. When I try to add them to the index, 1 hour is enough to index around 10k records. Can you imagine how many days it would take for 23 million rows?

alongmuaz avatar Sep 06 '16 07:09 alongmuaz
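As a rough back-of-the-envelope check of the numbers in this thread, the sketch below estimates total indexing time at a steady rate. `estimateHours()` is a hypothetical helper (not part of Elasticquent), and it assumes throughput stays constant, which may not hold on a small instance.

```php
<?php
// Back-of-the-envelope estimate: hours needed to index a table at a steady rate.
// estimateHours() is a hypothetical helper, not part of Elasticquent.
function estimateHours(int $totalRows, float $rowsPerHour): float
{
    if ($rowsPerHour <= 0) {
        throw new InvalidArgumentException('rowsPerHour must be positive');
    }
    return $totalRows / $rowsPerHour;
}

$totalRows = 23000000;

// Rate reported above: about 10,000 records per hour.
$slowHours = estimateHours($totalRows, 10000);
printf("At 10k/hour: %.0f hours (about %.1f days)\n", $slowHours, $slowHours / 24);

// Rate reported in the next reply: about 446,000 records in 3 minutes.
$fastHours = estimateHours($totalRows, 446000 * 60 / 3);
printf("At 446k per 3 min: %.1f hours\n", $fastHours);
```

At 10k records per hour this works out to roughly 96 days, while the 446k-per-3-minutes rate would finish in under 3 hours, so the gap is worth chasing before buying a bigger server.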

@alongmuaz 10 thousand in one hour sounds too slow. I managed to sort my issue out, and this was the solution: 446 thousand records took about 3 mins. This was in a command I could execute on the server, by the way.

` /*******************************ORGS*************************************/
    $orgsApi = new OrgsAPI();
    $this->info('Starting elastic indexing....');

    $orgsApi->esCreateIndex();

    $orgCount = DB::table('orgs')->count();

    $this->info("Number of orgs: " . $orgCount);
    $this->info('Adding org documents...');
    $bar = $this->output->createProgressBar(1);

        $orgsAmountPerChunk  = 1000;
        $orgsApi->esLoadData($orgsAmountPerChunk);

        $this->comment("HP: Orgs Indexed");

        $bar->advance();

    $bar->finish();
    $this->info(' HP Orgs Indexing Complete!');

    $check          = new DailyChecks();
    $check->date    = date('Y-m-d');
    $check->process = "hp index - orgs";
    $check->save();`

scripta55 avatar Sep 06 '16 07:09 scripta55

Hi @scripta55, do you run this code using php artisan tinker? Thank you.

alongmuaz avatar Sep 06 '16 08:09 alongmuaz

@alongmuaz yes, check this out: https://laravel.com/docs/5.3/artisan#writing-commands

scripta55 avatar Sep 06 '16 08:09 scripta55

@alongmuaz, @scripta55 is right: your operation takes too long for 10k records. For example, I've got extra logic:

Route::get('fullRefreshIndex', function () {
    set_time_limit(0);
    ini_set('max_execution_time', 0);

    try {
        \App\Models\User::deleteIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('no such index');
    }

    try {
        \App\Models\User::createIndex();
    } catch (\Elasticsearch\Common\Exceptions\Missing404Exception $e) {
        print_r('cant create index');
    }

    $cards = \App\Models\CardType::whereStatus('active')->get();
    $cardsArray = [];
    $chunkSize = config('app.chunk_user_model');
    foreach ($cards as $card) {
        $cardsArray[] = $card->id;
    }
    $users = \App\Models\User::whereStatus('active')
        ->whereIn('card_type_id', $cardsArray);

    $users->chunk($chunkSize, function ($users) {
        print_r('new chunk');
        echo '<br/>';

        // Looping over each user works, but is slower:
        // $users->each(function ($user) {
        //     $user->addToIndex();
        // });
        // Indexing the whole chunk at once is a little faster than looping.
        $users->addToIndex();
    });
});

and for 160k records it took 4 minutes... @scripta55, could you show us your esLoadData function, please? $orgsApi->esLoadData($orgsAmountPerChunk);

Dmitri10 avatar Sep 06 '16 08:09 Dmitri10

@Dmitri10 could this be slowing it down?

    $users->each(function ($user) {
        $user->addToIndex();
    });

You can use the bulk function, which basically takes the chunk and inserts it into Elastic. I also noticed that the bigger the chunk, the more time the worker takes to load it into memory before inserting, so the smaller you can make the chunk, the better.

As for the function in esLoadData:

ReviewsElasticSearch::chunk($chunkamount, function ($flights) {
    $flights->addToIndex();
});

scripta55 avatar Sep 06 '16 08:09 scripta55
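For readers wondering what "bulk" means at the wire level, here is a minimal sketch of the newline-delimited body that the Elasticsearch `_bulk` endpoint accepts, built one chunk at a time. `buildBulkBody()`, the `orgs` index name and the sample rows are all hypothetical and only illustrate the request shape; this is not Elasticquent's own code. Older Elasticsearch versions, such as those current when this thread was written, also expect a `_type` in the action line.

```php
<?php
// Build an Elasticsearch _bulk request body (NDJSON) for one chunk of rows.
// buildBulkBody(), the index name and the sample rows are hypothetical.
function buildBulkBody(string $index, array $rows, string $idKey = 'id'): string
{
    $lines = [];
    foreach ($rows as $row) {
        // Action line: which index and document ID to write.
        $lines[] = json_encode(['index' => ['_index' => $index, '_id' => $row[$idKey]]]);
        // Source line: the document itself.
        $lines[] = json_encode($row);
    }
    // The bulk API requires every line, including the last, to end with a newline.
    return $lines ? implode("\n", $lines) . "\n" : '';
}

// Split a row set into chunks and produce one bulk body per chunk.
$rows = [
    ['id' => 1, 'name' => 'Org A'],
    ['id' => 2, 'name' => 'Org B'],
    ['id' => 3, 'name' => 'Org C'],
];
foreach (array_chunk($rows, 2) as $chunk) {
    echo buildBulkBody('orgs', $chunk);
}
```

One request per chunk keeps memory bounded by the chunk size, which matters on the 1 GB and 2 GB machines discussed here; the best chunk size is something to measure rather than assume.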

@scripta55 Thanks. Yes, I updated my code before your answer (see the comments in it) and showed another example of calling addToIndex on the builder 4 days ago, but small chunks don't work faster for me. So if you aren't using anything new, it seems your server has more RAM 👍

Dmitri10 avatar Sep 06 '16 08:09 Dmitri10

@Dmitri10 possibly so; I ran this test on a virtual box with 2 GB of RAM.

You can also index directly from SQL to Elastic; could that work for you? Using JDBC is what I intend to do at a later stage.

scripta55 avatar Sep 06 '16 09:09 scripta55

@scripta55 Yes, I've got only 1 GB of RAM) Thanks a lot for the JDBC tip, I didn't know anything about it!

Dmitri10 avatar Sep 06 '16 09:09 Dmitri10

Enjoy! It would be a great solution without the overhead of Laravel.

scripta55 avatar Sep 06 '16 09:09 scripta55

@scripta55 , you mean Logstash - JDBC?

alongmuaz avatar Sep 06 '16 09:09 alongmuaz

this one https://github.com/jprante/elasticsearch-jdbc

scripta55 avatar Sep 06 '16 09:09 scripta55

There are cool examples with it. It's standalone; it runs on your server/box independently.

scripta55 avatar Sep 06 '16 09:09 scripta55

Cool library except for one thing: there are too many open issues right now, and you may hit one of them and have to wait for a fix...

Dmitri10 avatar Sep 06 '16 09:09 Dmitri10

@Dmitri10 true, however most of those issues are just a lack of understanding from users. It's a really straightforward solution to run :) When I tried it out, I had lots of questions simply because my mindset was not on track with how it works and what it is supposed to do. Give it a go!

scripta55 avatar Sep 06 '16 09:09 scripta55

Thanks once more anyway! If I have free time I'll try it.

Dmitri10 avatar Sep 06 '16 09:09 Dmitri10

@scripta55, what does OrgsAPI() refer to?

alongmuaz avatar Sep 07 '16 04:09 alongmuaz

@alongmuaz, it's his own class where he added his functions esCreateIndex(), esLoadData(), etc.

And he told you about the main function for indexing data using a model and chunks:

as for function in esLoadData: ReviewsElasticSearch::chunk($chunkamount, function ($flights) { $flights->addToIndex(); });

Dmitri10 avatar Sep 07 '16 08:09 Dmitri10

Hi @scripta55, I got this error when using your code:

[Elasticsearch\Common\Exceptions\ServerErrorResponseException]

alongmuaz avatar Sep 08 '16 07:09 alongmuaz

Hi @alongmuaz, are you able to curl the Elastic server from outside your Homestead, or do an HTTP GET against Elastic?

scripta55 avatar Sep 08 '16 08:09 scripta55

your result should be like this: http://d.pr/i/1hExn

scripta55 avatar Sep 08 '16 08:09 scripta55

Hi @scripta55, yes. Actually, according to AWS ES, there is data inserted into the nodes. The error shows up suddenly, about 10 minutes into running your code.

alongmuaz avatar Sep 08 '16 08:09 alongmuaz

http://192.168.10.10:9200/YOUR-INDEX-NAME/_count

Please share the results you get from there. Also, how many records do you intend to index into Elastic?

scripta55 avatar Sep 08 '16 08:09 scripta55

http://d.pr/i/1cTMd

alongmuaz avatar Sep 08 '16 08:09 alongmuaz

Did you load 2775000 on the first try? What is your total? And is the server memory/disk space not all used up?

scripta55 avatar Sep 08 '16 08:09 scripta55

Yes, the total is 3718988; RAM is 4 GB.

alongmuaz avatar Sep 08 '16 08:09 alongmuaz
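To put those two numbers side by side, here is a small sketch that reads a `_count` response and reports how far the indexing run got. `indexingProgress()` is a hypothetical helper, and the JSON is a made-up sample shaped like a `_count` response; it assumes the 2775000 mentioned above is the `count` value returned by the endpoint.

```php
<?php
// Parse an Elasticsearch _count response and report indexing progress.
// indexingProgress() is a hypothetical helper; the JSON below is a made-up
// sample shaped like a _count response, using the numbers quoted in the thread.
function indexingProgress(string $countJson, int $expectedTotal): array
{
    $data = json_decode($countJson, true);
    if (!is_array($data) || !isset($data['count'])) {
        throw new RuntimeException('Unexpected _count response');
    }
    $indexed = (int) $data['count'];
    return [
        'indexed'   => $indexed,
        'remaining' => max(0, $expectedTotal - $indexed),
        'percent'   => $expectedTotal > 0 ? round(100 * $indexed / $expectedTotal, 1) : 0.0,
    ];
}

$sample = '{"count":2775000,"_shards":{"total":5,"successful":5,"failed":0}}';
$progress = indexingProgress($sample, 3718988);
printf(
    "%d indexed, %d remaining (%.1f%%)\n",
    $progress['indexed'],
    $progress['remaining'],
    $progress['percent']
);
```

With these figures about three quarters of the rows made it in before the failure, so the run died partway through rather than at startup, which is why the trace is the next thing to look at.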

Mhh... could you please provide your trace?

scripta55 avatar Sep 08 '16 08:09 scripta55