
Model chunk uses "select * from table"

Flupster opened this issue 7 years ago · 20 comments

I have a table with >10M rows, so this fills the RAM and then fails.

Also, I could not find any cursor options in the docs, unless I'm blind?

Flupster avatar Sep 07 '18 10:09 Flupster

chunk is run on the collection that is returned from the result, not on the query itself

josephmancuso avatar Sep 07 '18 12:09 josephmancuso

I'll need to see the code you are using, but you might be after something like:

    Model.limit(100).get().chunk(10)

josephmancuso avatar Sep 07 '18 12:09 josephmancuso

This is what I feared. How would you go about iterating over a 10-million-row table?

I've created a workaround myself, but I'm looking for best practices:

    # Keyset pagination: walk the table 100 rows at a time, keyed on id.
    last_id = 0  # assumes positive integer ids; start before the first row

    while True:
        characters = Character.order_by('id', 'asc').where('id', '>', last_id).take(100).get()
        if not len(characters):
            break  # no rows left

        for character in characters:
            pass  # process each character here

        last_id = characters[-1].id
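
Wrapped as a generator, the same keyset idea looks like this (a sketch; `iterate_by_id` is just an illustrative name, not an Orator method):

    def iterate_by_id(model, size=100):
        # Keyset pagination: seek past the last-seen id instead of using
        # OFFSET, so each page is an indexed range scan on the primary key.
        last_id = 0
        while True:
            rows = model.order_by('id', 'asc').where('id', '>', last_id).take(size).get()
            if not len(rows):
                return
            for row in rows:
                yield row
            last_id = rows[-1].id

    for character in iterate_by_id(Character):
        pass  # rows are fetched 100 at a time, never the whole table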

I'm pretty sure Laravel's Eloquent pulls the rows dynamically rather than fetching them all and then iterating.

Flupster avatar Sep 07 '18 13:09 Flupster

Are you trying to create a generator here?

josephmancuso avatar Sep 07 '18 13:09 josephmancuso

Are you saying you don't want to store 10M rows in memory?

josephmancuso avatar Sep 07 '18 13:09 josephmancuso

Or that you don't want to fetch all 10M rows from the database?

josephmancuso avatar Sep 07 '18 13:09 josephmancuso

Chunking the result should mean pulling n rows at a time from the database; at least, that's how I'd expect it to work.

Model.chunk(100) would pull 100 rows, and you could loop through the table by offsetting past the rows you've already seen:

    select * from table
    {query here if it exists, i.e. where x = y}
    order by {primary key} asc
    limit 100 offset ({iterator} - 1) * 100

When you reach the end (0 results), you just break out and the chunk becomes None.

I don't know if I'm explaining it well, but with this method you wouldn't lock the application up while iterating over every entry in a large table.

My way would start processing rows immediately. The way it works at the moment (select * from table;) has to wait 10 minutes to pull the entire table down into memory (if you have enough).
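
A minimal sketch of that offset-based approach as a Python generator; `chunk_by_offset` is an illustrative name, not Orator API, and it assumes the query builder's offset()/limit() methods:

    def chunk_by_offset(make_query, size=100):
        # Offset pagination: each iteration issues its own LIMIT/OFFSET
        # query, so at most `size` rows are held in memory at once.
        page = 0
        while True:
            rows = make_query().offset(page * size).limit(size).get()
            if not len(rows):
                break  # 0 results: end of table, stop iterating
            yield rows
            page += 1

    # Pass a factory so every page starts from a fresh query builder.
    for characters in chunk_by_offset(lambda: Character.order_by('id', 'asc')):
        for character in characters:
            pass  # rows start arriving immediately

One caveat: OFFSET gets slower as it grows, because the server still scans the skipped rows, which is why the id-based workaround above seeks past the last-seen key instead.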

Flupster avatar Sep 08 '18 06:09 Flupster

Is there any way I can get a copy of a 10M-row database? I'm wondering if a generator would work here.

josephmancuso avatar Sep 10 '18 14:09 josephmancuso

I'd prefer not to send a 4 GB SQL file; not only would it take forever to import, it would be far easier to generate a new table and seed it with data.

Doesn't Orator support seeding?

Flupster avatar Sep 10 '18 19:09 Flupster

Have you tried using a generator yet?

    def func():
        yield Model.all().chunk(2)

This won't load all 10M rows into memory, since it's a generator.

josephmancuso avatar Sep 11 '18 01:09 josephmancuso

Sorry for the delay; this still calls SELECT * FROM table:

    ID      USER    HOST       DB      COMMAND  TIME  STATE         Info                         TIME_MS
    787187  evequi  localhost  evequi  Query    5     Sending data  SELECT * FROM `characters`   5005.429

    def testing():
        yield Character.all().chunk(100)

    if __name__ == "__main__":
        for characters in testing():
            print(characters)

Flupster avatar Sep 17 '18 03:09 Flupster

Hmm, I'm not sure. Try:

    for characters in Character.chunk(100):
        print(characters)

josephmancuso avatar Sep 17 '18 03:09 josephmancuso

Unless you said you tried that already? If the above code snippet doesn't work, then I'm not sure Orator supports what you are trying to do.

josephmancuso avatar Sep 17 '18 03:09 josephmancuso

all() will execute a SELECT * FROM table and return a Collection object.

What you want is actually:

    for characters in Character.order_by('id', 'asc').chunk(100):
        # Do something with the characters

See the documentation: https://orator-orm.com/docs/0.9/orm.html#chunking-results and https://orator-orm.com/docs/0.9/query_builder.html#chunking-results-from-a-table

sdispater avatar Sep 17 '18 19:09 sdispater

We've gone full circle

    for characters in Character.order_by('id', 'asc').chunk(100):
        print(characters)

This runs a SELECT * FROM characters, once again loading all 10M rows into memory before chunking.

Flupster avatar Sep 18 '18 10:09 Flupster

But you are incorrect, because that returns a generator. Generators do not load everything into memory; they pull results as they are needed, while you iterate over them. That is why they are called generators: they generate the results as they are needed.

This code section:

    for characters in Character.order_by('id', 'asc').chunk(100):
        print(characters)

does not load 10M rows into memory. It pulls 10M rows from the database, but hands them back through a generator, which does not keep everything in memory.

https://wiki.python.org/moin/Generators
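
For reference, a minimal illustration of the lazy evaluation being described; note this only covers what the generator itself holds, not whatever the database driver buffers while producing the values:

    def count_up():
        # Infinite generator: the body only runs when next() is called.
        n = 0
        while True:
            n += 1
            yield n

    gen = count_up()   # nothing computed yet, near-zero memory
    print(next(gen))   # 1, produced on demand
    print(next(gen))   # 2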

josephmancuso avatar Sep 18 '18 14:09 josephmancuso

You can close this if you want, but as it stands the chunk helper is absolutely useless. You say the result doesn't get stored in memory; it's insane that you even think that.

https://i.imgur.com/g1XFWx8.png

You need to re-implement chunking, or add a cursor option, so there is a way to iterate over a table WITHOUT loading it entirely into memory. It took 5 minutes to get that screenshot; if we did this in Laravel it would take milliseconds.

Again, I'm done trying to help from here, and good luck, but closing an issue because you can't be bothered to fix it is stupid.

And I quote:

> Chunking Results
> If you need to process a lot of records, you can use the chunk method to avoid consuming a lot of RAM:

Flupster avatar Sep 19 '18 07:09 Flupster

A generator doesn't necessarily store anything substantial in memory; a generator over something like 52! results still takes up only around 500 bytes. The problem here might be that getting those values into the generator in the first place is taking up a lot of memory. Any problems can likely be solved in here: https://github.com/sdispater/orator/blob/68266dd62b795a34c85843dd902883ff7fc50d93/orator/orm/builder.py#L225

@sdispater and I will gladly look at any pull requests fixing this issue.

josephmancuso avatar Sep 19 '18 12:09 josephmancuso

@Flupster Orator is already using cursors and the fetchmany() method when using chunk().

https://github.com/sdispater/orator/blob/0.9/orator/query/builder.py#L1163

https://github.com/sdispater/orator/blob/0.9/orator/connections/connection.py#L221

So, I am not sure what's going on here.

And unless I missed it, we are missing a critical piece of information here: which RDBMS are you using?
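
(One detail worth noting for anyone debugging this: fetchmany() only limits how many rows cross the DB-API boundary per call. With MySQL's default client-side buffered cursors, the driver still downloads the entire result set before the first fetch, which would match the behaviour in the screenshot. A sketch of streaming with an unbuffered cursor, using PyMySQL's SSCursor purely as an illustration; connection parameters are placeholders:)

    import pymysql
    from pymysql.cursors import SSCursor  # server-side (unbuffered) cursor

    conn = pymysql.connect(host='localhost', user='user', password='secret',
                           database='evequi', cursorclass=SSCursor)

    with conn.cursor() as cursor:
        cursor.execute('SELECT * FROM characters')
        while True:
            rows = cursor.fetchmany(100)  # rows stream from the server
            if not rows:
                break
            for row in rows:
                pass  # process each row without buffering the table

    conn.close()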

sdispater avatar Sep 19 '18 15:09 sdispater

I wish there were a method that fetched the data piece by piece instead of all at once, with results that can be consumed in a loop, like:

    select * from table limit 0, 10
    select * from table limit 10, 10
    select * from table limit 20, 10
    ...

mouday avatar Dec 16 '19 06:12 mouday