
MySQL result backend defaults to BLOB which may be too short for some results (LONGBLOB)

Open ghost opened this issue 14 years ago • 20 comments

I was running into errors caused by loading truncated pickled results. Specifically, I would get an EOFError (see below). I eventually figured out that the result column of the celery_taskmeta MySQL table was not able to hold all of the pickled result data. By altering the table to use LONGBLOB instead of BLOB, I was able to deal with the large results (alter table celery_taskmeta modify result LONGBLOB). Is there a way to make this the default type in the future?

[2011-08-31 13:34:13,605: ERROR/PoolWorker-1] 
Traceback (most recent call last):
  File "/home/ubuntu/w/apis/trunk/apis/celerytasks/tasks/file_task_base.py", line 293, in process
    self.process_records()
  File "/home/ubuntu/w/apis/trunk/apis/celerytasks/tasks/proxmatch_file.py", line 381, in process_records
    self.wait_results(async_result_buffer)
  File "/home/ubuntu/w/apis/trunk/apis/celerytasks/tasks/proxmatch_file.py", line 435, in wait_results
    async_result.wait()
  File "/usr/local/lib/python2.6/dist-packages/celery/result.py", line 87, in wait
    return self.get(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/celery/result.py", line 83, in get
    interval=interval)
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/base.py", line 110, in wait_for
    status = self.get_status(task_id)
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/base.py", line 206, in get_status
    return self.get_task_meta(task_id)["status"]
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/base.py", line 227, in get_task_meta
    meta = self._get_task_meta_for(task_id)
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/database.py", line 61, in _get_task_meta_for
    task = session.query(Task).filter(Task.task_id == task_id).first()
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1770, in first
    ret = list(self[0:1])
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1667, in __getitem__
    return list(res)
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1960, in instances
    rows = [process[0](row, None) for row in fetch]
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/mapper.py", line 2600, in _instance
    populate_state(state, dict_, row, isnew, only_load_props)
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/mapper.py", line 2454, in populate_state
    populator(state, dict_, row)
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/strategies.py", line 150, in new_execute
    dict_[key] = row[col]
  File "/usr/local/lib/python2.6/dist-packages/celery/db/a805d4bd.py", line 55, in process
    return loads(value)
EOFError

ghost avatar Aug 31 '11 19:08 ghost

I think we'd prioritise small results first, as this is the most common use case, but if a blob is not big enough to hold "reasonably" sized results, then longblob would be a good default.

Could we find out the current limit for blobs in mysql? What are the space implications for smaller results?

ask avatar Aug 31 '11 21:08 ask

The MySQL manual explains why no error was raised when a pickled value longer than the BLOB column's maximum length was inserted:

http://dev.mysql.com/doc/refman/5.5/en/char.html

If strict SQL mode is not enabled and you assign a value to a CHAR or VARCHAR column that exceeds the column's maximum length, the value is truncated to fit and a warning is generated. For truncation of nonspace characters, you can cause an error to occur (rather than a warning) and suppress insertion of the value by using strict SQL mode. See Section 5.1.6, “Server SQL Modes”.

ghost avatar Aug 31 '11 21:08 ghost

This page has the allowed dimensions:

http://dev.mysql.com/doc/refman/5.5/en/string-type-overview.html

BLOB: 65,535 (2^16 – 1) bytes
MEDIUMBLOB: 16,777,215 (2^24 – 1) bytes
LONGBLOB: 4,294,967,295 or 4 GB (2^32 – 1) bytes

I don't think there is much overhead in switching between these types since they only take up the space one actually uses plus a few bytes to track the length. So I wouldn't expect smaller results to be impacted.

ghost avatar Aug 31 '11 21:08 ghost

Yeah, I know about strict mode; the fact that it's not enabled by default is kind of unique to MySQL, and often criticised. It's one of the first things I enable on new MySQL installs.

A 16-bit length limit (64 KB) is very small indeed; we should definitely increase that.

It seems that a BLOB in Oracle is 2**32 by default: http://ss64.com/ora/syntax-datatypes.html In PostgreSQL it seems that the bytea type is preferred, and that it is of unlimited size. Maybe BLOB maps to this field directly. I couldn't find any information about whether any of the SQL standards describe dimensions for the BLOB field type.

django-celery uses django-picklefield, and pickled data is stored in a text field there: https://github.com/shrubberysoft/django-picklefield/blob/master/src/picklefield/fields.py I checked django.db.backends.mysql.creation and TextField maps to LONGTEXT on MySQL.

picklefield also uuencodes the data though, and there is an issue open for the sqlalchemy backend to use the same method so the backends can be interchangeable: #183

I think maybe we should resolve issue #183 at the same time so we don't have to change the types that often, which means changing the BLOB to a LONGTEXT.
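
For context, a rough Python sketch of that text-encoding idea (pickle the value, then base64-encode it so it fits a TEXT/LONGTEXT-style column); the exact encoding picklefield uses may differ, and the helper names here are hypothetical:

import base64
import pickle

def encode_result(value):
    # Pickle the value, then encode as ASCII text so it can be stored
    # in a TEXT/LONGTEXT column instead of a binary BLOB.
    return base64.b64encode(pickle.dumps(value)).decode("ascii")

def decode_result(text):
    return pickle.loads(base64.b64decode(text.encode("ascii")))

# Round trip:
assert decode_result(encode_result({"status": "SUCCESS", "result": 42})) == {"status": "SUCCESS", "result": 42}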

ask avatar Sep 01 '11 12:09 ask

On further inspection this is behavior from SQLAlchemy, where sqlalchemy.types.PickleType is a LargeBinary field that resolves to a BLOB when using MySQL. Sadly there is no generic data type in SQLAlchemy that is larger than LargeBinary, so if we used the MySQL-specific LONGBLOB it would not be compatible with all the other supported databases.

I think sqlalchemy must be lobbied to change the default large binary type for MySQL, or people depending on large results must alter the table :(
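
One possible schema-level workaround, sketched here against a hypothetical table rather than the backend's actual definition, is SQLAlchemy's with_variant(): the column stays a generic LargeBinary everywhere but renders as LONGBLOB when the MySQL dialect emits the DDL.

import sqlalchemy as sa
from sqlalchemy.dialects.mysql import LONGBLOB

# Generic binary type on every database, LONGBLOB only when MySQL renders the DDL.
big_binary = sa.LargeBinary().with_variant(LONGBLOB(), "mysql")

metadata = sa.MetaData()
taskmeta_demo = sa.Table(
    "taskmeta_demo",  # hypothetical demo table, not celery_taskmeta
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("result", big_binary),
)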

ask avatar Aug 30 '12 12:08 ask

Was there a resolution on this before it was closed?

ericraymond avatar Nov 28 '18 22:11 ericraymond

@ericraymond the database result backend still uses PickleType, which maps to a LargeBinary field in SQLAlchemy as far as I can tell. I suggest that you review the previous comments and if you can provide or propose a solution please feel free to open a new issue.

georgepsarakis avatar Nov 29 '18 06:11 georgepsarakis

@georgepsarakis I see your point.

Celery could use the MySQL LONGBLOB type directly, but that would break the abstraction Celery gets from using SQLAlchemy to support multiple database dialects. The other alternatives are for SQLAlchemy to change the MySQL dialect mapping or to create a new (longblob) MySQL dialect.

ericraymond avatar Nov 29 '18 09:11 ericraymond

@ericraymond I think I came up with a possible solution. We can redefine the field as LargeBinary and handle the pickling part separately (before storing the result). From the documentation, you can pass a length parameter to the field definition, which MySQL will then map to the corresponding type (MEDIUMBLOB, LONGBLOB) that can contain binary strings of this length:

-- Set the maximum length to 4GB
mysql> CREATE TABLE results(serialized_result BLOB(4294967295));
Query OK, 0 rows affected (0.02 sec)

mysql> SHOW CREATE TABLE results\G
*************************** 1. row ***************************
       Table: results
Create Table: CREATE TABLE `results` (
  `serialized_result` longblob
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
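
On the SQLAlchemy side, that idea might look roughly like the sketch below. The table and helper names are hypothetical, and whether the length is honoured in the generated DDL depends on the dialect, as the raw SQL above illustrates for MySQL:

import pickle
import sqlalchemy as sa

metadata = sa.MetaData()

# LargeBinary with an explicit ~4 GB length, intended to map to LONGBLOB on
# MySQL, mirroring the BLOB(4294967295) DDL shown above.
results_demo = sa.Table(
    "results_demo",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("serialized_result", sa.LargeBinary(length=2**32 - 1)),
)

# Pickling handled by the application rather than by a PickleType column:
def store_result(conn, value):
    conn.execute(results_demo.insert().values(serialized_result=pickle.dumps(value)))

def load_result(conn, row_id):
    raw = conn.execute(
        sa.select(results_demo.c.serialized_result).where(results_demo.c.id == row_id)
    ).scalar_one()
    return pickle.loads(raw)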

What do you think?

georgepsarakis avatar Nov 30 '18 05:11 georgepsarakis

Looks promising! Nice.

ericraymond avatar Nov 30 '18 09:11 ericraymond

@auvipy Still working on this? If not I'd be happy to take a look.

remeika avatar Oct 06 '19 14:10 remeika

please feel free

auvipy avatar Oct 06 '19 15:10 auvipy

I am making progress here, but just wanted to let folks know that moving from BLOB to LONGBLOB storage in MySQL will increase the per-row length overhead from 2 to 4 bytes.

remeika avatar Oct 09 '19 03:10 remeika

@remeika I can't imagine this being an issue (2 extra bytes for a row of multiple KB/MB). If you would like some guidance, you can open a draft Pull Request and we can discuss the implementation there.

georgepsarakis avatar Oct 09 '19 05:10 georgepsarakis

Any progress on this? Can it be planned for a future release?

ericraymond avatar Sep 21 '20 18:09 ericraymond

Hoping this is part of the 5.1 release!

ericraymond avatar Apr 02 '21 05:04 ericraymond

Is this part of 5.2?

ericraymond avatar Aug 11 '21 19:08 ericraymond

Commenting to show interest! For now, I've worked around this by manually adjusting the column type in my migration files, but would really love to see it implemented!

alexpotv avatar Aug 19 '24 23:08 alexpotv

Any progress? I am trying out an SQL backend (MariaDB), but most tasks now fail due to this storage size limit.

mvhconsult avatar Jan 31 '25 11:01 mvhconsult

I will revisit this after the Celery 5.5 release

auvipy avatar Feb 01 '25 07:02 auvipy