MySQL result backend defaults to BLOB which may be too short for some results (LONGBLOB)
I was running into errors caused by loading truncated pickled results. Specifically, I would get an EOFError (see below). I eventually figured out that the result column of the celery_taskmeta MySQL table was not able to hold all of the pickled result data. By altering the table to use LONGBLOB instead of BLOB, I was able to deal with the large results (alter table celery_taskmeta modify result LONGBLOB). Is there a way to make this the default type in the future?
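For reference, a minimal sketch of that workaround applied through SQLAlchemy (the connection URL is a placeholder; point it at the database used as the Celery result backend):

```python
# Sketch of the workaround described above: widen the result column to LONGBLOB.
from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:password@localhost/celerydb")  # placeholder URL

with engine.begin() as conn:
    conn.execute(text("ALTER TABLE celery_taskmeta MODIFY result LONGBLOB"))
```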
[2011-08-31 13:34:13,605: ERROR/PoolWorker-1]
Traceback (most recent call last):
  File "/home/ubuntu/w/apis/trunk/apis/celerytasks/tasks/file_task_base.py", line 293, in process
    self.process_records()
  File "/home/ubuntu/w/apis/trunk/apis/celerytasks/tasks/proxmatch_file.py", line 381, in process_records
    self.wait_results(async_result_buffer)
  File "/home/ubuntu/w/apis/trunk/apis/celerytasks/tasks/proxmatch_file.py", line 435, in wait_results
    async_result.wait()
  File "/usr/local/lib/python2.6/dist-packages/celery/result.py", line 87, in wait
    return self.get(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/celery/result.py", line 83, in get
    interval=interval)
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/base.py", line 110, in wait_for
    status = self.get_status(task_id)
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/base.py", line 206, in get_status
    return self.get_task_meta(task_id)["status"]
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/base.py", line 227, in get_task_meta
    meta = self._get_task_meta_for(task_id)
  File "/usr/local/lib/python2.6/dist-packages/celery/backends/database.py", line 61, in _get_task_meta_for
    task = session.query(Task).filter(Task.task_id == task_id).first()
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1770, in first
    ret = list(self[0:1])
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1667, in __getitem__
    return list(res)
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/query.py", line 1960, in instances
    rows = [process[0](row, None) for row in fetch]
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/mapper.py", line 2600, in _instance
    populate_state(state, dict_, row, isnew, only_load_props)
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/mapper.py", line 2454, in populate_state
    populator(state, dict_, row)
  File "/usr/local/lib/python2.6/dist-packages/sqlalchemy/orm/strategies.py", line 150, in new_execute
    dict_[key] = row[col]
  File "/usr/local/lib/python2.6/dist-packages/celery/db/a805d4bd.py", line 55, in process
    return loads(value)
EOFError
I think we'd prioritise small results first, as this is the most common use case, but if a BLOB is not big enough to hold "reasonably" sized results, then LONGBLOB would be a good default.
Could we find out the current limit for BLOBs in MySQL? What are the space implications for smaller results?
The MySQL manual explains why no error was raised when a pickled value longer than the BLOB column's maximum length was inserted:
http://dev.mysql.com/doc/refman/5.5/en/char.html
If strict SQL mode is not enabled and you assign a value to a CHAR or VARCHAR column that exceeds the column's maximum length, the value is truncated to fit and a warning is generated. For truncation of nonspace characters, you can cause an error to occur (rather than a warning) and suppress insertion of the value by using strict SQL mode. See Section 5.1.6, “Server SQL Modes”.
This page has the allowed dimensions:
http://dev.mysql.com/doc/refman/5.5/en/string-type-overview.html
- BLOB: 65,535 (2**16 - 1) bytes
- MEDIUMBLOB: 16,777,215 (2**24 - 1) bytes
- LONGBLOB: 4,294,967,295 or 4GB (2**32 - 1) bytes
I don't think there is much overhead in switching between these types since they only take up the space one actually uses plus a few bytes to track the length. So I wouldn't expect smaller results to be impacted.
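For what it's worth, a quick sketch of how one might check whether a given result would even fit in the default 65,535-byte BLOB column (the sample result below is made up):

```python
# Would this task's pickled result fit in a plain BLOB column,
# or would MySQL silently truncate it (without strict mode)?
import pickle

BLOB_MAX = 2**16 - 1        # BLOB
MEDIUMBLOB_MAX = 2**24 - 1  # MEDIUMBLOB
LONGBLOB_MAX = 2**32 - 1    # LONGBLOB

result = {"rows": list(range(50000))}   # stand-in for a large task result
payload = pickle.dumps(result)

print(len(payload), "bytes")
if len(payload) > BLOB_MAX:
    print("would not fit in a plain BLOB column")
```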
Yeah, I know about strict mode; the fact that it's not enabled by default is kind of unique to MySQL, and often criticised. It's one of the first things I enable on new MySQL installs.
16 bits is very small indeed; we should definitely increase that.
It seems that a BLOB in Oracle is 2**32 by default: http://ss64.com/ora/syntax-datatypes.html
In PostgreSQL it seems that the bytea type is preferred, and that it is of unlimited size; maybe BLOB maps to this field directly. I couldn't find any information about whether any of the SQL standards describe dimensions for the BLOB field type.
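For reference, a minimal sketch of inspecting what the generic PickleType column compiles to per dialect (table and column names here are made up):

```python
# See what SQLAlchemy's generic PickleType (a LargeBinary under the hood)
# compiles to on different dialects.
from sqlalchemy import Column, Integer, MetaData, PickleType, Table
from sqlalchemy.dialects import mysql, postgresql
from sqlalchemy.schema import CreateTable

meta = MetaData()
demo = Table(
    "taskmeta_demo", meta,
    Column("id", Integer, primary_key=True),
    Column("result", PickleType()),
)

print(CreateTable(demo).compile(dialect=mysql.dialect()))       # result -> BLOB
print(CreateTable(demo).compile(dialect=postgresql.dialect()))  # result -> BYTEA
```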
django-celery uses django-picklefield, and pickled data is stored in a text field there: https://github.com/shrubberysoft/django-picklefield/blob/master/src/picklefield/fields.py
I checked django.db.backends.mysql.creation, and TextField uses LONGTEXT on MySQL.
picklefield also uuencodes the data though, and there is an issue open for the SQLAlchemy backend to use the same method so that the backends can be interchangeable: #183
I think maybe we should resolve issue #183 at the same time, so we don't have to change the types that often.
That would mean changing the BLOB to a LONGTEXT.
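A sketch of what that text-based approach could look like, using base64 here for illustration (helper names are made up; this is not what the backend currently does):

```python
# Serialize to pickle, then encode as text so a Text/LONGTEXT column can hold it.
import base64
import pickle


def encode_result(obj):
    """Pickle and base64-encode a result for storage in a text column."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_result(stored):
    """Reverse of encode_result."""
    return pickle.loads(base64.b64decode(stored))
```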
On further inspection, this is behavior coming from SQLAlchemy, where sqlalchemy.types.PickleType is a LargeBinary field that resolves to a BLOB when using MySQL. Sadly there is no generic data type in SQLAlchemy larger than LargeBinary, so if we used a MySQL-specific LONGBLOB it would not be compatible with all the other supported databases.
I think SQLAlchemy must be lobbied to change the default large binary type for MySQL, or people depending on large results must alter the table :(
Was there a resolution on this before it was closed?
@ericraymond the database result backend still uses PickleType, which maps to a LargeBinary field in SQLAlchemy as far as I can tell. I suggest that you review the previous comments and if you can provide or propose a solution please feel free to open a new issue.
@georgepsarakis I see your point.
Celery could use the MySQL LONGBLOB type directly, but that would break the abstraction Celery gets from using SQLAlchemy across multiple database dialects. The other alternatives are for SQLAlchemy to change the MySQL dialect mapping, or to create a new (LONGBLOB) MySQL dialect.
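For reference, SQLAlchemy's with_variant() might offer a middle ground: keep the generic LargeBinary for other databases and emit LONGBLOB only on the MySQL dialect. A minimal sketch (not Celery's actual code; the table name is made up):

```python
from sqlalchemy import Column, Integer, LargeBinary, MetaData, Table
from sqlalchemy.dialects import mysql

# Generic type everywhere, LONGBLOB only when the MySQL dialect is in use.
result_type = LargeBinary().with_variant(mysql.LONGBLOB(), "mysql")

meta = MetaData()
taskmeta_demo = Table(
    "taskmeta_demo", meta,
    Column("id", Integer, primary_key=True),
    Column("result", result_type),
)
# Non-MySQL dialects still see the plain LargeBinary mapping.
```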
@ericraymond I think I came up with a possible solution. We can redefine the field as LargeBinary and handle the pickling part separately (before storing the result). From the documentation, you can pass a length parameter to the field definition, which MySQL will then map to the corresponding type (MEDIUMBLOB, LONGBLOB) that can contain binary strings of this length:
-- Set the maximum length to 4GB
mysql> CREATE TABLE results(serialized_result BLOB(4294967295));
Query OK, 0 rows affected (0.02 sec)
mysql> SHOW CREATE TABLE results\G
*************************** 1. row ***************************
Table: results
Create Table: CREATE TABLE `results` (
`serialized_result` longblob
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
What do you think?
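A rough sketch of how that proposal could look on the SQLAlchemy side (assumed names, not the final implementation): declare the column as LargeBinary with an explicit ~4GB length, and pickle the result explicitly before writing it.

```python
import pickle

from sqlalchemy import Column, Integer, LargeBinary, MetaData, Table
from sqlalchemy.dialects import mysql
from sqlalchemy.schema import CreateTable

meta = MetaData()
results_demo = Table(
    "results_demo", meta,
    Column("id", Integer, primary_key=True),
    # The MySQL dialect should render this as BLOB(4294967295),
    # which the server stores as LONGBLOB (as shown above).
    Column("serialized_result", LargeBinary(length=2**32 - 1)),
)

print(CreateTable(results_demo).compile(dialect=mysql.dialect()))


def serialize_result(result):
    # Pickling handled outside the column type, as proposed above.
    return pickle.dumps(result)
```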
Looks promising! Nice.
@auvipy Still working on this? If not I'd be happy to take a look.
Please feel free.
I am making progress here, but just wanted to let folks know that moving from BLOB to LONGBLOB storage in MySQL will increase the storage overhead per row from 2 to 4 bytes.
@remeika I can't imagine this being an issue (2 extra bytes for a row of multiple KB/MB). If you would like some guidance, you can open a draft Pull Request and we can discuss the implementation there.
Any progress on this? Can it be planned for a future release?
Hoping this is part of the 5.1 release!
Is this part of 5.2?
Commenting to show interest! For now, I've worked around this by manually adjusting the column type in my migration files, but would really love to see it implemented!
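For anyone in the same situation, a sketch of what such a migration might look like with Alembic (assuming Alembic is used; the default celery_taskmeta table and result column names are assumed here):

```python
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import mysql


def upgrade():
    # Widen the result column on MySQL so large pickled results are not truncated.
    op.alter_column(
        "celery_taskmeta",
        "result",
        existing_type=sa.LargeBinary(),
        type_=mysql.LONGBLOB(),
        existing_nullable=True,
    )


def downgrade():
    op.alter_column(
        "celery_taskmeta",
        "result",
        existing_type=mysql.LONGBLOB(),
        type_=sa.LargeBinary(),
        existing_nullable=True,
    )
```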
Any progress? I am trying out an SQL backend (MariaDB), but most tasks now fail due to this storage limit.
I will revisit this after the Celery 5.5 release.