
Support large args

Open jakajancar opened this issue 6 months ago • 11 comments

Closes #1313

This PR:

  • Limits the rendered argument length to 100 characters in logs and in the Django Admin (see the sketch below)
  • Adds a full-page args view to Django Admin
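For illustration, the length limit amounts to something like the following sketch (the helper name and the ellipsis handling are mine, not necessarily the PR's exact code):

```python
MAX_LENGTH = 100  # the limit used in this PR; the rest is illustrative

def shorten(rendered: str, limit: int = MAX_LENGTH) -> str:
    # Bound what logs and the admin list view render, however large the JSON.
    if len(rendered) <= limit:
        return rendered
    return rendered[:limit] + "…"
```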

Before:

[Screenshots (2025-05-08): Django Admin list and detail views before the change]
  • list view: the status and summary columns are pushed off-screen
  • list view: the row can get very tall, depending on the JSON passed to the argument
  • list view: if you have a 10MB arg that will be a very large HTML page
  • detail view: the title can get arbitrarily large

After:

[Screenshots (2025-05-08): Django Admin list and detail views, and the new full-page args view, after the change]

@ewjoachim This is slightly different from your original suggestion in #1313. I noticed that Job does not currently reference App, and it seemed ugly to make it aware of App just for logging customization. Do we really need this to be customizable? Happy to still do it if you want.

Successful PR Checklist:

  • [x] Tests
    • [x] Not really tested for now; happy to add tests if we think something here is "public API"
  • [x] Documentation
    • [x] The exact syntax of the log name is undocumented for now, and I think this is best: structured logs should be used, not parsing of call_string.

jakajancar avatar May 08 '25 16:05 jakajancar

Coverage report

Click to see where and how coverage changed
| File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing |
| --- | --- | --- | --- | --- | --- |
| procrastinate/jobs.py | | | | | |
| procrastinate/utils.py | | | | | 343 |
| Project Total | | | | | |

This report was generated by python-coverage-comment-action

github-actions[bot] avatar May 08 '25 18:05 github-actions[bot]

@ewjoachim I tried to add a test checking that the log context does not grow without bound with args size, and I'm noticing some things:

  • First, types.JSONValue is kinda misleading: it has dict[str, t.Any] and list[t.Any], not dict[str, JSONValue] and list[JSONValue]. Not sure if it's intentional that JSON-ness is only guaranteed at the first level, or a limitation of the type system?
  • log_context returns (as its type indicates) non-JSON data. Is this expected?
  • Either way, it's not enough that we only remove the mega-arg from call_string; we must also do something with it in the log_context. What are your thoughts?

jakajancar avatar May 08 '25 18:05 jakajancar

> First, types.JSONValue is kinda misleading: it has dict[str, t.Any] and list[t.Any], not dict[str, JSONValue] and list[JSONValue]. Not sure if it's intentional that JSON-ness is only guaranteed at the first level, or a limitation of the type system?

At the time this was written, recursive types weren't supported. See https://github.com/python/typing/issues/182

But that's no longer the case, so feel free to fix the types (preferably in another PR; I suspect the CI will complain here and there).
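For reference, a properly recursive definition could now look roughly like this (a sketch, not procrastinate's actual code):

```python
from typing import Union

# Recursive JSON types, expressible now that type checkers support
# recursive aliases (see the typing issue linked above).
JSONValue = Union[None, bool, int, float, str, list["JSONValue"], dict[str, "JSONValue"]]
JSONDict = dict[str, "JSONValue"]
```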

> log_context returns (as its type indicates) non-JSON data. Is this expected?

You mean scheduled_at? No, you're probably right, the typing is probably wrong. I think it should be JSON, so we should probably isoformat() that datetime. Or do you have something else in mind?
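For instance, something like this sketch (the field names are illustrative, not the actual log_context code):

```python
import datetime

def json_safe_context(job_id: int, scheduled_at: datetime.datetime | None) -> dict:
    # isoformat() the datetime so the log context stays JSON-safe.
    return {
        "id": job_id,
        "scheduled_at": scheduled_at.isoformat() if scheduled_at else None,
    }
```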

Either way, it's not enough that we only remove the mega-arg from call_string, must also do something with it in the log_context. What are your thoughts?

That's something that comes up a lot: I'm a maintainer of procrastinate (and the developer of a large part of it), but it's been ages since I was last a user of Procrastinate in any real-life setting, so the points users like you bring up are usually my best bet for understanding and staying aware of pain points. All in all, the point you bring up seems very reasonable, and we should probably be mindful of the size of the logs.

That said, I think it would make sense for the complete args of each task to be logged at least once when it's deferred and once when it's executed (probably upon completion, or at the start of the task). I'm OK, though, if that's only in DEBUG logs.

Maybe something like: if we detect that the info logs we emit are truncated, we also emit a debug log with the full value (once when deferred, once when executed). I'm still weighing whether it's more useful to log before completion (so that if the computation crashes the whole process, we can figure out exactly what the task was executing) or after completion (so we have the associated result). I think before?
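A sketch of that idea (the logger name and threshold are illustrative, not real settings):

```python
import logging

logger = logging.getLogger("procrastinate.jobs")  # illustrative name

MAX_ARG_LENGTH = 100  # illustrative threshold

def log_deferred(call_string: str) -> None:
    # Emit a truncated INFO log; when truncation happened, also emit the
    # full value at DEBUG level so it can be recovered if needed.
    if len(call_string) > MAX_ARG_LENGTH:
        logger.info("Deferred job: %s…", call_string[:MAX_ARG_LENGTH])
        logger.debug("Deferred job (full): %s", call_string)
    else:
        logger.info("Deferred job: %s", call_string)
```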

ewjoachim avatar May 08 '25 19:05 ewjoachim

> That's something that comes up a lot: I'm a maintainer of procrastinate (and the developer of a large part of it), but it's been ages since I was last a user of Procrastinate in any real-life setting

Yikes 😅

>> log_context returns (as its type indicates) non-JSON data. Is this expected?

> You mean scheduled_at? No, you're probably right, the typing is probably wrong. I think it should be JSON, so we should probably isoformat() that datetime. Or do you have something else in mind?

test_nominal demonstrates that in task_kwargs you can get arbitrary objects (in this case Param).

At some point arbitrary types need to get converted into JSON for storage in the database. I guess it's unclear to me if task_kwargs is supposed to contain the original types or JSON-ified types.

  • If original:
    • The type should not be JSONDict (in the recursive-type sense) but dict[str, Any].
    • call_string needs to use repr(value).
    • The log context can either:
      • do nothing, which means values cannot be limited in length (and note that by default structlog will just drop entire kwargs it considers unloggable), or
      • contain repr(value), or
      • have the conversion to JSON happen when creating the context.
  • If JSON:
    • The type can remain JSONDict.
    • call_string and the log context have some flexibility in how to serialize or (potentially even recursively) ellipsize the data (both branches are sketched below).
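To make the two branches concrete, here is a sketch (both function names are mine, not procrastinate's API):

```python
import json
from typing import Any

def call_string_repr(task_name: str, kwargs: dict[str, Any]) -> str:
    # "If original": values may be arbitrary Python objects, so repr() is
    # the only rendering that is always safe.
    args = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
    return f"{task_name}({args})"

def call_string_json(task_name: str, kwargs: dict[str, Any]) -> str:
    # "If JSON": values are guaranteed JSON-safe, so they can be serialized
    # (and later ellipsized) consistently.
    args = ", ".join(f"{k}={json.dumps(v)}" for k, v in kwargs.items())
    return f"{task_name}({args})"
```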

jakajancar avatar May 08 '25 21:05 jakajancar

This pretty much perfectly captures my opinion of 2025 so far :)

https://github.com/user-attachments/assets/cd354f60-2c78-4e70-88bd-9d16fe0a82c7

Joking aside, let me know if you need any help.

ewjoachim avatar May 10 '25 11:05 ewjoachim

@ewjoachim Yes, now you understand the problem this PR is solving :)

Tbh I have kind of given up on both approaches:

  • an automatic limit for all args, because of the mistyped/unclearly defined types described above,
  • unlogged_args, since it's impossible to get the Task from the Job.

I will most likely go with an extra table to hold the large data and only reference its PK in the task args, then delete the row at the end of the job.
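Roughly like this sketch (not runnable as-is: the model and task names are made up, `app` is assumed to be a configured procrastinate App, and `handle` stands in for the real work):

```python
from django.db import models

class LargeJobPayload(models.Model):
    # Side table holding the large data; only the PK travels in the job args.
    data = models.JSONField()

@app.task  # `app` is the procrastinate App; names here are hypothetical
def process_payload(payload_id: int):
    payload = LargeJobPayload.objects.get(pk=payload_id)
    try:
        handle(payload.data)  # stands in for the real work
    finally:
        payload.delete()  # delete the row at the end of the job
```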

jakajancar avatar May 10 '25 12:05 jakajancar

I understand you have to fix pressing issues, and I respect that.

That said, I still think it makes sense, either for you or someone else, in due time, to continue working on this, so that at some point it stops being an issue for other people?

I'm not saying you have to, but first and foremost thank you for trying, and then: maybe someone will want to pick this work up where it was left :) (or maybe you'll want to, though no pressure)

ewjoachim avatar May 10 '25 12:05 ewjoachim

OK so: I've started working on this to get a better handle on the remaining issues:

  • Recursive JSON has been implemented, though indeed, accurate types mean we'll have to change the type of task kwargs to say it might be any object. That means any time we need to JSON-encode it, we'll need access to the json_dumps function that was provided to the connector.
  • There are a few shenanigans to bridge the gap between Job and Task. The reason is that when we defer a job, or use the shell, we may not have the code of the task; it's perfectly possible that the worker is the only part of the system that knows the details of a task. That's not necessarily an issue (we can do it on a best-effort basis), but it's why there's no direct link between Job and Task.

What smells bad to me is that if we json.dumps all args so that we can truncate them to length, then arguments that were JSON-encoded into a string (so mostly strings, but possibly other things too, depending on the json_dumps function) become '"value"' with two sets of quotes, and that looks bad.

I could check whether, after json_dumps, the string starts and ends with " and strip those, but I'm not a huge fan of that solution. I fear there's a bad design choice here that's going to backfire at some point.

Also, though that's kind of logical given the choices so far: if we truncate objects, then in the logs all objects become strings, so instead of logging {"task_kwargs": {"a": ["b"]}} we'll log {"task_kwargs": {"a": "[\"b\"]"}}, and that's much less readable.
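In miniature, with the plain json module:

```python
import json

# Both readability problems above, demonstrated:
print(json.dumps("value"))       # "value"        <- two sets of quotes
print({"a": json.dumps(["b"])})  # {'a': '["b"]'} <- the list became a string
```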

All in all, there's at least a decision to make:

  • A/ Do nothing, and let this length problem be
  • B/ Find something smart to do about it, which we haven't found yet
  • C/ Fix it in a consistent way that will make things uglier for all users
  • D/ Fix it so that we don't change anything for users with only short parameters, and accept that it will sometimes be inconsistent, at least by default (e.g. if a user has a job whose arguments are sometimes shorter, sometimes longer than the default threshold). For those users, offer customization hooks

I vote for B, so I welcome smartness from everyone around :D

ewjoachim avatar May 11 '25 08:05 ewjoachim

> • A/ Do nothing, and let this length problem be
> • B/ Find something smart to do about it, which we haven't found yet
> • C/ Fix it in a consistent way that will make things uglier for all users
> • D/ Fix it so that we don't change anything for users with only short parameters, and accept that it will sometimes be inconsistent, at least by default (e.g. if a user has a job whose arguments are sometimes shorter, sometimes longer than the default threshold). For those users, offer customization hooks

I don't like D) because of the inconsistencies. For A) it seems we are already too deep in it 😉 ... I would say it is the last resort (and we are not there yet). Regarding C), I do not care that much about the ugliness of log outputs: logs always look a bit ugly to me, and it's more important that the information I need is there. That said, I would still prefer that large args are (somehow configurably) shortened. As for B), it is always good to come up with a better solution 😉 (but sometimes there isn't one).

So, to be more concrete, what would I prefer:

  • By default, shorten large args in the admin and in the log output.
  • Provide a way in the admin interface to view the full args.
  • Add some global configuration or environment variable to show the full args in the log output (turned off by default).
  • Use the same json_dumps encoder set on the connector to render the args in the admin and the log output.
  • If no special encoder is used (which applies to most users, I guess), we can try to prettify the output somehow, recursively going through all values (see the sketch below).
  • If a custom encoder is set, just shorten the resulting args string, as we can't know what the resulting format is (this must then be documented).
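The recursive prettifying could look something like this sketch (the name and the ellipsis character are mine):

```python
def ellipsize(value, limit=100):
    # Recursively shorten JSON-compatible values while keeping their
    # structure, instead of dumping everything into one flat string.
    if isinstance(value, str):
        return value if len(value) <= limit else value[:limit] + "…"
    if isinstance(value, list):
        return [ellipsize(v, limit) for v in value]
    if isinstance(value, dict):
        return {k: ellipsize(v, limit) for k, v in value.items()}
    return value  # numbers, bools, None pass through unchanged
```

Note this bounds only string lengths, not collection sizes, which is exactly the concern raised just below.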

Not sure if this all makes sense, as you are much deeper into it (never used or touched the custom encoder stuff).

medihack avatar May 11 '25 11:05 medihack

> If a custom encoder is set, just shorten the resulting args string, as we can't know what the resulting format is (this must then be documented).

I'm not sure this will be helpful if you have a list with 10000 short strings :(

ewjoachim avatar May 11 '25 14:05 ewjoachim

> For A) it seems we are already too deep in it 😉

🤣

Ignoring the details of the shortening and the presentation of the admin page for a second, what would make the most sense to me is to convert the kwargs to JSON early, so that job.task_kwargs, and consequently call_string, log_context, and anything else, can rely on that invariant. This makes things conceptually clean to me, as job.task_kwargs will always contain the same types, both when initially created and when round-tripped through the database.
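Concretely, something like this sketch (the hook point and the function name are hypothetical; the encoder/decoder default to the stdlib's):

```python
import json

def normalize_task_kwargs(task_kwargs, json_dumps=json.dumps, json_loads=json.loads):
    # Round-trip the kwargs through the connector's encoder at defer time,
    # so job.task_kwargs always holds plain JSON types, whether freshly
    # created or loaded back from the database.
    return json_loads(json_dumps(task_kwargs))
```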

The lack of conceptual clarity and the amount of heart surgery were too much for me to undertake, but if you decide to make this change, I'm happy to pick up the trimming itself again.

jakajancar avatar May 12 '25 00:05 jakajancar