implement prototype job-db module
Problem: the `job-list` module's memory cache grows without bound, and it tracks KVS content which we need to start pruning. The `job-archive` module is mainly structured to feed data to flux-accounting and offers no on-line query support.

We want to answer the question: is it feasible to develop a new module that answers `job-list` queries exactly as the `job-list` module does now, but instead of only caching jobs in memory, retains inactive job data in a persistent database?
Some notes from our discussions:
- It could be called `job-db`
- It could register the `job-list` service so everything just works if it is loaded in place of `job-list` (even `job-archive` as is)
- It could consume the job manager journal and maintain an in-memory cache of active jobs
- Upon job transition to INACTIVE state, job data (now immutable) could be added to a sqlite database and dropped from the memory cache.
- Queries would seamlessly access the cache as well as the database
- Sqlite JSON capability may be of use for answering queries against R and jobspec without needing to break fields out to separate rows (see #4234 and the sketch below)
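To make the last point concrete, here is a minimal sketch of the kind of query sqlite's JSON support would allow. The table, column names, and jobspec path are hypothetical, just to illustrate querying jobspec without breaking fields out into separate columns:

```c
/* Hypothetical sketch: query a JSON jobspec column directly with sqlite's
 * json_extract() rather than breaking fields out into separate columns.
 * Table/column names and the jobspec path are illustrative only.
 */
#include <sqlite3.h>

int count_jobs_by_queue (sqlite3 *db, const char *queue)
{
    const char *sql =
        "SELECT COUNT(*) FROM jobs"
        "  WHERE json_extract(jobspec, '$.attributes.system.queue') = ?";
    sqlite3_stmt *stmt;
    int count = -1;

    if (sqlite3_prepare_v2 (db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    sqlite3_bind_text (stmt, 1, queue, -1, SQLITE_STATIC);
    if (sqlite3_step (stmt) == SQLITE_ROW)
        count = sqlite3_column_int (stmt, 0);
    sqlite3_finalize (stmt);
    return count;
}
```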
The goal of prototyping here would be to quickly get something functional that we can discuss.
Automated testing could be deferred until we decide whether to move forward.
some random thoughts, not necessarily for this prototype, but just thoughts for later (if this works out):
- currently in `job-archive` we archive eventlog, jobspec, R, jobid, userid, ranks, and timestamps.
- I believe userid, ranks, and timestamps can be retrieved via eventlog or R. We could make the decision to not bother to archive / store all "helper" / "convenience" fields for speed / space. This may be reasonable given #4234.
- we do not currently archive guest.exec.eventlog or stdio. We probably never want to archive stdio, but maybe guest.exec.eventlog?
- we don't currently archive annotations (update: and memos), which maybe we'd want to.
A suggestion would be to leave off `guest.exec.eventlog`, stdio, annotations, and memos for this experiment. We can always add them later.
Also, a simplifying assumption may be that (unlike `job-list`) no KVS scan is needed on restart. I think we can make that work by adding some way to flush the job manager journal to consumers and end with an ENODATA during the broker CLEANUP state, before rc3 runs.
I hit a snag during my prototyping today and thought I'd do a thought dump before moving forward:
- the major snag: the job-list/job-db module can be sent the journal event for "job inactive" (eventlog "clean" event) BEFORE the eventlog "clean" event is written to the KVS. So there's a race condition here, which could lead to further complications down the road.
- e.g. I thought if I implemented an "inactive cache list", that might solve the problem, i.e. don't dump to the archive until later, once you know all the data is in the KVS. But if there was a system crash, we might lose inactive data b/c that data hasn't been flushed to the archive yet. But we gotta make sure data is flushed to the archive before we can remove data from the KVS, yadda yadda yadda.
- EDIT: It occurred to me that this race probably exists within the `job-archive` module right now. Just given the "period" and what not with `job-archive`, the probability of hitting it is extremely low. This is in contrast to the `job-db` prototype, where we get the job eventlog immediately after getting the journal entry saying we're inactive.
notes on technical things to resolve
- will have to redo how we do job stats; probably not hard. Anything related to the inactive job list probably has to be converted into a sql query (could be complicated with the inactive cache list given above). See the sketch after this list.
- currently the job-archive errors out if no statedir is specified. How do we deal with this in a `job-db` module? Obviously we don't want to error out and have no job-listing?? TBD.
- it is quite annoying to "re-parse" eventlog / R / jobspec to "re-get" all job information for job-list. Would like to consider putting way more into the archive db for easier retrieval of data. Wish we had a nosql database where we could just add columns as needed.
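For the job stats point above, here is a rough illustration of the kind of SQL an inactive-job stat might turn into. It assumes a `jobs` table with a `jobdata` JSON blob containing a `result` field and a sqlite built with JSON support; none of that is the prototype's actual schema.

```c
/* Illustrative only: count inactive jobs grouped by result, assuming a
 * hypothetical "result" field inside a jobdata JSON blob.
 */
const char *sql_stats =
    "SELECT json_extract(jobdata, '$.result') AS result, COUNT(*)"
    "  FROM jobs"
    "  GROUP BY result";
```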
other thoughts / notes:
- not 100% convinced we gotta make a new `job-db` module. The archive could possibly be added directly into `job-list`.
- maybe we can nix some `job-list` stuff as a result (idsync? list-inactive?) TBD
- memos / annotations: I think we discussed consolidation of this at some point in the past but I'm forgetting the conversations. May be better to deal with this first.
overall:
- although I haven't implemented something close to a working prototype, I think this plan is doable. It's just going to take more work than perhaps we originally realized.
- as I was pondering solutions, one idea: would an "archive coordinator" module make sense? i.e. it would monitor what jobs are in the archive, tell job-list to flush cache entries it knows are in the archive, and clean up KVS inactive entries it knows are in the archive. Not sure if it's a good idea, just a thought.
currently the job-archive errors out if no statedir is specified. how to deal with this with a job-db module? obviously don't want to error out and have no job-listing?? TBD.
I would have to ponder some of the other questions more. However, on this specific issue, sqlite has an in-memory db mode which could be used if there is no statedir or if we want to default to no backing file.
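A minimal sketch of that fallback is below; the `job-db.sqlite` filename is made up, but sqlite's special `:memory:` filename is the real no-backing-file mode:

```c
/* Sketch: open an on-disk database when statedir is configured, otherwise
 * fall back to sqlite's special ":memory:" database (no backing file).
 * The "job-db.sqlite" filename is illustrative only.
 */
#include <sqlite3.h>
#include <stdio.h>

int open_job_db (const char *statedir, sqlite3 **dbp)
{
    char path[4096];

    if (statedir)
        snprintf (path, sizeof (path), "%s/job-db.sqlite", statedir);

    return sqlite3_open (statedir ? path : ":memory:", dbp);
}
```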
Wish we had a nosql database we could just add columns as needed.
I'm definitely not an expert, but when researching JSON columns a lot of SO posts were describing how they use these columns as a cheap nosql database, or later add new columns/indexes on demand when necessary. Might want to do more research on that.
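For example, a hedged sketch of adding an index over a JSON field after the fact, without migrating the table to a new column (requires a reasonably recent sqlite built with JSON support; table and field names are illustrative):

```c
/* Sketch: index an individual JSON field on demand instead of adding a
 * dedicated column.  Column/field names are illustrative only.
 */
const char *sql_index =
    "CREATE INDEX IF NOT EXISTS idx_jobs_userid"
    "  ON jobs (json_extract(jobdata, '$.userid'))";
```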
the major snag: the job-list/job-db module can be sent the journal event for "job inactive" (eventlog "clean" event) BEFORE the eventlog "clean" event is written to the KVS. So there's a race condition here, which could lead to further complications down the road.
As an experiment, instead of reading the eventlog from the KVS I simply rebuilt the eventlog from all the journal entries sent to me. It was uglier than expected. But might do the trick.
A better solution might be to re-work the journaling to only send journal updates after eventlogs are written to the KVS. That is the far trickier option at the moment; that chunk of code in the job-manager is quite tricky. See https://github.com/flux-framework/flux-core/issues/3931
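For reference, a rough sketch of the rebuild-from-journal idea, assuming the caller hands us each parsed RFC 18 event object as it arrives via the journal (illustrative, not the prototype's actual code):

```c
/* Sketch: append each RFC 18 event object received via the journal to a
 * newline-delimited string, as it would have appeared in the KVS eventlog.
 * Error handling is minimal.
 */
#include <jansson.h>
#include <stdlib.h>
#include <string.h>

int eventlog_append (char **eventlog, json_t *event)
{
    char *entry = json_dumps (event, JSON_COMPACT);
    size_t oldlen = *eventlog ? strlen (*eventlog) : 0;
    char *new;

    if (!entry)
        return -1;
    if (!(new = realloc (*eventlog, oldlen + strlen (entry) + 2))) {
        free (entry);
        return -1;
    }
    new[oldlen] = '\0';          /* start appending at the old end */
    strcat (new, entry);
    strcat (new, "\n");
    *eventlog = new;
    free (entry);
    return 0;
}
```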
I was going to suggest just reconstructing the event log from the journal. What did the problem turn out to be with that?
What did the problem turn out to be with that?
Mostly just hitting corner cases :-) It'd be easier if I could just grab the eventlog, vs having to re-construct it ... and via the job-list restart.
What if instead of trying to make sure the journal events are delayed until the associated kvs commit is complete, we added an entry to the journal for jobs after the final entry is committed to the eventlog? Actually, isn't this already implied by the job-state events that are sent out? Could you use that to trigger a kvs lookup of all job data to be archived?
Edit: I mean if we ever want to store the exec eventlog or output eventlogs to allow this data to be queryable via the database, we'll have to get access to things other than the main eventlog anyway...
What if instead of trying to make sure the journal events are delayed until the associated kvs commit is complete, we added an entry to the journal for jobs after the final entry is committed to the eventlog? Actually, isn't this already implied by the job-state events that are sent out? Could you use that to trigger a kvs lookup of all job data to be archived?
I was thinking about that as an option too, although I thought of writing it out to the eventlog. Just having some "flush" event (or something) in the journal only makes more sense.
I didn't pursue it for this prototype as I couldn't quite figure out how to easily add it in the job-manager events stuff.
This prototype I'm doing is also all spaghetti code, so it might not be as bad if I properly refactored things vs just cutting and pasting things lazily.
It'd be easier if I could just grab the eventlog, vs having to re-construct it ... and via the job-list restart.
I didn't quite understand what you mean by via the job-list restart. I was presuming job-db would not need to scan the KVS on restart since it already has all the inactive jobs so far...
I didn't quite understand what you mean by via the job-list restart. I was presuming job-db would not need to scan the KVS on restart since it already has all the inactive jobs so far...
oh haha, yeah, you're right. I just haven't gotten to the restart stuff yet; I hit bugs and went along trying to fix them as I hit them :-)
It'll probably end up being less annoying than I originally anticipated.
I finally got an experimental branch that completes tests in sharness (note completes, not passes :-). This prototype is garbage of course: tons of memleaks, tons of lazy cut & paste, no refactoring. A nice chunk of tests don't pass as there are some complications or bugs I just decided I didn't want to solve or dig into for the time being.
https://github.com/chu11/flux-core/tree/issue4273_job_db_prototype
some random notes
- `job-list` needs some refactoring, as the code architecturally struggles with doing what `job-db` wants to do. For example, in this prototype, I actually put inactive jobs on the `inactive` list, then take them off later when I write to the archive. How to refactor, TBD.
- I still need to scan the KVS on restart b/c `job-list` builds up its internal list of pending / running jobs from this vs. the journal. Shouldn't be a big deal longer term when we start to remove inactive jobs from the KVS.
some thoughts:
- I like @grondo's suggestion that we do an "in memory" database when a `statedir` is not configured (or perhaps a flag indicating "archive to sqlite"). In this hacky prototype, I had to change a ton of tests to specify a `statedir`, which is annoying. In addition, I wasn't even sure how to set `statedir` in some cases, e.g. how to specify a unique statedir for each sub flux instance when using `--cc`.
- OR based on a config, we could just keep the current `inactive` list approach if archiving is not required. Although I think that will make things more complicated.
- I like @grondo's idea of perhaps storing tons of stuff in json and putting it into a JSON field within the archive (see the sketch below). That would make this semi-nosql and make parsing data out of the archive a lot easier. Although we could probably (and perhaps should) create more columns for the commonly accessed stuff.
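A sketch of what packing job data into a single JSON blob might look like with jansson; the field names here are placeholders, not a format proposal:

```c
/* Sketch: pack the fields job-list already tracks into a single JSON object
 * that could be stored in (and reloaded from) a JSON column.  Field names
 * are illustrative only.
 */
#include <flux/core.h>
#include <jansson.h>

char *jobdata_encode (flux_jobid_t id, uint32_t userid, int state,
                      const char *name, double t_inactive)
{
    json_t *o;
    char *s;

    if (!(o = json_pack ("{s:I s:i s:i s:s s:f}",
                         "id", (json_int_t)id,
                         "userid", (int)userid,
                         "state", state,
                         "name", name,
                         "t_inactive", t_inactive)))
        return NULL;
    s = json_dumps (o, JSON_COMPACT);
    json_decref (o);
    return s;
}
```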
Next steps:
In terms of "is this reasonable", I think it is. Although I'm struggling with next steps.
- some ideas:
  - Mixing `job-list` and `job-archive` into one is some of the struggle in this prototype. So one idea I had was that we could have a configurable "job archive" module / store configuration, similar to the broker's "backing store" module / configuration. One of the configurations could simply be an "in memory job archive" (e.g. it could just be a `zhashx` or similar).
    - This could be a reasonable "middle point", allowing us to keep `job-list` and `job-archive` separate from each other and move forward without insane changes.
    - Could lead to different "archive stores" in the future (sqlite, or other nosql options down the road).
    - it would be easy to have the "in memory job archive" be the default if none is configured by the user
    - We'd need some common "job data" format to share between the two, but I guess it could just be json.
    - job stats would be hard to do with this approach, not sure how to do it at the moment
  - The work is different enough and the changes significant enough that I'm borderline thinking a re-write would be better / easier (borderline, but I don't think that threshold has been crossed yet). `job-list` did grow over time from its humble beginnings. It wouldn't be ridiculous to think we should re-start simple and grow from there.
  - debating if we should revisit https://github.com/flux-framework/flux-core/pull/3611. We're going to want to have a "recently inactive" cache list for the common job listings (should coincide with a different default for max_entries to return). Pondering if it'd be easier / better to do this before / after `job-db`'s first implementation.
still need to scan the KVS on restart b/c job-list builds up its internal list of pending / running jobs from this vs. the journal.
If we load job-list in the rc1 script right after job-manager, then aren't we assured that it can capture all pending/running jobs from the journal? If not maybe we need to rethink how the journal works, since the job-manager would have just finished doing the exact same thing and has it all in memory.
I like @grondo's suggestion that we do a "in memory" database when a statedir is not configured (or perhaps a flag indicating "archive to sqlite"). In this hacky prototype, I had to change a ton of tests to specify a statedir, it's annoying. In addition, I wasn't even sure how to set statedir in some cases. e.g. how to specify unique statedir to each sub flux instance when using --cc.
If it's easier, you could also do what `content-sqlite` does and fall back to `rundir` if `statedir` is undefined. This doesn't persist after the instance is done, but it would use the same code paths as when `statedir` is defined.
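A minimal sketch of that fallback using the broker attribute lookup `flux_attr_get()`; the `job-db.sqlite` filename is made up:

```c
/* Sketch of the content-sqlite-style fallback: use the statedir broker
 * attribute if set, otherwise fall back to rundir (which does not persist
 * across instances).  The db filename is illustrative.
 */
#include <flux/core.h>
#include <stdio.h>

int job_db_path (flux_t *h, char *buf, size_t len)
{
    const char *dir = flux_attr_get (h, "statedir");

    if (!dir)
        dir = flux_attr_get (h, "rundir");
    if (!dir)
        return -1;
    return snprintf (buf, len, "%s/job-db.sqlite", dir) < (int)len ? 0 : -1;
}
```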
No further thoughts at the moment except a nagging unease that we need to be thinking more broadly about how the job modules interact and try to adjust the design to make everything work better together if the current design is failing us.
If we load job-list in the rc1 script right after job-manager, then aren't we assured that it can capture all pending/running jobs from the journal? If not maybe we need to rethink how the journal works, since the job-manager would have just finished doing the exact same thing and has it all in memory.
From my skimming of the code, we only send new events via the journal, not old ones. We could send all old events, which might make sense once we've started removing inactive jobs.
Ah gotcha. Right now all jobs are killed when an instance is restarted so we don't have that problem but we will. Good point. Let's see if we can work out how to let the job manager send everything over because then we can get rid of the sequence numbers and avoid duplicate effort.
Edit: I do have that on a branch somewhere, at least a prototype.
Ah gotcha. Right now all jobs are killed when an instance is restarted so we don't have that problem but we will. Good point. Let's see if we can work out how to let the job manager send everything over because then we can get rid of the sequence numbers and avoid duplicate effort.
Was thinking about this problem today, and one idea came to me. When `job-list` requests a stream of journal events from the `job-manager`, could the `job-manager` send all the events for all currently pending & running jobs?

There is a small race where a job could finish and go inactive between the time job-manager starts and job-list requests the journal. Perhaps we could just cache these until `job-list` connects up. All other inactive jobs could be assumed to be in the sqlite archive.
Began experimenting with the sqlite JSON type and with the new `job-list` "all" attribute; the second attempt at a prototype `job-db` went a lot better. Now all job data is stored in a single JSON blob and put into the archive. So when re-loading data from the archive, it's easy: just read one json blob back.

All tests pass except for a few related to the new `job-manager` purge inactive jobs functionality (purging in `job-list` doesn't make sense when all inactive jobs are in sqlite); I just didn't deal with it at the moment.
https://github.com/chu11/flux-core/tree/issue4273_job_db_prototype_archive_oneblob
interesting notes
- as I commented in #4234, sqlite's json support is optionally built in, so it's not clear what versions support it and what versions might not. So I think it may be best not to assume json support, which required the following table in my prototype:
```c
const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";
```
basically, we still need the `t_inactive` column so we can sort the jobs (via the sql query) that we return to the user (see the sketch after these notes).
- when `statedir` isn't available, I just store all archive data in memory, per the discussion above. Not sure if this is a wise plan long term or not.
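For illustration, a rough sketch of reloading archived jobs with the single-blob approach, sorted newest-first on `t_inactive`; the callback signature and the LIMIT default are made up:

```c
/* Sketch: reload archived jobs newest-first by sorting on the t_inactive
 * column, parsing the single jobdata blob for each row with jansson.
 * Error handling is abbreviated; callback signature is illustrative.
 */
#include <sqlite3.h>
#include <jansson.h>

int reload_jobs (sqlite3 *db, int (*cb)(json_t *jobdata, void *arg), void *arg)
{
    const char *sql =
        "SELECT jobdata FROM jobs ORDER BY t_inactive DESC LIMIT ?";
    sqlite3_stmt *stmt;

    if (sqlite3_prepare_v2 (db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    sqlite3_bind_int (stmt, 1, 1000);   /* illustrative default max_entries */
    while (sqlite3_step (stmt) == SQLITE_ROW) {
        const char *s = (const char *)sqlite3_column_text (stmt, 0);
        json_t *o = s ? json_loads (s, 0, NULL) : NULL;
        if (o) {
            cb (o, arg);
            json_decref (o);
        }
    }
    sqlite3_finalize (stmt);
    return 0;
}
```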
Got some much clearer refactorings going forward that will hopefully get us closer to where we want to be:
- support `states_mask` in `job-list`, the bitmask that holds all job states that have been reached
- there is inconsistency between some `job-list` values being returned with a "default" and some that aren't returned at all. It would make sense to make those consistent.
- there is inconsistency in how `job-list` internally stores data, e.g. `const char *name` might point to a value inside the jobspec, whereas `char *ranks` is `strdup`-ed. This became bothersome upon sqlite re-load of data and took me awhile to hunt down memory corruption. Not sure how to refactor at the moment.
- it also would be wise to refactor the `struct job` into its own file or something
- with the `job-list` `all` attribute, it might be worthwhile to add a `version` to the json object, in case of changes in the future (related to #2635); see the sketch below
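A tiny sketch of the version idea, assuming jansson and a hypothetical `version` key / constant:

```c
/* Sketch: stamp a format version into the job's json object before it is
 * archived, so the format can evolve later.  The key name and constant are
 * illustrative only.
 */
#include <jansson.h>

#define JOB_DATA_VERSION 1

int jobdata_set_version (json_t *o)
{
    return json_object_set_new (o, "version",
                                json_integer (JOB_DATA_VERSION));
}
```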
Update:
- and we'll also want to have an `inactive` cache. I'm thinking that keeping the list of inactive jobs, but simply kicking off entries when the list gets too long, would be reasonable to do (a sketch follows this list). It might be worthwhile to loosely reconsider what I did in #3611 for this. As an initial implementation, we can default the list length to "infinite", so as to maintain current behavior. Then once job-archive is integrated in, we can change the default to something else.
- note that this would be in contrast to an LRU cache, which I'm thinking will be the less common case. i.e. some users will want to look up old jobs once in a while, but there won't be a need to cache those recent lookups b/c they won't happen again (that often). The most recently run inactive jobs will be the common case b/c of use with `flux jobs`.
- Edit: Hmmm, actually, I think we could keep the current `purge` support. We simply store to the job-db/archive whenever an item is added to the inactive list. Would just have to add some smarts to look in both places (possibly).
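A sketch of the bounded inactive list, assuming czmq's `zlistx` (which flux-core already links against); `struct job`, the `archive_job()` hook, and the eviction details are placeholders:

```c
/* Sketch of the bounded inactive list: newest jobs are kept at the head,
 * and once the list grows past max_entries the oldest entry is detached
 * from the tail and written to the archive.
 */
#include <czmq.h>

struct job;   /* placeholder for job-list's internal job struct */

void inactive_list_add (zlistx_t *inactive, struct job *job,
                        size_t max_entries,
                        int (*archive_job)(struct job *job))
{
    zlistx_add_start (inactive, job);            /* newest at the head */
    while (zlistx_size (inactive) > max_entries) {
        zlistx_last (inactive);                  /* set cursor to oldest */
        struct job *oldest = zlistx_detach (inactive, NULL);
        if (oldest)
            archive_job (oldest);
    }
}
```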
Update2:
- oh yeah, I have an increasing belief that we don't need to create a new module. We can add `job-db` to `job-list`.
I think I have a prototype that is approaching something "good" that we'd like to consider for actual use.
- add database to the `job-list` module
- we should only support database archiving if a `dbpath` or `statedir` is configured; if not, `job-list` behaves just like it does now (including behavior with job purging). In other words, `job-list` can lose job history in a non-statedir / non-dbpath configured instance.
  - this is also convenient, as an "inactive job cache" / inactive job list just sort of is left in place
- I'm going with a db table of
```c
const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";
```
The main reason we still have the `t_inactive` column is b/c this is backwards compatible with sqlite versions that don't have json support.
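For discussion, a hedged sketch of what archiving one inactive job into that table might look like with a prepared statement; encoding the id as hex text and the parameter list are assumptions, not the prototype's actual code:

```c
/* Sketch: archive one inactive job into the table above.  The strings
 * passed in (jobdata/eventlog/jobspec/R) are assumed to be pre-encoded
 * JSON / eventlog text; error handling is abbreviated.
 */
#include <sqlite3.h>
#include <stdio.h>
#include <flux/core.h>

int job_db_store (sqlite3 *db, flux_jobid_t id, double t_inactive,
                  const char *jobdata, const char *eventlog,
                  const char *jobspec, const char *R)
{
    const char *sql =
        "INSERT OR REPLACE INTO jobs"
        " (id, t_inactive, jobdata, eventlog, jobspec, R)"
        " VALUES (?, ?, ?, ?, ?, ?)";
    sqlite3_stmt *stmt;
    char idbuf[32];
    int rc = -1;

    /* store id as hex text to fit the CHAR(16) column (an assumption) */
    snprintf (idbuf, sizeof (idbuf), "%llx", (unsigned long long)id);
    if (sqlite3_prepare_v2 (db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    sqlite3_bind_text (stmt, 1, idbuf, -1, SQLITE_STATIC);
    sqlite3_bind_double (stmt, 2, t_inactive);
    sqlite3_bind_text (stmt, 3, jobdata, -1, SQLITE_STATIC);
    sqlite3_bind_text (stmt, 4, eventlog, -1, SQLITE_STATIC);
    sqlite3_bind_text (stmt, 5, jobspec, -1, SQLITE_STATIC);
    sqlite3_bind_text (stmt, 6, R, -1, SQLITE_STATIC);
    if (sqlite3_step (stmt) == SQLITE_DONE)
        rc = 0;
    sqlite3_finalize (stmt);
    return rc;
}
```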
did we decide that job-archive could stick around as long as flux-accounting needs it (but ultimately flux-accounting would transition to this?)
@garlick yeah, `job-archive` will stick around until it is no longer needed.