Job Submission-Related TODOs
The following is a list of TODOs associated with PR#2767. These are items that would be good to revisit in the near future:
- [x] Create a new unit test for submit(), so that it doesn't have to rely on other tests.
- [ ] Revisit Mapping Torque states to Qiita states w/Jeff and Antonio.
- [ ] Revisit communicating intermediate Torque states to Qiita beyond 'completed'.
- [ ] The qdb.complete() call might be better wrapped in a REST call, to avoid potential dependency issues.
- [x] Confirm PBS_JOBID is being set correctly in launch_torque.
- [x] Revisit legacy script qiita-epilogue in launch_torque().
- [ ] As soon as one validator is shown to be in error, we should kill the rest in processing_job.py.
- [ ] Broker File/Directory Operations so that multiple processes and/or threads don't stomp on each other.
- [ ] Move away from AssertionErrors where appropriate.
- [ ] Add a handler for SIGTERM and possibly SIGQUIT.
- [ ] Remove the 'TORQUE_' prefix from config_test.cfg variables. Possibly do the same for the 'EBI_' variables.
- [ ] Use 'qstat -u' to restrict the parsed qstat output to the qiita user only.
- [ ] Reconsider using an XML parser library, or perhaps encapsulate the parsing code.
- [ ] Consider using 'bash -x' in cmd strings.
- [ ] Revisit '.bash_profile' in processing_job.py
- [x] Consider a set of defined resource allocations (Low, Medium, High, etc.) and have plugins define their own needs, while Qiita instance defines how to map those reserved words to actual allocations.
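The last item in the list above could be sketched roughly as follows; the tier names and the Torque resource strings are invented for illustration and are not from the Qiita codebase. The idea is that plugins only declare a tier, while the Qiita instance owns the mapping to concrete allocations.

```python
# Hypothetical sketch: plugins request a named tier; the Qiita instance
# maps tiers to concrete Torque resource requests. All names and values
# here are illustrative, not actual Qiita configuration.
TIER_TO_QSUB_RESOURCES = {
    'LOW': '-l nodes=1:ppn=1,mem=4gb,walltime=2:00:00',
    'MEDIUM': '-l nodes=1:ppn=4,mem=16gb,walltime=12:00:00',
    'HIGH': '-l nodes=1:ppn=16,mem=64gb,walltime=72:00:00',
}


def resources_for(tier):
    """Return the qsub resource string for a plugin-declared tier."""
    try:
        return TIER_TO_QSUB_RESOURCES[tier.upper()]
    except KeyError:
        raise ValueError('Unknown resource tier: %s' % tier)
```

Keeping the tier-to-allocation table on the Qiita side means an instance can retune allocations without touching any plugin.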
Adding here the list from https://github.com/biocore/qiita/issues/2684:
Decided that I'll put together a list of things we want to improve so we can group, discuss, and address them.
- [ ] Currently, when a job completes it issues a complete_job call, which in turn issues N validators to validate the output; in the best case we issue 2 new jobs. It would be better to manage the local environments and only issue jobs that make sense, i.e. issue a new job for N-1 validators. Original report.
- [ ] Qiita should be aware of which queues/jobs the completed job was run on. This information is currently loaded from the QIITA_CONFIG_FILE, which means that if a job was run in queue A while the servers are running against queue B, the finishing jobs will be issued in A instead of B.
- [ ] Allow deletion of jobs. I believe that if we simply error the job in Qiita, the next time it tries to update its heartbeat the job in the queue will die, because you can't update the heartbeat of an errored job, but this needs testing. Original issue.
- [ ] This came up during the deblur reprocessing: the server goes down (and automatically restarts) and the job can't be started, so it stays queued but is not retried. BTW, the error we get when trying is: `qstat: cannot connect to server`
- [ ] Re-evaluate the need for internal validations ... a good example of why we need them is deblur and shotgun, which can silently produce an empty BIOM; the validation catches it.
- [ ] Use the plugin_launcher and the private_launcher during job submission; they are currently unused.
- [ ] Somehow be able to determine if the job in the queue has died ...
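For the last item, one possible approach is to ask qstat about the specific job id and use the exit status: qstat exits non-zero (printing 'Unknown Job Id') once the server no longer tracks the job. This is only a sketch; the `qstat_cmd` parameter exists solely so the check can be exercised without a Torque installation.

```python
import subprocess


def torque_job_exists(job_id, qstat_cmd='qstat'):
    """Return True if the Torque server still tracks job_id.

    qstat exits with a non-zero status once a job is no longer
    known to the server, so the exit status alone is enough here.
    """
    result = subprocess.run([qstat_cmd, job_id],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Note this only detects jobs that vanished from the queue entirely; a job stuck in state R after its node died would need a heartbeat-based check as well.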
To address the state-mapping item listed above:
A link to current mapping of Torque states to Qiita states for @antgonza and Jeff: https://docs.google.com/spreadsheets/d/1ry_D3xmwQEYbmL5N2gXwYkkupsflmvKU1CoFF0AhkN4/edit?usp=sharing
I think what would be nice is to replace 'running' in Qiita with 'working' and 'validating'. It seems like a common observation is that most of the time a job is simply 'running' and it's not fine-grained enough to be informational. As a Qiita job is really a set of jobs that are running or in the queue to be run, mapping states isn't simply a 1:1 issue. (Not to imply that it's a difficult issue of course.)
It would be nice if plugins could also periodically give updates as to approximately how close to finishing they are, but that's a separate issue.
Just making a note from the Qiita meeting that we want to discuss taking file/directory commands out of job submission, before implementing the change for the next release.
Adding another request here:
- [x] Make sure that we can add parameters to the qsub command without having to restart Qiita; for example, the location of the epilogue script, which differs between qiita and qiita-rc.
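One way to satisfy the item above is to re-read the qsub options from a config file at every submission instead of caching them at startup. A minimal sketch, assuming a hypothetical config file and section name (both invented here):

```python
# Hypothetical sketch: build extra qsub arguments fresh on every
# submission so changes (e.g. the epilogue path, which differs
# between qiita and qiita-rc) take effect without a restart.
# The file name and '[torque]' section are invented for illustration.
from configparser import ConfigParser


def qsub_extra_args(config_path='qsub_options.cfg'):
    """Read the config file and return extra qsub arguments."""
    config = ConfigParser()
    config.read(config_path)  # a missing file simply yields no options
    args = []
    if config.has_option('torque', 'epilogue'):
        args.extend(['-l',
                     'epilogue=%s' % config.get('torque', 'epilogue')])
    return args
```

Because the file is read per call, editing it changes the next submission immediately; the trade-off is a small amount of I/O on every qsub.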
New requests:
- [ ] Add support for certain jobs to be executed locally in the head node, even on systems submitting jobs on Torque. This would involve an environment variable that could be set, and metadata added to a table to direct which jobs should be run where. Original
by mistake originally added to #2554.
I am not clear on this request. Can you clarify if the request is to run jobs on the cluster head node?
@jdereus, if I remember correctly this is something you actually requested a long time ago, but it was classified under the wrong issue, so when I was working on the other issue I decided to move it here. Basically, the request was that some jobs (like updates to the database) should run on the head node rather than in the queue. The original issue, opened by @charles-cowart, can be found here: https://github.com/biocore/qiita/issues/2859. Anyway, the idea is to keep all of these kinds of suggestions under the same issue so they get considered when implemented.
The node in this case was qiita itself. Not the head node of the cluster.
Thank you for clarifying! Yes, it's the qiita head node (where the web servers run) ... BTW, the Qiita system cannot submit jobs to the cluster head node.
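The request discussed above (running certain jobs, e.g. database updates, on the qiita node itself rather than through the queue) could be dispatched with something like the following sketch. The environment variable, job-type set, and function names are all invented for illustration; the real implementation would read job-type routing from a database table as the request suggests.

```python
import os
import subprocess

# Hypothetical job types that should bypass Torque and run on the
# qiita node itself (the web-server host, not the cluster head node).
LOCAL_JOB_TYPES = {'db_update', 'metadata_patch'}


def submit(job_type, cmd):
    """Run cmd locally or via qsub, depending on job type.

    QIITA_FORCE_LOCAL is an invented override for illustration;
    in practice the routing would come from a metadata table.
    """
    run_local = (job_type in LOCAL_JOB_TYPES or
                 os.environ.get('QIITA_FORCE_LOCAL') == '1')
    if run_local:
        # execute directly on the qiita node
        return subprocess.call(cmd, shell=True)
    # hand the command to Torque via stdin, as qsub expects a script
    return subprocess.call('echo "%s" | qsub' % cmd, shell=True)
```

Driving the decision from a table keyed by job type would let operators change the routing without a code change, which matches the spirit of the request.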