mdsplus
mdsplus copied to clipboard
Include timestamps in the mdsip "errors" log file
Affiliation MIT PSFC
Description
Mdsip writes to two log files: access
and errors
. (These log files are typically located in /var/log/mdsplus/mdsipd
.) The access
file has timestamps for each log record, however the errors
file does not. Having timestamps on the errors
log records can be useful for troubleshooting some issues (e.g., thousands of mdsip connections in a day resulting in huge log files).
Example
See the timestamps in the access
log file.
Additional context Might be good to enhance both log files so they can easily be parsed by various tools (Python, spreadsheets, log file analysis tools. and so forth).
@joshStillerman also suggested including the session id for the mdsip connection.
Resurfacing this for visibility https://github.com/MDSplus/mdsplus/issues/2619 . I know it's different functionality, but we also would benefit from improved logging and can weigh in on what we encounter (eg. the ability to attribute who did what that broke something via say the PID for an error and the user whose access is associated with that PID).
Can we also add the username that the mdsip session belongs to into the .errors
log?
Trying to associate an error message with the user that produced it is currently quite difficult.
Hi @sflanagan and @ModestMC,
Adding the username
and PID
are also a great suggestions!
I haven't looked at the code yet, so don't know what is possible.
But I do know that analyzing the existing access
and errors
logs don't contain enough information to make it easy to troubleshoot mdsip
issues. (In recent months, I have spent a lot of time digging through the logs to investigate errors; it is presently a difficult task.)
Let me know if you want this upgraded to a US Priority
feature.
Thanks, -MarkW
@mwinkel-dev I can't speak for Sean offhand, but I do think this could potentially be very useful.
My thought (if you're open to it) is that before we try to decide whether this is something that should be US Priority, it'd be good to know how much work it would be to pull off. Would you be willing to spend 15-30 minutes looking into the underlying code that writes these logs and trying to evaluate how much work it'd be to implement? We know you guys have plenty of things to do. If it turns out this gets ugly, we call it a wash or find a compromise, but if you think it's not to bad I think we could sign on to it.
Cheers, Mitchell
Hi @ModestMC,
I glanced at the logging code in November 2024 while researching Issue #2649. At that time, I noticed that the access
log file has timestamps, but that the errors
log file does not.
Thus, my hunch (perhaps wrong) is that upgrading errors
to have timestamps will be straightforward. Just a matter of copying the existing timestamp code from the procedures that write to the access
log.
However, adding the other desired attributes (process ID, user, mdsip
session ID) might be more difficult. Have not investigated that, so don't know which attributes will be easy to include in the error messages.
-MarkW
Hey @mwinkel-dev,
We agree that it will be some work to incorporate the other desired attributes (process ID, user, session ID, etc) in the log file; is there any reason you couldn't include timestamps at least in the interim? Perfect is the enemy of good.
Thank you for putting in the work that you do.
-Mitchell
Hi @ModestMC,
Thanks for the recent post indicating that GA wants a fix for this issue. I have therefore upgraded this to "US Priority". Work on this issue will start in a day or two.
Because this involves the mdsip
service, presumably this fix will also be needed in the ga_atlas
branch.
After a preliminary investigation is done, I will update this GitHub issue with two estimates:
- days needed to add just a timestamp to the
errors
log file, and - days needed to add other information (connection ID, process ID, user) to both the
access
and theerrors
log file.
Note that the complete fix will probably involve two PRs, one for each option listed above. GA can then decide if it makes sense to wait for both PRs so they can be included in a single release of ga_atlas
. Or alternatively, to cherry-pick the option 1) PR immediately to ga_atlas
, and defer the option 2) PR until a later release of ga_atlas
.
Hey @mwinkel-dev,
I wouldn't go so far as to say this fix is 'needed'; the hope is that just adding timestamps is lowhanging fruit that would be easy to incorporate without breaking things.
If it turns out this will be time consuming, it should not delay the ga_atlas fixes for the dispatcher issues. I believe Sean is planning to update Atlas tomorrow, so my guess is that it'll be food for the next update.
It might be good to make a separate issue for major log improvements like connection/process ID and user, as I think they're an extension of this issue's scope (make no mistake, we still think they're very, very good ideas).
Something we wondered today; is it possible to have the logs include which user accessed a particular tree? Would also be nice to toggle such an option on and off
That is an interesting idea. If we go to using the system logs and journalctl they will be organized by connection / service. Regardless maybe a tool could be written to filter the agrigate of them.