mdsplus icon indicating copy to clipboard operation
mdsplus copied to clipboard

Include timestamps in the mdsip "errors" log file

Open mwinkel-dev opened this issue 1 year ago • 11 comments

Affiliation MIT PSFC

Description Mdsip writes to two log files: access and errors. (These log files are typically located in /var/log/mdsplus/mdsipd.) The access file has timestamps for each log record, however the errors file does not. Having timestamps on the errors log records can be useful for troubleshooting some issues (e.g., thousands of mdsip connections in a day resulting in huge log files).

Example See the timestamps in the access log file.

Additional context Might be good to enhance both log files so they can easily be parsed by various tools (Python, spreadsheets, log file analysis tools. and so forth).

mwinkel-dev avatar Jan 10 '24 23:01 mwinkel-dev

@joshStillerman also suggested including the session id for the mdsip connection.

mwinkel-dev avatar Feb 12 '24 18:02 mwinkel-dev

Resurfacing this for visibility https://github.com/MDSplus/mdsplus/issues/2619 . I know it's different functionality, but we also would benefit from improved logging and can weigh in on what we encounter (eg. the ability to attribute who did what that broke something via say the PID for an error and the user whose access is associated with that PID).

ModestMC avatar Feb 13 '24 22:02 ModestMC

Can we also add the username that the mdsip session belongs to into the .errors log?

Trying to associate an error message with the user that produced it is currently quite difficult.

sflanagan avatar Feb 13 '24 22:02 sflanagan

Hi @sflanagan and @ModestMC,

Adding the username and PID are also a great suggestions!

I haven't looked at the code yet, so don't know what is possible.

But I do know that analyzing the existing access and errors logs don't contain enough information to make it easy to troubleshoot mdsip issues. (In recent months, I have spent a lot of time digging through the logs to investigate errors; it is presently a difficult task.)

Let me know if you want this upgraded to a US Priority feature.

Thanks, -MarkW

mwinkel-dev avatar Feb 14 '24 19:02 mwinkel-dev

@mwinkel-dev I can't speak for Sean offhand, but I do think this could potentially be very useful.

My thought (if you're open to it) is that before we try to decide whether this is something that should be US Priority, it'd be good to know how much work it would be to pull off. Would you be willing to spend 15-30 minutes looking into the underlying code that writes these logs and trying to evaluate how much work it'd be to implement? We know you guys have plenty of things to do. If it turns out this gets ugly, we call it a wash or find a compromise, but if you think it's not to bad I think we could sign on to it.

Cheers, Mitchell

ModestMC avatar Feb 15 '24 22:02 ModestMC

Hi @ModestMC,

I glanced at the logging code in November 2024 while researching Issue #2649. At that time, I noticed that the access log file has timestamps, but that the errors log file does not.

Thus, my hunch (perhaps wrong) is that upgrading errors to have timestamps will be straightforward. Just a matter of copying the existing timestamp code from the procedures that write to the access log.

However, adding the other desired attributes (process ID, user, mdsip session ID) might be more difficult. Have not investigated that, so don't know which attributes will be easy to include in the error messages.

-MarkW

mwinkel-dev avatar Feb 16 '24 16:02 mwinkel-dev

Hey @mwinkel-dev,

We agree that it will be some work to incorporate the other desired attributes (process ID, user, session ID, etc) in the log file; is there any reason you couldn't include timestamps at least in the interim? Perfect is the enemy of good.

Thank you for putting in the work that you do.

-Mitchell

ModestMC avatar Apr 24 '24 23:04 ModestMC

Hi @ModestMC,

Thanks for the recent post indicating that GA wants a fix for this issue. I have therefore upgraded this to "US Priority". Work on this issue will start in a day or two.

Because this involves the mdsip service, presumably this fix will also be needed in the ga_atlas branch.

After a preliminary investigation is done, I will update this GitHub issue with two estimates:

  1. days needed to add just a timestamp to the errors log file, and
  2. days needed to add other information (connection ID, process ID, user) to both the access and the errors log file.

Note that the complete fix will probably involve two PRs, one for each option listed above. GA can then decide if it makes sense to wait for both PRs so they can be included in a single release of ga_atlas. Or alternatively, to cherry-pick the option 1) PR immediately to ga_atlas, and defer the option 2) PR until a later release of ga_atlas.

mwinkel-dev avatar Apr 25 '24 01:04 mwinkel-dev

Hey @mwinkel-dev,

I wouldn't go so far as to say this fix is 'needed'; the hope is that just adding timestamps is lowhanging fruit that would be easy to incorporate without breaking things.

If it turns out this will be time consuming, it should not delay the ga_atlas fixes for the dispatcher issues. I believe Sean is planning to update Atlas tomorrow, so my guess is that it'll be food for the next update.

It might be good to make a separate issue for major log improvements like connection/process ID and user, as I think they're an extension of this issue's scope (make no mistake, we still think they're very, very good ideas).

ModestMC avatar Apr 25 '24 21:04 ModestMC

Something we wondered today; is it possible to have the logs include which user accessed a particular tree? Would also be nice to toggle such an option on and off

ModestMC avatar Jun 20 '24 23:06 ModestMC

That is an interesting idea. If we go to using the system logs and journalctl they will be organized by connection / service. Regardless maybe a tool could be written to filter the agrigate of them.

joshStillerman avatar Jun 21 '24 01:06 joshStillerman