Host Monitoring Not Enabling
Hi Guilhem,
I'm not sure if this is a bug or a misconfiguration, but I'm trying to enable monitoring on a load of hosts. Of the 112 hosts, 67 remain enabled but the other 45 revert to 'disabled' after the 5 minute refresh.
Is there some kind of log that I can look at to help diagnose the issue?
Cheers, Mark.
Hi Mark,
Hum right, there are a few conditions where a host can be disabled automatically:
- trackme_auto_disablement_period
There is a macro, which you can check in the UI "TrackMe Manage and configure"; its default definition is the following:
relative_time(now(), "-45d")
This basically means that, by default, if a given data source did not receive any data for more than this period, the entity will get automatically disabled.
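As a rough illustration of what that default means in practice (plain SPL, nothing TrackMe specific), you can compute the cut-off time it corresponds to:
| makeresults
| eval cutoff_epoch=relative_time(now(), "-45d")
| eval cutoff_human=strftime(cutoff_epoch, "%Y-%m-%d %H:%M:%S")
| table cutoff_epoch, cutoff_human
Roughly speaking, an entity whose latest data is older than that cut-off becomes a candidate for automatic disablement.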
- custom REST call action
One could set up a custom alert action using the TrackMe REST API:
https://trackme.readthedocs.io/en/latest/userguide.html#alerts-tracking-trackme-alert-actions
Basically, one could have set up an action to disable the host automatically.
Note that the same thing could be achieved from the outside using a REST call.
In both cases, this would get recorded in the audit collection and show up in the audit changes.
- custom report of yours
One could as well have some custom logic that updates the collection records, basically updating the KVstore collection records directly.
In any case:
- If TrackMe did it, there should be traces in the flipping status for that host; what does this look like?
- Same thing when the entity gets disabled: if the action comes from TrackMe, this will be logged in the audit changes UI.
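If useful, a quick sketch to look for those traces directly from search (replace the object value with the host name as it appears in TrackMe):
| inputlookup trackme_audit_changes | eval key=_key
| search object="your_host_here"
The change_type field on the matching records should tell you what kind of action was applied.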
Let me know if that makes sense
Hi Guilhem, Thanks for the rapid response.
- I haven't changed this value from the installation default.
- I've not really done anything with the REST API
- I'm the only one looking/using/configuring TrackMe and I haven't created any custom reports - only enabled one of the default Alerts.
I had a look at the 'Flipping' status and the ones that keep getting reset look very different from the others:
(not sure if that image is viewable or not)
Cheers, Mark
Yes the screenshot is visible.
Hum, this looks weird; it seems to indicate that this host is continuously being discovered over and over again.
Some questions then:
I would recommend being restrictive enough on the data hosts to start in good conditions; otherwise it tends to contain too much crap data and it's hard to have a clear view.
So, I generally recommend to:
- Add a few indexes to the allow list in data host monitoring to start with: indexes you have qualified as real indexes containing endpoint related data (avoid for example things like proxy data, where the host value is in fact not an endpoint of yours, these kinds of things) - see the sketch after this list.
- After you have added a first index to the allow list, reset the data host collection.
- Starting from there, you will not need to reset the collection anymore.
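As a rough sketch, assuming the allow list lives in the trackme_data_host_monitoring_whitelist_index KVstore collection with a data_index field (the UI is the normal way to manage it), you can inspect what is currently allow listed with:
| inputlookup trackme_data_host_monitoring_whitelist_index
If it comes back empty, there is most likely no index restriction in place for data host discovery.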
What does the record look like?
| inputlookup trackme_host_monitoring | eval keyid=_key
| search data_host="xxx"
Can you try to delete the host from the UI, then run the tracker a few times to see how it behaves?
One option would be that you have lots of crap in there, a very large number of hosts containing a very large number of sourcetypes, etc. For data host monitoring you need to be restrictive and qualify properly what to include.
For docs references:
- https://trackme.readthedocs.io/en/latest/configuration.html#step-2-configure-trackme-to-match-your-needs
Hi there, The hosts that I'm looking at here are all Splunk infrastructure (in this case, they are HFs) so, I guess, it's the _internal index that is being monitored. There doesn't appear to be any issue with the _internal logs.
- I've attached the output from the search. I also included 3 other hosts that are correctly being monitored. It's the last one in the list that has the problem.
- As suggested, I deleted the host and have run the short term and long term trackers several times. Unfortunately, the host hasn't been rediscovered :(
Mark
@cidermark
When you deleted the host, did you use permanent deletion or temporary deletion? If permanent, it won't come back on its own.
You can check your action in the audit changes tab.
As well:
- Do you have anything in the allow list for data hosts?
- If you do, does it include _internal?
Guilhem
@guilhemmarchand
I did a temporary deletion and ran both short and long term trackers several times with the same results. I then tried a permanent delete :(
With regards to the allow/block lists - all lists are at the default installation settings. I haven't added or removed anything from those.
Mark.
Hi @cidermark
When you delete a host through the UI, this creates a deletion record in the audit changes, for example:
| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL"
To allow the host to be re-created, you can update this record, for example:
| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL" AND key="615ecfddeb20813a9e41894f"
| eval change_type="delete temporary"
| outputlookup append=t key_field=key trackme_audit_changes
Then, when running the tracker, the host can be re-created if the data allows it.
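To double check that the update took effect, a quick sketch reusing the same record (change_type is the field we just updated):
| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL"
| table key, object, change_type
It should now show "delete temporary" for that record.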
Now if your host is still not created, you can start from this search:
| savedsearch "TrackMe - Data hosts abstract root tracker"
| search data_host="EVENTGEN.RETAIL"
And check what is going on; you can expand the search and go step by step to understand why it wouldn't be created. This savedsearch is called by both trackers.
Hi @guilhemmarchand
I followed the advice to re-add the server and it's back in the list.
I re-enabled monitoring but, sadly, it still reverts back to not monitored after 5 minutes :(
Mark.
@cidermark
Right, ok so now that it's back in the collection, let's continue. You are basically saying that the field data_monitored_state reverts to disabled on its own; hum, there must be a reason. Let me think about it and share a few searches to troubleshoot.
@cidermark
In my previous message I was showing this:
| savedsearch "TrackMe - Data hosts abstract root tracker"
| search data_host="EVENTGEN.RETAIL"
Adapt this to your own case, then run this command over the last 4 hours for instance, and expand the search.
You will get quite a large search; there are parts of the code which deal with data_monitored_state.
While comparing these with yours, do you see anything special?
- local config
Can you please check in:
/opt/splunk/etc/apps/trackme/local/
And look at any local config file you have, especially savedsearches.conf and macros.conf - anything in there?
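If it's easier than reading the files directly, a quick sketch using the standard Splunk configs REST endpoints (the macro name is the one mentioned earlier) shows the effective, merged definition, which would reflect any local override:
| rest splunk_server=local /servicesNS/nobody/trackme/configs/conf-macros
| search title="trackme_auto_disablement_period"
| table title, definition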
Hi @cidermark
Let me know if you have any update ;-)
Hi @guilhemmarchand ,
I'll get on to this as soon as I can but I'm away from my computer this week. Hopefully I'll be able to take a look tomorrow.
Mark
No problem @cidermark, just wanna make sure we don't leave that out. If you keep struggling on this one then we could have some live chat and check this together.
Guilhem
Hi @guilhemmarchand Sorry it took me a while to get back to you - this is my 1st day back!!! I ran the (slightly modified) search as requested and noticed that the servers that are not staying enabled seem to have quite a number of missing fields compared to the ones that are enabled, e.g. data_host_alerting_policy, data_previous_host_state, enable_behaviour_analytic, priority - 18 fields in all.
I couldn't find anything especially notable in the local directory - just a modified macros.conf and savedsearches.conf
Does this give any better insight as to what the problem may be?
Again, thanks for your help, Mark
Hi @guilhemmarchand - any thoughts on my response?
Cheers, Mark.
Hi @cidermark
Thanks for the reminder ;-) Yes, as such it doesn't really allow me to understand the issue.
One potential root cause, I think, might be the search breaking due to a way too large number of sourcetypes for a given number of hosts.
This can happen with some bad practices such as dynamic sourcetyping.
Can you run the following, which will show the biggest records from the collection:
| inputlookup trackme_host_monitoring | eval keyid=_key
| eval len=len(data_host_st_summary)
| sort limit=0 - len
| table data_host, data_host_st_summary, *
This can be cross-checked against the data itself:
| tstats count as data_eventcount where sourcetype=* host=* host!="" `trackme_tstats_main_filter` ( ( `trackme_get_idx_whitelist(trackme_data_host_monitoring_whitelist_index, data_index)` `apply_data_host_blacklists_data_retrieve` ) OR `trackme_tstats_main_filter_for_host` ) by index, sourcetype, host
| stats dc(sourcetype) as dcount, values(sourcetype) by host
| sort 0 - dcount
What we want to find out is whether any hosts have a seriously large number of sourcetypes, which should be excluded from the host tracking.
Let me know
@cidermark
Thinking about it, the easiest might be that we have a look together; I believe you have some form of exception here and I am sure there's a reason.
My email is: [email protected] You can ping me on Splunk community Slack too, then we can meet when convenient.
Guilhem