Host Monitoring Not Enabling

Open cidermark opened this issue 3 years ago • 18 comments

Hi Guilhem,

I'm not sure if this is a bug or a misconfiguration, but I'm trying to enable monitoring on a load of hosts. Of the 112 hosts, 67 remain enabled but the other 45 revert to 'disabled' after the 5 minute refresh.

Is there some kind of log that I can look at to help diagnose the issue?

Cheers, Mark.

cidermark avatar Oct 06 '21 07:10 cidermark

Hi Mark,

Hum right, there are a few conditions where a host can be set to disabled automatically:

  1. trackme_auto_disablement_period

There is a macro, which you can check out in the UI "TrackMe Manage and configure", the default definition says the following:

relative_time(now(), "-45d")

This basically means that, by default, if a given data source did not receive any data for more than this period, the entity will get automatically disabled.
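To see which point in time the default definition resolves to, a minimal sketch along these lines can be run from the search bar. It only evaluates the expression quoted above; the field names `threshold_epoch` and `threshold_human` are made up for illustration and are not part of TrackMe.

```spl
| makeresults
| eval threshold_epoch=relative_time(now(), "-45d")
| eval threshold_human=strftime(threshold_epoch, "%Y-%m-%d %H:%M:%S")
| table threshold_epoch, threshold_human
```

An entity whose latest data is older than the displayed date would presumably be a candidate for automatic disablement under the default setting.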

  2. custom rest call action

One could set up a custom alert action using the TrackMe REST API:

https://trackme.readthedocs.io/en/latest/userguide.html#alerts-tracking-trackme-alert-actions

Basically, someone could have set up an action to disable the host automatically.

Note that the same thing could be achieved from the outside using a REST call.

In both cases, this would get tagged on the audit collection and the audit changes.

  3. custom report of yours

One could well have a custom report with its own logic that updates the KVstore collection records directly.

In any case:

  • If TrackMe did it, there should be traces in the flipping status for that host. What does this look like?

  • Same, when the entity gets disabled, if the action comes from TrackMe this will be logged in the audit changes UI

Let me know if that makes sense

guilhemmarchand avatar Oct 06 '21 08:10 guilhemmarchand

Hi Guilhem, Thanks for the rapid response.

  1. I haven't changed this value from the installation default.
  2. I've not really done anything with the REST API
  3. I'm the only one looking/using/configuring TrackMe and I haven't created any custom reports - only enabled one of the default Alerts.

I had a look at the 'Flipping' status and the ones that keep getting reset look very different from the others: image

(not sure if that image is viewable or not)

Cheers, Mark

cidermark avatar Oct 06 '21 09:10 cidermark

Yes the screenshot is visible.

Hum, this looks weird. It seems to indicate that this host is continuously being discovered over and over again.

Some questions then:

I would recommend being restrictive enough on the data hosts to start in good conditions. Data host tracking tends to pick up too much crap data, and then it's hard to get a clear view.

So, I recommend generally to:

  • Add a few indexes to the allow list in data host monitoring to start with: indexes you have qualified as real indexes containing endpoint-related data (avoid, for example, things like proxy data, where the host value is in fact not an endpoint of yours)

  • After you added a first index in allow list, reset the data host collection

  • Starting from there you will not need anymore to reset the collection

What does the record look like?

| inputlookup trackme_host_monitoring | eval keyid=_key
| search data_host="xxx"

Can you try to delete the host from the UI, then run the tracker a few times to see how it behaves?

One possibility is that you have lots of crap in there: a very large number of hosts containing a very large number of sourcetypes, etc. For data hosts you need to be restrictive and properly qualify what to include.

guilhemmarchand avatar Oct 06 '21 09:10 guilhemmarchand

For docs references:

  • https://trackme.readthedocs.io/en/latest/configuration.html#step-2-configure-trackme-to-match-your-needs

guilhemmarchand avatar Oct 06 '21 09:10 guilhemmarchand

Hi there, The hosts that I'm looking at here are all Splunk infrastructure (in this case, they are HFs) so, I guess, it's the _internal index that is being monitored. There don't appear to be any issues with the _internal logs.

  1. I've attached the output from the search. I also included 3 other hosts that are correctly being monitored. It's the last one in the list that has the problem.

  2. As suggested, I deleted the host and have run the short term and long term trackers several times. Unfortunately, the host hasn't been rediscovered :(

host.monitoring.results.csv

Mark

cidermark avatar Oct 06 '21 10:10 cidermark

@cidermark

When you deleted the host, did you use permanent deletion or temporary deletion? If permanent it won't come up on its own

You can check your action in the audit change tab

As well:

  • Do you have anything in allow list for data hosts?
  • If you do, is the _internal index in it?

Guilhem

guilhemmarchand avatar Oct 06 '21 12:10 guilhemmarchand

@guilhemmarchand

I did a temporary deletion and ran both short and longterm trackers several times with the same results. I did then try a permanent delete :(

With regards to the allow/block - all lists are at the defaults installation settings. I haven't added or removed anything from those.

Mark.

cidermark avatar Oct 07 '21 09:10 cidermark

Hi @cidermark

When you delete a host through the UI, this creates a deletion record in the audit changes, for example:

| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL"

image

To allow the host to be re-created, you can update this record, for example:

| inputlookup trackme_audit_changes | eval key=_key | search object="EVENTGEN.RETAIL" AND key="615ecfddeb20813a9e41894f"
| eval change_type="delete temporary"
| outputlookup append=t key_field=key trackme_audit_changes

Then, when running the tracker the host can be re-created if the data allow it.

Now, if your host is still not created, you can start from this search:

| savedsearch "TrackMe - Data hosts abstract root tracker"

| search data_host="EVENTGEN.RETAIL"

Then check what is going on: you can expand the search and go step by step to understand why the host wouldn't be created. This saved search is called by both trackers.

guilhemmarchand avatar Oct 07 '21 10:10 guilhemmarchand

Hi @guilhemmarchand

I followed the advice to re-add the server and it's back in the list.

I re-enabled monitoring but, sadly, it still reverts back to not monitored after 5 minutes :(

Mark.

cidermark avatar Oct 07 '21 15:10 cidermark

@cidermark

Right, ok, so now that it's back in the collection let's continue. You are basically saying that the field data_monitored_state goes back to disabled on its own. Hum, there must be a reason. Let me think about it and share a few searches to troubleshoot.
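In the meantime, a rough sketch for watching the state flip could look like the following. It reuses the `trackme_host_monitoring` lookup and the `data_host` and `data_monitored_state` fields mentioned in this thread, and assumes the stored value is the literal string "disabled"; adjust it if the collection uses a different value.

```spl
| inputlookup trackme_host_monitoring
| eval keyid=_key
| search data_monitored_state="disabled"
| table keyid, data_host, data_monitored_state
```

Running it right after re-enabling a host, and again after the next tracker execution, should show whether the record itself is being rewritten.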

guilhemmarchand avatar Oct 07 '21 15:10 guilhemmarchand

@cidermark

In my previous message I was showing this:

| savedsearch "TrackMe - Data hosts abstract root tracker"

| search data_host="EVENTGEN.RETAIL"

Adapt this to your own case, then run this command over the last 4 hours for instance, and expand the search.

You will get a quite large search, there are parts of the code which are dealing with the data_monitored_state:

image

image

While comparing these with yours, do you see anything special?

  1. local config

Can you please check in:

/opt/splunk/etc/apps/trackme/local/

And check any local config files you have, especially savedsearches.conf and macros.conf. Is there anything in there?

guilhemmarchand avatar Oct 08 '21 15:10 guilhemmarchand

Hi @cidermark

Let me know if you have any update ;-)

guilhemmarchand avatar Oct 13 '21 06:10 guilhemmarchand

Hi @guilhemmarchand ,

I'll get on to this as soon as I can but I'm away from my computer this week. Hopefully I'll be able to take a look tomorrow.

Mark

cidermark avatar Oct 13 '21 10:10 cidermark

No problem @cidermark just wanna make sure we don't leave that out. If you keep struggling on this one then we could have some live chat and check this together.

Guilhem

guilhemmarchand avatar Oct 15 '21 08:10 guilhemmarchand

Hi @guilhemmarchand Sorry it took me a while to get back to you - this is my 1st day back!!! I ran the (slightly modified) search as requested and noticed that the servers that are not staying enabled seem to have quite a number of missing fields compared to the ones that are enabled, e.g. data_host_alerting_policy, data_previous_host_state, enable_behaviour_analytic, priority - 18 fields in all.

image
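A quick way to list every record affected this way might be a search like the one below. It is only a sketch: it checks just the four fields named above (not all 18), and it assumes the same fields are also absent from the records stored in the `trackme_host_monitoring` collection, which may not be the case if they are only missing from the tracker output.

```spl
| inputlookup trackme_host_monitoring
| eval keyid=_key
| where isnull(data_host_alerting_policy) OR isnull(data_previous_host_state) OR isnull(enable_behaviour_analytic) OR isnull(priority)
| table keyid, data_host, data_monitored_state, data_host_alerting_policy, data_previous_host_state, enable_behaviour_analytic, priority
```

If the hosts returned here match the 45 that keep reverting, that would tie the missing fields to the disablement.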

I couldn't find anything especially notable in the local directory - just a modified macros.conf and savedsearches.conf

Does this give any better insight as to what the problem may be?

Again, thanks for your help, Mark

cidermark avatar Oct 25 '21 10:10 cidermark

Hi @guilhemmarchand - any thoughts on my response?

Cheers, Mark.

cidermark avatar Nov 05 '21 13:11 cidermark

Hi @cidermark

Thanks for the reminder ;-) Yes, but as such it doesn't really allow me to understand the issue.

One potential root cause, I think, is the search breaking because of a way too large number of sourcetypes for some of the hosts.

This can happen with some bad practices such as dynamic sourcetyping. Can you run the searches below?

The following shows the records with the biggest sourcetype summary in the collection:

| inputlookup trackme_host_monitoring | eval keyid=_key
| eval len=len(data_host_st_summary)
| sort limit=0 - len
| table data_host, data_host_st_summary, *

This can then be checked against the data itself:

| tstats count as data_eventcount where sourcetype=* host=* host!="" `trackme_tstats_main_filter` ( ( `trackme_get_idx_whitelist(trackme_data_host_monitoring_whitelist_index, data_index)` `apply_data_host_blacklists_data_retrieve` ) OR `trackme_tstats_main_filter_for_host` ) by index, sourcetype, host 
| stats dc(sourcetype) as dcount, values(sourcetype) by host
| sort 0 - dcount

What we want to find out is which hosts have a seriously large number of sourcetypes, as these should be excluded from the host tracking.

Let me know

guilhemmarchand avatar Nov 05 '21 14:11 guilhemmarchand

@cidermark

Thinking about it, the easiest might be that we have a look together. I believe you have some form of exception here and I am sure there's a reason.

My email is: [email protected] You can ping me on Splunk community Slack too then we can meet when convenient.

Guilhem

guilhemmarchand avatar Nov 11 '21 21:11 guilhemmarchand