After rerunning the process, process-exporter does not read its information

Open lcgogo opened this issue 6 years ago • 9 comments

After I rerun my process, process-exporter no longer reads its information. I use Ubuntu 16.04 amd64.

My config is below:

- name: "runworker"
  exe:
  - /var/www/py36/bin/python
  cmdline:
  - manage.py runworker

And my process is below:

(py36) root@web4:/var/www/cobo# ps -aux | grep runworker
root 28982 0.0 0.0 14224 1072 pts/1 S+ 11:32 0:00 grep --color=auto runworker
www-data 31014 0.5 3.8 713216 156488 ? Sl Jan25 5:57 /var/www/py36/bin/python manage.py runworker
www-data 31015 0.5 3.8 712232 154368 ? Sl Jan25 5:40 /var/www/py36/bin/python manage.py runworker
www-data 31016 0.5 3.8 712796 155576 ? Sl Jan25 5:53 /var/www/py36/bin/python manage.py runworker
www-data 31017 0.5 3.8 712376 155516 ? Sl Jan25 5:31 /var/www/py36/bin/python manage.py runworker

I have to restart process-exporter to recover it.

The memory is 0B (runworker is the blue line) before my restart. And the CPU is 0% too.

[Screenshot: 2019-01-26 11 44 26]

lcgogo avatar Jan 26 '19 03:01 lcgogo

Hi @lcgogo,

Just to make sure that the problem isn't in grafana or the promql, could you show me the result of querying process-exporter directly?

e.g.

  1. curl myhost:9256/metrics |grep runworker
  2. after your runworker has died and been restarted, again curl myhost:9256/metrics |grep runworker
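The same check can be scripted so that the "before" and "after" outputs are easy to compare. A minimal sketch in Python, assuming the metrics text has already been fetched (for example with the curl command above). The helper name `group_samples` and the sample input are illustrative and not part of process-exporter:

```python
def group_samples(metrics_text, groupname):
    """Return {series: value} for every sample labelled with the given groupname."""
    needle = 'groupname="%s"' % groupname
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and samples of other groups.
        if not line or line.startswith("#") or needle not in line:
            continue
        # A sample line is "<name>{<labels>} <value>"; split off the value.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative input only, not real output from the reporter's host.
SAMPLE = """\
# illustrative sample
namedprocess_namegroup_num_procs{groupname="runworker"} 4
namedprocess_namegroup_num_procs{groupname="other"} 1
"""

print(group_samples(SAMPLE, "runworker"))
```

Running it once before and once after the restart, and diffing the two dictionaries, shows whether the group disappears at the exporter itself rather than in Grafana or the PromQL.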

What version of process-exporter are you using?

ncabatoff avatar Apr 07 '19 16:04 ncabatoff

Hi @ncabatoff, we encountered something similar today.

We have process-exporter (0.6.0) on a host monitoring 4 process groups. All the 4 processes were stopped, say around 17:55 and then restarted 20 minutes later. However, after the restart, process-exporter only showed 3 process groups were up, although a manual check on the machine indicated all 4 were up. Didn't see anything in the logs.

We wondered whether the PID change could have caused the issue, so we tried to reproduce it in another environment by randomly taking processes up and down, but process-exporter seemed to catch all the changes.

Do you have any ideas about what happened and how to debug this issue? Also, is it correct that process-exporter looks up the PID on each scrape based on the process name defined in the config file? Thanks a lot!

mapshen avatar Mar 24 '20 21:03 mapshen

@lcgogo Did you figure out the issue? We're seeing process-exporter reporting more and more processes incorrectly as more of them have been restarted...

mapshen avatar Mar 31 '20 13:03 mapshen

Can you check with the -recheck option enabled? If the processes exec others and hence change name, etc., they will not be matched again against what you configured after they were seen for the first time...

flixr avatar Mar 31 '20 16:03 flixr

Thanks so much for chipping in @flixr !

I ruled out that possibility early on for some reason, but now that you have brought it up again, on second thought it actually could be it. We do use exec when starting up a process in the environment where the issue is seen, and if a process is captured before the exec happens, it will not match, so it is marked as untracked and ignored. And since a process is identified by an ID struct (Pid and StartTimeRel), it won't be re-checked even when comm, exe or cmdline changes later.
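The behaviour described here can be illustrated with a toy model. This is a simplified sketch in Python, not the actual process-exporter code: the `Tracker` class, the `recheck` parameter, and the wrapper script name `start-worker.sh` are all hypothetical, and only the idea of keying decisions on (pid, start time) comes from the discussion above.

```python
import re
from typing import NamedTuple


class ProcId(NamedTuple):
    """Identity of a process: stays the same across exec."""
    pid: int
    start_time: int


class Tracker:
    def __init__(self, pattern, recheck=False):
        self.pattern = re.compile(pattern)
        self.recheck = recheck
        self.tracked = set()   # ids that matched the pattern
        self.ignored = set()   # ids that were seen but did not match

    def update(self, procs):
        """procs maps ProcId -> current cmdline; returns the tracked count."""
        live = set(procs)
        self.tracked &= live   # forget processes that have exited
        self.ignored &= live
        for proc_id, cmdline in procs.items():
            if proc_id in self.tracked:
                continue
            # Without recheck, the first decision for an id is final.
            if proc_id in self.ignored and not self.recheck:
                continue
            if self.pattern.search(cmdline):
                self.tracked.add(proc_id)
                self.ignored.discard(proc_id)
            else:
                self.ignored.add(proc_id)
        return len(self.tracked)


def simulate(recheck):
    tracker = Tracker(r"manage\.py runworker", recheck=recheck)
    worker = ProcId(pid=4242, start_time=1000)
    # First scrape lands before exec, second scrape after exec.
    before = tracker.update({worker: "/bin/sh start-worker.sh"})
    after = tracker.update({worker: "/var/www/py36/bin/python manage.py runworker"})
    return before, after


print(simulate(recheck=False))  # (0, 0)
print(simulate(recheck=True))   # (0, 1)
```

In this model the extra cost of rechecking is visible too: with `recheck=True`, every ignored process is matched against the pattern again on every update.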

I will test it out and circle back. And if that works, will also make a note of how many more resources will be consumed.

mapshen avatar Apr 01 '20 03:04 mapshen

Mystery resolved.

Was able to confirm it with -recheck, but the cpu time went up about 10x and so did the scrape time, which is not ideal (at all).

Fortunately in our case, the fragment we are interested in exists both before and after exec happens, so that we can reliably identify the process without having to rely on rechecking.
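As a rough illustration of that point, here is a sketch in Python comparing a pattern that only matches the final command line with one that matches a fragment present both before and after exec. The two command lines and both patterns are hypothetical examples, not taken from our environment:

```python
import re

# Hypothetical command lines of one process before and after exec.
BEFORE_EXEC = "/bin/sh -c exec python manage.py runworker"
AFTER_EXEC = "python manage.py runworker"

# Matches only the final form of the command line.
narrow = re.compile(r"^python manage\.py runworker$")
# Matches a fragment that survives the exec.
stable = re.compile(r"manage\.py runworker")


def matches_both(pattern):
    """True if the pattern identifies the process at either point in time."""
    return bool(pattern.search(BEFORE_EXEC)) and bool(pattern.search(AFTER_EXEC))


print(matches_both(narrow))  # False
print(matches_both(stable))  # True
```

With a pattern like `stable`, it does not matter whether the first scrape happens before or after the exec, so rechecking is not needed.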

Thanks again for the help @flixr !

mapshen avatar Apr 06 '20 13:04 mapshen

Hi, @ncabatoff , @mapshen .

I have just hit the same problem: after the process is restarted, process-exporter can't find it...

the environment information

  • os: Ubuntu 16.04.6 LTS 4.4.0-117-generic
  • cmdline of the process that I want to monitor: python2.7 /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hue/build/env/bin/hue runcherrypyserver
  • process-exporter version: 0.7.2
  • process-exporter cmdline: process-exporter -config.path process-exporter.yaml -recheck
  • the content of process-exporter's config file:
process_names:
- name: '{{.Matches.Base64_cnVuY2hlcnJ5cHlzZXJ2ZXI}}'
  cmdline:
  - (?P<Base64_cnVuY2hlcnJ5cHlzZXJ2ZXI>runcherrypyserver)
- name: '{{.Matches.Jar}}'
  cmdline:
  - java .*-jar (?:[/A-Za-z0-9_.-]*\/)?(?P<Jar>[A-Za-z0-9_.-]+\.jar)
- name: '{{.Matches.Php}}'
  cmdline:
  - php (?:[/A-Za-z0-9_.-]*\/)?(?P<Php>[A-Za-z0-9_.-]+\.php)
- name: '{{.ExeBase}}'
  cmdline:
  - .+
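To sanity-check which group name a given command line should end up with under this config, the rules can be approximated outside of process-exporter. The following Python sketch makes several assumptions: that rules are tried in order and the first match wins, that each regex is applied unanchored to the space-joined command line, and that `{{.ExeBase}}` is the basename of the executable. The function `group_name` and the example exe paths are hypothetical.

```python
import os
import re

# (capture group used in the name template, cmdline regex), in config order.
# None stands for the '{{.ExeBase}}' fallback rule.
RULES = [
    ("Base64_cnVuY2hlcnJ5cHlzZXJ2ZXI",
     r"(?P<Base64_cnVuY2hlcnJ5cHlzZXJ2ZXI>runcherrypyserver)"),
    ("Jar", r"java .*-jar (?:[/A-Za-z0-9_.-]*\/)?(?P<Jar>[A-Za-z0-9_.-]+\.jar)"),
    ("Php", r"php (?:[/A-Za-z0-9_.-]*\/)?(?P<Php>[A-Za-z0-9_.-]+\.php)"),
    (None, r".+"),
]


def group_name(exe, cmdline):
    """Return the group name of the first rule whose regex matches cmdline."""
    for field, pattern in RULES:
        match = re.search(pattern, cmdline)
        if match is None:
            continue
        return match.group(field) if field else os.path.basename(exe)
    return None


HUE = ("python2.7 /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373"
       "/lib/hue/build/env/bin/hue runcherrypyserver")

print(group_name("/usr/bin/python2.7", HUE))
print(group_name("/usr/bin/java", "java -Xmx1g -jar /opt/app/server.jar"))
print(group_name("/usr/sbin/sshd", "sshd: root@pts/1"))
```

Under these assumptions the hue command line resolves to `runcherrypyserver`, which agrees with the `groupname` label seen in the metrics.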

My operating steps

Step 1: check the num_procs of the target process with curl localhost:9256/metrics:

namedprocess_namegroup_num_procs{groupname="runcherrypyserver"} 1
root@ecs-hn1b-bd-cdp-edg-2:~# ps -ef | grep runcherrypyserver
root      5952 19101  0 15:37 pts/15   00:00:00 grep --color=auto runcherrypyserver
hue      12626 12553  7 Sep23 ?        02:03:55 python2.7 /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hue/build/env/bin/hue runcherrypyserver
root@ecs-hn1b-bd-cdp-edg-2:~#

Step 2: stop the process. As I expected, the num_procs value becomes 0.

namedprocess_namegroup_num_procs{groupname="runcherrypyserver"} 0

Step 3: start the process again. After starting the process, I expected the num_procs value to change back to 1, but it is STILL 0.

namedprocess_namegroup_num_procs{groupname="runcherrypyserver"} 0

But the target process is in fact running.

grearter avatar Sep 24 '20 07:09 grearter

Hi @grearter , we encountered something similar today. Have you solved the problem now? and how did it work? thanks!

jay-wlj avatar Apr 23 '21 02:04 jay-wlj

Hi @grearter , we encountered something similar today. Have you solved the problem now? and how did it work? thanks!

I have fixed the problem by modifying the code.

jay-wlj avatar May 25 '21 06:05 jay-wlj