Overview
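check_leak is a Nagios plugin, written in Perl, that monitors process memory usage. It emits warning and critical notifications a configurable number of hours before the used memory is predicted to reach the system's memory limit.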
Getting the source code
Clone the source from the Git repository with the command:
git clone http://piggledy.org/projects/check_leak/check_leak.git/
Documentation
Arguments:
::

    Syntax: check_leak -a
            check_leak -m <mem> -w <warning time> -c <critical time>

    Options:
      -a  Show all leaking processes
      -m  Memory limit of the system in Mo (megabytes)
      -w  Time to emit a warning notification, in hours before the used
          memory reaches the memory limit
      -c  Time to emit a critical notification, in hours before the used
          memory reaches the memory limit
To predict when the used memory will reach the memory limit, check_leak keeps track of memory consumption information between runs in a file in /tmp/cheak_leak_data.
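The prediction described above can be sketched as follows. This is a minimal illustration in Python, not the plugin's Perl code: the function names, the sample figures, and the choice of 1 Mo = 1024 * 1024 bytes are assumptions. The only external convention used is the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def hours_until_limit(used_bytes, limit_bytes, leak_rate):
    """Hours before used memory reaches the limit at a constant leak rate (bytes/s)."""
    if leak_rate <= 0:
        return float("inf")  # no growth: the limit is never reached
    remaining = max(limit_bytes - used_bytes, 0)
    return remaining / leak_rate / 3600.0


def nagios_status(hours_left, warning_hours, critical_hours):
    """Map the predicted time left to a Nagios status code."""
    if hours_left <= critical_hours:
        return CRITICAL
    if hours_left <= warning_hours:
        return WARNING
    return OK


# Hypothetical figures: 2048 Mo limit (-m), 1024 Mo in use, leaking 138 bytes/s.
limit = 2048 * 1024 * 1024
used = 1024 * 1024 * 1024
left = hours_until_limit(used, limit, 138)
status = nagios_status(left, warning_hours=48, critical_hours=24)
print(f"{left:.0f} hours left, status {status}")
```

With these made-up numbers the limit is months away, so the status stays OK. The same leak with less free memory would cross the -w and then the -c threshold.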
How does it work
Every time Nagios triggers the script, it records the current memory consumption of all processes and computes the mean memory allocation rate of each process. Those values (reported in o/s, that is, bytes per second) can be seen using the -a flag of the check_leak command:
::

    # ./check_leak -a
    Process 6367 is leaking 2 o/s (/usr/sbin/openvpn --config /etc/openvpn/shan.conf --writepid /var/run/openvpn.shan.pid --daemon --setenv SVCNAME openvpn.shan --cd /etc/openvpn --nobind --up-delay --up-restart --script-security 2 --up /etc/openvpn/up.sh --down-pre --down /etc/openvpn/down.sh)
    Process 11980 is leaking 12 o/s (bash)
    Process 7239 is leaking 21 o/s (ssh shan)
    Process 5819 is leaking 22 o/s (/usr/bin/X :0 vt7)
    Process 9610 is leaking 24 o/s (sshfs shan:/home/lids /home/lids/Distant/shan)
    Process 7482 is leaking 138 o/s (/opt/firefox/firefox-bin)
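One way such a per-process rate could be derived from the recorded samples is sketched below in Python. It is only an illustration: the sample layout (timestamp in seconds, memory in bytes) and the function names are assumptions, and the actual plugin may compute its mean differently.

```python
def allocation_rates(samples):
    """Rates in bytes/s between consecutive (timestamp_s, memory_bytes) samples."""
    return [
        (m1 - m0) / (t1 - t0)
        for (t0, m0), (t1, m1) in zip(samples, samples[1:])
        if t1 > t0  # skip duplicate or out-of-order timestamps
    ]


def mean_rate(rates):
    """Arithmetic mean of the rates, or 0.0 when there is no data yet."""
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical history for one process, sampled every 120 seconds.
history = [(0, 1_000_000), (120, 1_002_400), (240, 1_004_800), (360, 1_007_200)]
rates = allocation_rates(history)
print(mean_rate(rates))
```

Here the process grows by 2400 bytes per 120-second interval, so each rate and the mean are 20 bytes per second.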
Since a process can legitimately allocate memory, only linearly increasing memory consumption is taken into account (to reduce false positives).
For example:
This graph shows the memory consumption of a running Firefox session over 1 hour 30 minutes (one record every 2 minutes).

.. image:: index/images/doc/graph.png

* The red curve shows memory usage (in bytes)
* The green curve shows the memory allocation rate (in bytes/s)
* The blue curve shows the allocation rate evolution (in bytes/s^2)
At about 2500 seconds, the browser was actively used, causing a 15 MB allocation. At this point the allocation rate reached a threshold level (of 100), meaning that the allocation rate was not constant. This value is therefore ignored when computing the mean leak rate, so no false positive alert was emitted for this increase in memory footprint.
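The filtering step can be illustrated with a short Python sketch. It assumes that the threshold is applied to the change in allocation rate between two consecutive samples (the blue curve, in bytes/s^2); the text does not state the unit of the value 100, and the function name and sample data are hypothetical.

```python
def steady_rates(rates, interval_s, threshold=100.0):
    """Keep rates whose change from the previous rate stays under the threshold.

    The change is expressed in bytes/s^2. The first rate has no predecessor
    and is not evaluated. Because the absolute change is used, the sample
    that follows a burst is dropped as well.
    """
    kept = []
    for previous, current in zip(rates, rates[1:]):
        change = abs(current - previous) / interval_s
        if change < threshold:
            kept.append(current)
    return kept


# Hypothetical rates sampled every 120 s: a steady 20 bytes/s leak with one
# 15 MB burst (15 * 1024 * 1024 bytes in 120 s = 131072 bytes/s).
observed = [20.0, 20.0, 131072.0, 20.0, 20.0]
kept = steady_rates(observed, interval_s=120)
leak = sum(kept) / len(kept)
print(kept, leak)
```

In this example the burst and the sample right after it are discarded, so the mean leak rate stays at 20 bytes per second instead of being inflated by the one-off allocation.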