watchdogd.conf - watchdogd configuration file
The default watchdogd(8) use-case does not require a configuration file.
However, enabling a health monitor plugin, the process supervisor, or
multiple watchdog device nodes is done using /etc/watchdogd.conf.
Available health monitor plugins:
supervisor
- Process supervisor, monitor the heartbeat of processes
filenr
- File descriptor monitor, also covers sockets, and other descriptor based
resources
fsmon
- File system monitor, checks both available blocks and inodes.
generic
- Generic script monitor
loadavg
- CPU load average monitor
meminfo
- Memory usage monitor
tempmon
- Temperature monitor
This file is a standard UNIX .conf file with sub-sections and '=' for
assignment. The '#' character marks start of a comment to end of line, and the
'\' character can be used as an escape character. Whitespace is ignored,
unless inside a string.
Warning: do not set the WDT timeout and kick interval below too low. The
daemon (usually) runs as a regular ‘SCHED_OTHER’ background task, and the
monitor plugins (as well as your other services) need CPU time as well.
timeout
=
SEC
- The WDT timeout before reset. Default: 20 sec.
interval
=
SEC
- The kick interval, i.e. how often
watchdogd(8)
should reset the WDT timer. Default: 10 sec
safe-exit
=
true |
false
- With safe-exit enabled (true) the daemon will ask the driver to disable
the WDT before exiting (SIGINT). However, some WDT drivers (or HW) may not
support this. Default: true
script
=
/path/to/reboot-action.sh
- Script or command to run instead of reboot when a monitor plugin reaches
any of its critical or warning levels. Setting this disables the
default reboot action on critical; it is therefore up to the script to
perform the reboot, if needed. The script is called as:
script.sh {filenr, loadavg, meminfo} {crit, warn} VALUE
Health monitor plugins also have their own local script setting.
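The calling convention above can be sketched as a small POSIX shell handler. This is only an illustration: the policy (log on warn, reboot on crit, never reboot on loadavg), the function name decide_action, and the logger tag are assumptions, not part of watchdogd.

```shell
#!/bin/sh
# Hypothetical /path/to/reboot-action.sh, called by watchdogd as:
#   script.sh {filenr, loadavg, meminfo} {crit, warn} VALUE

# Print the action this sketch would take for a monitor/level pair.
# The policy below is an assumption made for illustration only.
decide_action() {
    monitor=$1
    level=$2
    case "$monitor:$level" in
        *:warn)       echo "log" ;;
        loadavg:crit) echo "log" ;;     # assumed: never reboot on load alone
        *:crit)       echo "reboot" ;;
        *)            echo "ignore" ;;
    esac
}

# Only act when called with the three documented arguments.
if [ "$#" -ge 3 ]; then
    action=$(decide_action "$1" "$2")
    logger -t reboot-action "monitor=$1 level=$2 value=$3 action=$action"
    if [ "$action" = "reboot" ]; then
        # The built-in reboot is disabled once a script is set,
        # so the script has to reboot the system itself.
        reboot
    fi
fi
```

Keeping the decision in a function that only prints the chosen action makes the policy easy to check by hand before the script is trusted with a reboot.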
device
/path/to/device
{}
watchdogd supports kicking multiple watchdog devices. By default, and with
no command line arguments, /dev/watchdog is used. If that matches your
system, this section is not necessary. This section is only useful if you
want everything in the configuration file or have multiple watchdog
devices. See the EXAMPLE section below.
timeout
=
SEC
- Same as global option.
interval
=
SEC
- Same as global option.
safe-exit
=
true |
false
- Same as global option.
reset-reason
{}
- This section controls the reset reason, including the reset counter. By
default this is disabled, since not all systems allow writing to disk,
e.g. embedded systems using MTD devices with limited number of write
cycles.
enabled
=
true |
false
- Enable or disable storing reset cause, default: disabled
file
=
/var/lib/misc/watchdogd.state
- The default file setting is a non-volatile path, according to the FHS.
It can be changed to another location, but make sure that location is
writable first.
Note: This section was previously called
reset-cause
, which is deprecated and
may be removed in a future release.
supervisor
{}
- Instrumented processes can have their main loop supervised. Processes
subscribe to this service using the libwdog API, see the docs for more on
this. When enabled, watchdogd switches to ‘SCHED_RR’ with elevated
realtime priority. When disabled, it runs as a regular ‘SCHED_OTHER’
process.
enabled
=
true |
false
- Enable or disable supervisor, default: disabled
priority
=
NUM
- The realtime priority. Default: 98
script
=
/path/to/script.sh
- When a supervised process fails to meet its deadline the supervisor by
default performs an unconditional reset, saving the reset cause first.
However, if a script is provided in this section it will be called
instead:
script.sh supervisor CAUSE PID LABEL
The CAUSE value is documented in
watchdogctl(1).
The LABEL can be any free form string
the supervised process used when registering with the supervisor,
hence it is given as the last argument to the script.
The return value of the script determines how the system continues to
operate. POSIX OK (0) means the script has handled the situation in
some manner, and watchdogd stops supervising the offending process. A
non-zero return value means the script has either failed to handle the
situation or prefers to delegate to watchdogd to save the reset cause
and perform the actual system reset.
The global script setting does not apply to this section. However, the
same script can be used, due to the unique first argument.
IMPORTANT:
Calling
watchdogctl(1)
from the script with the fail command
will cause an infinite loop. It is strongly advised to return non-zero
from the script instead.
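A minimal sketch of such a supervisor script follows. The "noncritical-" label prefix is an assumed site convention used only to show the two return paths; the function name handle_supervisor is likewise hypothetical. Note that the script delegates by returning non-zero and never calls watchdogctl(1).

```shell
#!/bin/sh
# Hypothetical supervisor script, called by watchdogd as:
#   script.sh supervisor CAUSE PID LABEL

# Return 0 when the failure is considered handled, non-zero to let
# watchdogd save the reset cause and reset the system.
# The "noncritical-" label prefix is an assumption, not a watchdogd rule.
handle_supervisor() {
    cause=$1
    pid=$2
    label=$3
    case "$label" in
        noncritical-*)
            echo "ignoring $label (pid $pid, cause $cause)"
            return 0
            ;;
        *)
            echo "delegating reset for $label (pid $pid, cause $cause)"
            return 1
            ;;
    esac
}

# Only act when called with the documented argument list.
if [ "${1:-}" = "supervisor" ] && [ "$#" -ge 4 ]; then
    shift
    handle_supervisor "$@"
    exit $?
fi
```

Because the first argument is always "supervisor", the same file could also serve as the global script by adding further branches on that argument.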
filenr
{}
- Monitors file descriptor leaks based on
‘
/proc/sys/fs/file-nr
’.
enabled
=
true |
false
- Enable or disable plugin, default: disabled
interval
=
SEC
- Poll interval, default: 300 sec
logmark
=
true |
false
- Log current stats every poll interval. Default: disabled
warning
=
LEVEL
- High watermark level, alert sent to log.
critical
=
LEVEL
- Critical watermark level, alert sent to log, followed by reboot or
script action.
script
=
/path/to/reboot-action.sh
- Optional script to run instead of reboot if critical watermark level
is reached. If omitted the global
‘
script
’ action is used. The
script is called the same way as the global script, same
arguments.
fsmon
/mountpoint {}
- Monitors a file system using the given path /mountpoint for block and
inode usage. If either exceeds the configured watermarks, action is taken.
Multiple file systems can be monitored using multiple fsmon sections, see
the EXAMPLE section below.
The script is called with the
fsmon
label
as the first argument, and the monitored path and exceeded resource are
available as environment variables:
FSMON_TYPE
- One of 'blocks' or 'inodes' that exceeded the watermark.
FSMON_NAME
- Name of monitored path.
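As an illustration, a script could read the two environment variables like this. It is a sketch under assumptions: the function name describe_fsmon, the message wording, and the logger tag are hypothetical.

```shell
#!/bin/sh
# Hypothetical fsmon script. watchdogd passes "fsmon" as the first
# argument and sets FSMON_TYPE and FSMON_NAME in the environment.

# Describe which resource was exceeded on which monitored path.
describe_fsmon() {
    case "${FSMON_TYPE:-}" in
        blocks) echo "out of disk space on ${FSMON_NAME:-unknown}" ;;
        inodes) echo "out of inodes on ${FSMON_NAME:-unknown}" ;;
        *)      echo "unknown fsmon resource on ${FSMON_NAME:-unknown}" ;;
    esac
}

# Only act when called for the fsmon plugin.
if [ "${1:-}" = "fsmon" ]; then
    logger -t fsmon-action "$(describe_fsmon)"
fi
```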
The settings are the same as the other monitor plugins:
enabled
=
true |
false
- Enable or disable plugin, default: disabled
interval
=
SEC
- Poll interval, default: 300 sec
logmark
=
true |
false
- Log current stats every poll interval. Default: disabled
warning
=
LEVEL
- High watermark level, alert sent to log.
critical
=
LEVEL
- Critical watermark level, alert sent to log, followed by reboot or
script action.
script
=
/path/to/reboot-action.sh
- Optional script to run instead of reboot if critical watermark level
is reached. If omitted the global
‘
script
’ action is used. The
script is called the same way as the global script, same
arguments.
generic
/path/to/monitor-script.sh {}
- Monitor status of a generic script. Called every
interval
seconds, with a deadline of
timeout
seconds. Trigger warning and
critical actions are based on the exit code of the script.
enabled
=
true |
false
- Enable or disable plugin, default: disabled
interval
=
SEC
- How often to run the
monitor-script
, default: 300
sec
timeout
=
SEC
- Maximum runtime of script, in seconds, default: 300 sec
warning
=
VAL
- High watermark level, alert sent to log if exit status from
monitor-script
is greater than or equal
to this value.
critical
=
VAL
- Critical watermark level, alert sent to log, followed by reboot or
script
action if
monitor-script
exit status is
greater than or equal to this value.
monitor-script
=
/path/to/generic-script.sh
(DEPRECATED)
- Monitor script to run every
interval
seconds. This setting is
deprecated in favor of the new syntax:
generic /path/to/monitor-script.sh { ... }
If the new syntax is not used, watchdogd falls back to looking
for this setting.
script
=
/path/to/reboot-action.sh
- Optional script to run instead of reboot if critical watermark level
is reached. If omitted the global
‘
script
’ action is used. The
script is called the same way as the global script, same
arguments.
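Since the generic plugin acts on the exit status of the monitor script, such a script only needs to turn a measurement into a small integer. The sketch below counts entries in a spool directory; the directory /var/spool/myqueue, the file name monitor-script.sh, and the function name count_entries are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical monitor script for the generic plugin. Its exit status
# is what gets compared against the warning and critical values.

# Count non-hidden entries in a directory, capped at 255 because an
# exit status is limited to 8 bits.
count_entries() {
    dir=$1
    n=0
    for f in "$dir"/*; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    if [ "$n" -gt 255 ]; then
        n=255
    fi
    echo "$n"
}

# Run the check only when invoked under the assumed script name.
# /var/spool/myqueue is a made-up, site-specific path.
case "$0" in
    monitor-script.sh|*/monitor-script.sh)
        exit "$(count_entries /var/spool/myqueue)"
        ;;
esac
```

With warning = 10 and critical = 100, as in the EXAMPLE section, ten or more queued entries would log a warning and a hundred or more would trigger the critical action.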
loadavg
{}
- Monitors load average based on
sysinfo(2) from
‘
/proc/loadavg
’. The trigger level
for warning and critical watermarks is composed from the average of the 1
and 5 min marks.
Note: load average is a blunt instrument and
highly use-case dependent. Peak loads of 16.00 on an 8 core system may be
responsive and still useful but 2.00 on a 2 core system may be completely
bogged down. Read up on the subject and test your system before enabling
the critical level.
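To get a feel for suitable watermark values before enabling the critical level, the average of the 1 and 5 minute marks can be computed by hand. This sketch only mirrors the description above; the function name loadavg_level is hypothetical, and how watchdogd computes the value internally is not shown here.

```shell
#!/bin/sh
# Average the 1 and 5 minute fields of a /proc/loadavg style line,
# e.g. "1.00 3.00 2.00 1/100 1234".
loadavg_level() {
    echo "$1" | awk '{ printf "%.2f\n", ($1 + $2) / 2 }'
}

# On a Linux system the current level could be printed with:
#   loadavg_level "$(cat /proc/loadavg)"
```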
enabled
=
true |
false
- Enable or disable plugin, default: disabled
interval
=
SEC
- Poll interval, default: 300 sec
logmark
=
true |
false
- Log current stats every poll interval. Default: disabled
warning
=
LEVEL
- High watermark level, alert sent to log.
critical
=
LEVEL
- Critical watermark level, alert sent to log, followed by reboot or
script action.
script
=
/path/to/reboot-action.sh
- Optional script to run instead of reboot if critical watermark level
is reached. If omitted the global
‘
script
’ action is used. The
script is called the same way as the global script, same
arguments.
meminfo
{}
- Monitors free RAM based on data from
‘
/proc/meminfo
’.
enabled
=
true |
false
- Enable or disable plugin, default: disabled
interval
=
SEC
- Poll interval, default: 300 sec
logmark
=
true |
false
- Log current stats every poll interval. Default: disabled
warning
=
LEVEL
- High watermark level, alert sent to log.
critical
=
LEVEL
- Critical watermark level, alert sent to log, followed by reboot or
script action.
script
=
/path/to/reboot-action.sh
- Optional script to run instead of reboot if critical watermark level
is reached. If omitted the global
‘
script
’ action is used. The
script is called the same way as the global script, same
arguments.
Monitor one or more temperature sensors, both hwmon and thermal supported. The
default warning level is 90% of the declared critical temperature, if a sensor
does not have a declared critical temperature, 100°C is used.
The monitor tracks the last 10 readings and uses the mean temperature in
comparisons with the warning and critical watermarks. The logmark setting
controls whether this is logged; when enabled, logs are emitted every 10th
interval (T x 10).
Note: the critical watermark is disabled by default, i.e., no action is
taken!
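The averaging described above can be sketched as a mean over the latest 10 values. This is an illustration only: the function name mean_last10 is hypothetical, and how watchdogd treats the period before 10 readings have been collected is not stated here, so the sketch simply averages what it is given.

```shell
#!/bin/sh
# Print the mean of the last 10 values given as arguments.
# Fewer than 10 values are averaged as-is (an assumption).
mean_last10() {
    # Drop the oldest readings until at most 10 remain.
    while [ "$#" -gt 10 ]; do
        shift
    done
    if [ "$#" -eq 0 ]; then
        return 1
    fi
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}
```

A single spike therefore moves the mean by only a tenth of its size, which is why a short poll interval is recommended for this monitor.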
tempmon
/path/to/sys/class/sensor {}
- Monitors a given temperature sensor, either a ‘hwmon’ sensor, e.g.,
/sys/class/hwmon/hwmon1/temp1_input
or a ‘thermal’ zone, e.g.,
/sys/class/thermal/thermal_zone1/temp
If the mean temperature over 10 x interval readings exceeds any of the
configured watermarks, action is taken. You likely want to use the custom
script
to, e.g., check a fan
controller, or even
poweroff(8)
the system, unless of course you have firmware that handles this.
enabled
=
true |
false
- Enable or disable plugin, default: disabled
interval
=
SEC
- Sensor poll interval. The monitor uses the mean value over the latest
10 readings, so a lower poll interval is better (and a cheap
operation). E.g., poll every 30 sec, log every 300 seconds,
continuously evaluate against watermarks.
Default:
300
sec. Strongly recommended to
change this!
logmark
=
true |
false
- Log measurements every
10 x
interval
seconds. However, if the mean value rises above a
threshold a warning is logged every interval.
Default: disabled.
warning
=
LEVEL
- High watermark level, used as a percentage of the declared critical
temperature. E.g., say the sensor's critical (or max) value is
128°C and you set
warning
to
0.8 (80%), the trip point is calculated as:
0.8 x 128.0 = 102.4.
When the watermark is reached, an alert is logged and the local, or
global, script is called.
Default:
0.9
, 90% of declared critical
temperature.
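The trip point arithmetic can be checked with a short helper. This is a sketch of the calculation as described above, not the daemon's code; the function name trip_point is hypothetical.

```shell
#!/bin/sh
# Compute a trip point as LEVEL x declared critical temperature,
# both given in the same unit (degrees Celsius in the text above).
trip_point() {
    level=$1
    crit=$2
    awk -v l="$level" -v c="$crit" 'BEGIN { printf "%.1f\n", l * c }'
}

# Example from the text: trip_point 0.8 128.0
# Default level with the 100 degree fallback: trip_point 0.9 100
```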
critical
=
LEVEL
- Critical watermark level, works like
warning
, except for the action. An
emergency alert is logged, followed by reboot or script action.
Default:
0.0
, meaning no action is taken!
I.e., it is up to the operator to define the level at which to take
action. (Some systems have firmware that automatically powers off to
self-protect.)
script
=
/path/to/script.sh
- Optional script to run instead of reboot if critical watermark level
is reached. If omitted the global
‘
script
’ action is used. The
script is called the same way as the global script, same
arguments.
The temperature data for all sensors is cached to a JSON file that is
updated atomically every five seconds, when at least one temp monitor is
active. The format is not guaranteed to be stable between releases, but
will most likely be anyway. See ‘/run/watchdogd/tempmon.json’.
### /etc/watchdogd.conf
### Watchdogs ##########################################################
# Global settings that can be overridden per watchdog
timeout = 20
interval = 10
safe-exit = true
# Multiple watchdogs can be kicked, the default, even if no .conf file
# is found or device node given on the command line, is /dev/watchdog
device /dev/watchdog {
timeout = 20
interval = 10
safe-exit = true
}
#device /dev/watchdog2 {
# timeout = 20
# interval = 10
# safe-exit = true
#}
### Process Supervisor #################################################
supervisor {
enabled = true
priority = 98
}
### Reset Reason #######################################################
reset-reason {
enabled = true
file = "/var/lib/misc/watchdogd.state"
}
### Checkers/Monitors ##################################################
#
# Script or command to run instead of reboot when a monitor plugin
# reaches any of its critical or warning levels. Setting this will
# disable the built-in reboot on critical, it is therefore up to the
# script to perform reboot, if needed. The script is called as:
#
# script.sh {filenr, loadavg, meminfo} {crit, warn} VALUE
#
#script = "/path/to/reboot-action.sh"
# Monitors file descriptor leaks based on /proc/sys/fs/file-nr
filenr {
enabled = true
interval = 300
logmark = false
warning = 0.9
critical = 0.95
# script = "/path/to/alt-reboot-action.sh"
}
# Monitors a file system, blocks and inode usage against watermarks
# The script is called with fsmon as the first argument and there
# are two environment variables FSMON_NAME, for the monitored path,
# and FSMON_TYPE indicating either 'blocks' or 'inodes'.
fsmon /var {
enabled = true
interval = 60
logmark = true
warning = 0.95
critical = 1.0
# script = "/path/to/alt-reboot-action.sh"
}
fsmon /tmp {
enabled = true
interval = 300
logmark = false
warning = 0.95
critical = 1.0
# script = "/path/to/alt-reboot-action.sh"
}
# Generic site-specific script
generic /path/to/monitor-script.sh {
enabled = true
interval = 60
timeout = 10
warning = 10
critical = 100
# script = "/path/to/alt-reboot-action.sh"
}
# Monitors load average based on sysinfo() from /proc/loadavg
# The level is composed from the average of the 1 and 5 min marks.
loadavg {
enabled = true
interval = 300
logmark = false
warning = 1.0
critical = 2.0
# script = "/path/to/alt-reboot-action.sh"
}
# Monitors free RAM based on data from /proc/meminfo
meminfo {
enabled = true
interval = 300
logmark = false
warning = 0.9
critical = 0.95
# script = "/path/to/alt-reboot-action.sh"
}
# Monitor temperature. The critical value is unset by default, so no
# action is taken at that watermark (by default). Both the critical and
# warning watermarks are relative to the trip/critical/max value from
# sysfs. The warning is default 0.9, i.e., 90% of critical. Use script
# to reset the fan controller or poweroff(8) the system.
#
# Each temp monitor caches the last 10 values, calculates the mean, and
# compares that to the warning and critical levels. Logging of stats,
# the logmark setting, is only done every 10 x interval (if enabled),
# while warnings and critical messages are logged every interval.
tempmon /sys/class/hwmon/hwmon1/temp1_input {
enabled = true
interval = 30
# warning = 0.9
logmark = true
# script = "/script/to/log/and/poweroff.sh"
}
watchdogd(8)
watchdogctl(1)
watchdogd.conf
is an improved version of the
original, created by Michele d'Amico and adapted to uClinux-dist by Mike
Frysinger. It is maintained by Joachim Wiberg at
GitHub.