NAME

watchdogd.conf - watchdogd configuration file

DESCRIPTION

The default watchdogd(8) use-case does not require a configuration file. However, enabling a health monitor plugin, the process supervisor, or multiple watchdog device nodes, is done using /etc/watchdogd.conf.

Available health monitor plugins:

supervisor: Process supervisor, monitors the heartbeat of processes
filenr: File descriptor monitor, also covers sockets and other descriptor-based resources
fsmon: File system monitor, checks both available blocks and inodes
generic: Generic script monitor
loadavg: CPU load average monitor
meminfo: Memory usage monitor
tempmon: Temperature monitor

SYNTAX

This file is a standard UNIX .conf file with sub-sections and '=' for assignment. The '#' character starts a comment that runs to the end of the line, and the '\' character can be used as an escape character. Whitespace is ignored, unless inside a string.

Warning: do not set the below WDT timeout and kick interval too low. The daemon (usually) runs as a regular ‘SCHED_OTHER’ background task and the monitor plugins (as well as your other services) need CPU time as well.

timeout = SEC

The WDT timeout before reset. Default: 20 sec.

interval = SEC

The kick interval, i.e. how often watchdogd(8) should reset the WDT timer. Default: 10 sec

safe-exit = true | false

With safe-exit enabled (true) the daemon will ask the driver to disable the WDT before exiting (SIGINT). However, some WDT drivers (or HW) may not support this. Default: true

script = /path/to/reboot-action.sh

Script or command to run instead of reboot when a monitor plugin reaches its critical or warning level. Setting this disables the default reboot action on critical, so it is up to the script to perform the reboot, if needed. The script is called as:

script.sh {filenr, loadavg, meminfo} {crit, warn} VALUE

Health monitor plugins also have their own local script setting.
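
A minimal sketch of such a script, assuming a POSIX shell. Only the three positional arguments come from the calling convention above. The function name decide_action, the log tag, and the policy (log on warn, reboot on crit) are illustrative assumptions, not part of watchdogd:

```shell
#!/bin/sh
# Hypothetical reboot-action script. watchdogd passes: PLUGIN LEVEL VALUE

# decide_action LEVEL: print the action this sketch takes for a level.
decide_action() {
    case "$1" in
        warn) echo "log" ;;
        crit) echo "reboot" ;;
        *)    echo "ignore" ;;
    esac
}

# Only act when called with at least PLUGIN and LEVEL.
if [ "$#" -ge 2 ]; then
    plugin=$1 level=$2 value=${3:-unknown}
    action=$(decide_action "$level")
    logger -t reboot-action "$plugin $level value=$value action=$action"
    # The default reboot is disabled once a script is set, so do it here.
    if [ "$action" = "reboot" ]; then
        reboot
    fi
fi
```

Keeping the decision in a small function makes the policy easy to try by hand, e.g. decide_action crit, without triggering a reboot.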

device /path/to/device {}

watchdogd supports kicking multiple watchdog devices. By default, and with no command line arguments, /dev/watchdog is used. If that matches your system, this section is not necessary. It is only useful if you want everything in the configuration file or have multiple watchdog devices. See the EXAMPLE section below.

timeout = SEC: Same as global option.
interval = SEC: Same as global option.
safe-exit = true | false: Same as global option.

reset-reason {}

This section controls the reset reason, including the reset counter. By default this is disabled, since not all systems allow writing to disk, e.g. embedded systems using MTD devices with limited number of write cycles.

enabled = true | false: Enable or disable storing reset cause, default: disabled
file = /var/lib/misc/watchdogd.state: The default file setting is a non-volatile path, according to the FHS. It can be changed to another location, but make sure that location is writable first.

Note: This section was previously called reset-cause, which is deprecated and may be removed in a future release.

Process Supervisor

supervisor {}

Instrumented processes can have their main loop supervised. Processes subscribe to this service using the libwdog API, see the docs for more on this. When enabled watchdogd switches to ‘SCHED_RR’ with elevated realtime priority. When disabled it runs as a regular ‘SCHED_OTHER’ process.

enabled = true | false

Enable or disable supervisor, default: disabled

priority = NUM

The realtime priority. Default: 98

script = /path/to/script.sh

When a supervised process fails to meet its deadline the supervisor by default performs an unconditional reset, saving the reset cause first. However, if a script is provided in this section it will be called instead:

script.sh supervisor CAUSE PID LABEL

The CAUSE value is documented in watchdogctl(1).

The LABEL can be any free form string the supervised process used when registering with the supervisor, hence it is given as the last argument to the script.

The return value of the script determines how the system continues to operate. POSIX OK (0) means the script has handled the situation in some manner, and watchdogd stops supervising the offending process. A non-zero return value means the script has either failed to handle the situation or prefers to delegate to watchdogd to save the reset cause and perform the actual system reset.

The global script setting does not apply to this section. However, the same script can be used, due to the unique first argument.

IMPORTANT: Calling watchdogctl(1) from the script with the fail command will cause an infinite loop. It is strongly advised to return non-zero from the script instead.
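
The return value contract can be sketched as follows, assuming a POSIX shell. The argument order is taken from the calling convention above. The function name handle_failure, the log tag, and the label "noncritical" are made up for illustration:

```shell
#!/bin/sh
# Hypothetical supervisor script. watchdogd passes: supervisor CAUSE PID LABEL

# handle_failure LABEL: return 0 if the failure is handled here,
# non-zero to delegate the reset back to watchdogd.
handle_failure() {
    case "$1" in
        noncritical) return 0 ;;  # handled, watchdogd stops supervising it
        *)           return 1 ;;  # watchdogd saves the cause and resets
    esac
}

# Only act when called the way the supervisor calls its script.
if [ "${1:-}" = "supervisor" ] && [ "$#" -ge 4 ]; then
    cause=$2 pid=$3 label=$4
    logger -t supervisor-action "cause=$cause pid=$pid label=$label"
    handle_failure "$label"
    exit $?
fi
```

Note that the sketch returns non-zero to request a reset instead of calling watchdogctl(1), in line with the warning above.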

File Descriptor Monitor

filenr {}

Monitors file descriptor leaks based on ‘/proc/sys/fs/file-nr’.

enabled = true | false: Enable or disable plugin, default: disabled
interval = SEC: Poll interval, default: 300 sec
logmark = true | false: Log current stats every poll interval. Default: disabled
warning = LEVEL: High watermark level, alert sent to log.
critical = LEVEL: Critical watermark level, alert sent to log, followed by reboot or script action.
script = /path/to/reboot-action.sh: Optional script to run instead of reboot if critical watermark level is reached. If omitted the global ‘script’ action is used. The script is called the same way as the global script, same arguments.

File System Monitor

fsmon /mountpoint {}

Monitors a file system using the given path /mountpoint for block and inode usage. If either exceeds the configured watermarks, action is taken. Multiple file systems can be monitored using multiple fsmon sections, see the EXAMPLE section below.

The script is called with the fsmon label as the first argument, and the monitored path and exceeded resource are available as environment variables:

FSMON_TYPE: One of 'blocks' or 'inodes' that exceeded the watermark.
FSMON_NAME: Name of monitored path.

The settings are the same as the other monitor plugins:

enabled = true | false: Enable or disable plugin, default: disabled
interval = SEC: Poll interval, default: 300 sec
logmark = true | false: Log current stats every poll interval. Default: disabled
warning = LEVEL: High watermark level, alert sent to log.
critical = LEVEL: Critical watermark level, alert sent to log, followed by reboot or script action.
script = /path/to/reboot-action.sh: Optional script to run instead of reboot if critical watermark level is reached. If omitted the global ‘script’ action is used. The script is called the same way as the global script, same arguments.
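
A script can pick up the fsmon environment variables like this. This is a minimal sketch, assuming a POSIX shell and that the second and third arguments are the level and value, as for the global script. The function name describe_fsmon and the log tag are made up for illustration:

```shell
#!/bin/sh
# Hypothetical fsmon script. First argument is the fsmon label.

# describe_fsmon: build a log message from FSMON_TYPE and FSMON_NAME.
describe_fsmon() {
    echo "fsmon: ${FSMON_TYPE:-unknown} exceeded on ${FSMON_NAME:-unknown}"
}

# Only log when called with the fsmon label as the first argument.
if [ "${1:-}" = "fsmon" ]; then
    logger -t fsmon-action "$(describe_fsmon) level=${2:-} value=${3:-}"
fi
```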

Generic Script Monitor

generic /path/to/monitor-script.sh {}

Monitors the status of a generic script. The script is called every interval seconds, with a deadline of timeout seconds. Warning and critical actions are triggered based on the exit code of the script.

enabled = true | false

Enable or disable plugin, default: disabled

interval = SEC

How often to run the monitor-script, default: 300 sec

timeout = SEC

Maximum runtime of script, in seconds, default: 300 sec

warning = VAL

High watermark level, alert sent to log if exit status from monitor-script is greater than or equal to this value.

critical = VAL

Critical watermark level, alert sent to log, followed by reboot or script action if monitor-script exit status is greater than or equal to this value.

monitor-script = /path/to/generic-script.sh (DEPRECATED)

Monitor script to run every interval seconds. This setting is deprecated in favor of the new syntax:

generic /path/to/monitor-script.sh { ... }

If the new syntax is not used, watchdogd falls back to looking for this setting.

script = /path/to/reboot-action.sh

Optional script to run instead of reboot if critical watermark level is reached. If omitted the global ‘script’ action is used. The script is called the same way as the global script, same arguments.
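
Since the warning and critical levels are compared against the exit status, a monitor script can report a small count as its result. This is a minimal sketch, assuming a POSIX shell. The function name count_missing is made up, and the paths are passed as arguments only for illustration. How arguments would reach the script from watchdogd is not covered here, so a real script would more likely hard-code its checks:

```shell
#!/bin/sh
# Hypothetical monitor script: exit status = number of missing paths.
# Exit statuses are limited to 0-255, so keep the count small.

# count_missing PATH...: print how many of the given paths do not exist.
count_missing() {
    missing=0
    for path in "$@"; do
        [ -e "$path" ] || missing=$((missing + 1))
    done
    echo "$missing"
}

# When run with paths as arguments, report the count as exit status.
if [ "$#" -gt 0 ]; then
    exit "$(count_missing "$@")"
fi
```

With warning = 1 and critical = 3, for example, one missing path would only be logged while three or more would trigger the critical action.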

CPU Load Average Monitor

loadavg {}

Monitors load average based on sysinfo(2) from ‘/proc/loadavg’. The trigger level for warning and critical watermarks is composed from the average of the 1 and 5 min marks.

Note: load average is a blunt instrument and highly use-case dependent. A system with 8 cores may still be responsive and useful at peak loads of 16.00, while a 2-core system may be completely bogged down at 2.00. Read up on the subject and test your system before enabling the critical level.

enabled = true | false: Enable or disable plugin, default: disabled
interval = SEC: Poll interval, default: 300 sec
logmark = true | false: Log current stats every poll interval. Default: disabled
warning = LEVEL: High watermark level, alert sent to log.
critical = LEVEL: Critical watermark level, alert sent to log, followed by reboot or script action.
script = /path/to/reboot-action.sh: Optional script to run instead of reboot if critical watermark level is reached. If omitted the global ‘script’ action is used. The script is called the same way as the global script, same arguments.
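
The trigger level described above can be reproduced by hand as a sanity check before choosing watermarks. This is a minimal sketch, assuming a POSIX shell with awk and that the first two fields of a /proc/loadavg line are the 1 and 5 min marks. The function name load_level is made up and says nothing about how watchdogd computes the value internally:

```shell
# load_level "LINE": average the first two fields of a loadavg line,
# i.e., the 1 and 5 min marks, and print the result with two decimals.
load_level() {
    echo "$1" | awk '{ printf "%.2f\n", ($1 + $2) / 2 }'
}
```

For example, load_level "1.50 0.50 0.25 1/100 1234" prints 1.00, and load_level "$(cat /proc/loadavg)" shows the current level on a Linux system.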

Memory Usage Monitor

meminfo {}

Monitors free RAM based on data from ‘/proc/meminfo’.

enabled = true | false: Enable or disable plugin, default: disabled
interval = SEC: Poll interval, default: 300 sec
logmark = true | false: Log current stats every poll interval. Default: disabled
warning = LEVEL: High watermark level, alert sent to log.
critical = LEVEL: Critical watermark level, alert sent to log, followed by reboot or script action.
script = /path/to/reboot-action.sh: Optional script to run instead of reboot if critical watermark level is reached. If omitted the global ‘script’ action is used. The script is called the same way as the global script, same arguments.

Temperature Monitor

Monitors one or more temperature sensors. Both hwmon and thermal sensors are supported. The default warning level is 90% of the declared critical temperature. If a sensor does not have a declared critical temperature, 100°C is used.

The monitor tracks the last 10 readings and uses the mean temperature in comparisons with the warning and critical watermarks. The logmark setting controls whether the mean is logged. When enabled, logs are emitted every 10th interval (T x 10).

Note: the critical watermark is disabled by default, i.e., no action is taken!

tempmon /path/to/sys/class/sensor {}

Monitors a given temperature sensor, either:

hwmon: e.g., /sys/class/hwmon/hwmon1/temp1_input, or
thermal: e.g., /sys/class/thermal/thermal_zone1/temp

If the mean temperature over the last 10 readings exceeds any of the configured watermarks, action is taken. You likely want to use the custom script to, e.g., check a fan controller, or even poweroff(8) the system, unless of course you have firmware that handles this.

enabled = true | false: Enable or disable plugin, default: disabled
interval = SEC: Sensor poll interval. The monitor uses the mean value over the latest 10 readings, so a lower poll interval is better (and a cheap operation). E.g., poll every 30 sec, log every 300 seconds, continuously evaluate against watermarks.
Default: 300 sec. Strongly recommended to change this!
logmark = true | false: Log measurements every 10 x interval seconds. However, if the mean value rises above a threshold a warning is logged every interval.
Default: disabled.
warning = LEVEL: High watermark level, used as percentage of the declared critical temperature. E.g., say the sensor's critical (or max) value is 128°C and you set warning to 0.8 (80%), the trip point is calculated as: 0.8 x 128.0 = 102.4.
When the watermark is reached, an alert is logged and the local, or global, script is called.
Default: 0.9, 90% of declared critical temperature.
critical = LEVEL: Critical watermark level, works like warning, except for the action. An emergency alert is logged, followed by reboot or script action.
Default: 0.0, meaning no action is taken! I.e., it is up to the operator to define the level at which to take action. (Some systems have firmware that automatically power-off to self-protect.)
script = /path/to/script.sh: Optional script to run instead of reboot if critical watermark level is reached. If omitted the global ‘script’ action is used. The script is called the same way as the global script, same arguments.
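
The trip point arithmetic and the mean over recent readings can be checked by hand. This is a minimal sketch, assuming a POSIX shell with awk and readings already expressed in °C. The function names trip_point and mean_temp are made up for illustration and do not reflect how watchdogd implements this internally:

```shell
# trip_point LEVEL CRIT: watermark in degrees C, with LEVEL given as a
# fraction of the declared critical temperature CRIT.
trip_point() {
    awk -v level="$1" -v crit="$2" 'BEGIN { printf "%.1f\n", level * crit }'
}

# mean_temp READING...: mean of the given readings, one decimal.
# The monitor described above keeps the latest 10 readings.
mean_temp() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}
```

For example, trip_point 0.8 128.0 prints 102.4, matching the calculation above, and mean_temp 100 102 104 prints 102.0.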

The temperature data for all sensors is cached to a JSON file that is updated atomically every five seconds when at least one temp monitor is active. The format is not guaranteed to be stable between releases, but will most likely remain so. See ‘/run/watchdogd/tempmon.json’.

EXAMPLE

### /etc/watchdogd.conf 
 
### Watchdogs ########################################################## 
# Global settings that can be overridden per watchdog 
timeout   = 20 
interval  = 10 
safe-exit = true 
 
# Multiple watchdogs can be kicked, the default, even if no .conf file 
# is found or device node given on the command line, is /dev/watchdog 
device /dev/watchdog { 
    timeout    = 20 
    interval   = 10 
    safe-exit  = true 
} 
 
#device /dev/watchdog2 { 
#    timeout    = 20 
#    interval   = 10 
#    safe-exit  = true 
#} 
 
### Process Supervisor ################################################# 
supervisor { 
    enabled  = true 
    priority = 98 
} 
 
### Reset Reason ####################################################### 
reset-reason { 
    enabled = true 
    file    = "/var/lib/misc/watchdogd.state" 
} 
 
### Checkers/Monitors ################################################## 
# 
# Script or command to run instead of reboot when a monitor plugin 
# reaches any of its critical or warning level.  Setting this will 
# disable the built-in reboot on critical, it is therefore up to the 
# script to perform reboot, if needed.  The script is called as: 
# 
#    script.sh {filenr, loadavg, meminfo} {crit, warn} VALUE 
# 
#script = "/path/to/reboot-action.sh" 
 
# Monitors file descriptor leaks based on /proc/sys/fs/file-nr 
filenr { 
    enabled  = true 
    interval = 300 
    logmark  = false 
    warning  = 0.9 
    critical = 0.95 
#    script = "/path/to/alt-reboot-action.sh" 
} 
 
# Monitors a file system, blocks and inode usage against watermarks 
# The script is called with fsmon as the first argument and there 
# are two environment variables FSMON_NAME, for the monitored path, 
# and FSMON_TYPE indicating either 'blocks' or 'inodes'. 
fsmon /var { 
    enabled = true 
    interval = 60 
    logmark  = true 
    warning  = 0.95 
    critical = 1.0 
#    script = "/path/to/alt-reboot-action.sh" 
} 
 
fsmon /tmp { 
    enabled = true 
    interval = 300 
    logmark  = false 
    warning  = 0.95 
    critical = 1.0 
#    script = "/path/to/alt-reboot-action.sh" 
} 
 
# Generic site-specific script 
generic /path/to/monitor-script.sh { 
    enabled  = true 
    interval = 60 
    timeout = 10 
    warning = 10 
    critical = 100 
#    script = "/path/to/alt-reboot-action.sh" 
} 
 
# Monitors load average based on sysinfo() from /proc/loadavg 
# The level is composed from the average of the 1 and 5 min marks. 
loadavg { 
    enabled  = true 
    interval = 300 
    logmark  = false 
    warning  = 1.0 
    critical = 2.0 
#    script = "/path/to/alt-reboot-action.sh" 
} 
 
# Monitors free RAM based on data from /proc/meminfo 
meminfo { 
    enabled  = true 
    interval = 300 
    logmark  = false 
    warning  = 0.9 
    critical = 0.95 
#    script = "/path/to/alt-reboot-action.sh" 
} 
 
# Monitor temperature.  The critical value is unset by default, so no 
# action is taken at that watermark (by default).  Both the critical and 
# warning watermarks are relative to the trip/critical/max value from 
# sysfs.  The warning is default 0.9, i.e., 90% of critical.  Use script 
# to reset the fan controller or poweroff(8) the system. 
# 
# Each temp monitor caches the last 10 values, calculates the mean, and 
# compares that to the warning and critical levels.  Logging of stats, 
# the logmark setting, is only done every 10 x interval (if enabled), 
# while warnings and critical messages are logged every interval. 
tempmon /sys/class/hwmon/hwmon1/temp1_input { 
    enabled  = true 
    interval = 30 
#    warning  = 0.9 
    logmark  = true 
#    script   = "/script/to/log/and/poweroff.sh" 
}

AUTHORS

watchdogd.conf is an improved version of the original, created by Michele d'Amico and adapted to uClinux-dist by Mike Frysinger. It is maintained by Joachim Wiberg at GitHub.