Operating System KBs

Revision as of 08:36, 12 July 2006 by Moff

Overview

The primary aim of the operating system knowledge bases in Sentinel3G is to provide a base level of operations monitoring that is consistent across various UNIX/Linux platforms. Due to differences between the various operating systems we monitor, complete consistency is not always achievable. This document describes the general content of the OS knowledge bases, and the discrepancies between them on different platforms.


Standard Knowledge Base

The standard knowledge base is OS-independent, so it is packaged with Sentinel3G on all operating systems. It can be upgraded, but not uninstalled.

Sentry AIX HPUX Linux Solaris Tru64 Unixware Windows
Connectivity¹
Event_Manager¹
Host_Monitor
Scheduler²

¹ Connectivity and Event_Manager sentries are only started on the Event Host.
² Scheduler sentry is not started by default. Please read the online documentation for details on how to use the Scheduler sentry.


OS Knowledge Base Versions

OS Version Availability Date Min Sentinel Version
AIX risc 2.1 17th Mar, 2004 4.4
HPUX parisc 2.1 11th May, 2004 4.4
HPUX intel 2.2 6th Jul, 2006 4.4
Linux intel 2.1 12th Mar, 2004 4.4
Solaris intel 2.1 14th Jan, 2003 4.4
Solaris sparc 2.2 14th Feb, 2006 4.4
Tru64 alpha 2.1 2nd Jun, 2004 4.4
Unixware intel 1.0 10th Mar, 2004 4.2
Windows intel 2.2 14th Feb, 2006 4.4


OS Knowledge Bases

CPU Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
CPU_States¹
Context_Switches
Interrupts
Run_Queue
Processors
System_Calls
NOTE
Certain operating systems do not provide all the CPU statistics by default, and collecting them may require kernel patches or third party collection tools. Solaris requires packages SUNWaccr and SUNWaccu. Tru64 requires …

¹ All operating systems monitor % System, % User and % Idle CPU time; some OSes provide more information:

OS More CPU_States Information Description
AIX, Solaris, Tru64 % Wait IO The amount of time spent waiting for blocked I/O to complete.
Linux % Nice CPU The percentage of time that the system is in the user state running processes at low (nice) scheduling priority.
HPUX


Disk Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Disk


Error Log Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Error_Log


Filesystem Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Filesystem √¹

¹ AIX provides two sentries for free space monitoring: one specifically for /usr (with less sensitive thresholds) and another for the other filesystems.


Memory Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Paging_Rate
Physical_Memory
Swap_Rate
Swap_Space
NOTE
Certain operating systems do not provide all the memory statistics, as some may not be relevant (e.g. Swap_Rate on Tru64 and AIX).


Network Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Collisions
Drops
Errors
Packets_Received √¹ √¹
Packets_Sent √¹ √¹

¹ On Solaris and Linux, the Packets_Sent and Packets_Received sentries are named simply Sent and Received.


Printers Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Printers √¹

¹ The Solaris Printer class is "off" by default because of an intermittent issue with the printer agent. The symptom is excessive CPU usage by the Eventmanager. This only occurs on a very small number of systems.


Processes Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
CPU_Usage
MEM_Usage
Processes¹
NOTE
All OS Knowledge Bases support the Process Management Console, provided as an action against the Processes sentry class.

¹ On certain OSes the Processes sentry is turned off by default. Certain instances are provided as examples (nmdb, smdb) only, but should be changed to reflect the system on which the KB is installed. Note also that system services (daemons) are normally monitored via the Services sentry, so check in the Services folder before adding processes to be monitored.


Security Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Bad_SU √¹

¹ On Linux, the Bad_SU sentry is not started by default, as it needs specific configuration to work correctly. Please read the sentry notes for more information on how to configure this sentry.


Services Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
Services √¹ √²

¹ AIX provides a complete service management interface using lssrc, startsrc and stopsrc. This interface has been implemented via actions on the Services sentry on AIX.

² Linux provides a complete service management interface using chkconfig and the startup/shutdown scripts in /etc/init.d (/etc/rc.d/init.d on older systems). This interface has been implemented via actions on the Services sentry on Linux.


System Class

Sentry AIX HPUX Linux SCO Solaris Tru64 Windows
CPU_Information √¹
Memory_Information
Operating_System
System_Uptime

¹ The Linux OS on the i386 platform provides additional CPU information including the approximate speed and vendor of the processors.


Sentry Details

Overview

Sentry Class Agent Poll Time States Logging
CPU_States CPU Performance 60s AIX only
Context_Switches CPU Performance 60s
Interrupts CPU Performance 60s
Run_Queue CPU Performance 60s
Processors CPU/Processors MultiProcessor 60s
System_Calls CPU Performance 60s
Disk Disk Disk 120s
Error_Log Error_Log ErrorLog 120s
Filesystem Filesystem Filesystem 300s
Paging_Rate Memory Performance 60s AIX only
Physical_Memory Memory Performance 60s Solaris only
Swap_Rate Memory Performance 60s
Swap_Space (Linux) Memory Performance 60s
Swap_Space (non-Linux) Memory Swap 180s
Network Network Network 120s
Printers Printers Printer 180s Solaris only
CPU_Usage Processes ProcessInfo 75s
MEM_Usage Processes ProcessInfo 75s
Processes Processes ProcessInfo 75s
Bad_SU Security BadSU n/a²
Services Services Service 120s
CPU_Information System Information n/a³

¹ On Solaris and Linux, the Packets_Sent and Packets_Received sentries are named simply Sent and Received.
² The BadSU agent is a LogFile agent, and so does not have a poll time. Any new data is interpreted whenever the logfile being monitored changes.
³ The Information agent is essentially run only once.
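
The change-driven behaviour described in footnote ² can be illustrated with a small Python sketch. This is not the LogFile agent's implementation, which this page does not describe; it only shows one common way to pick up newly appended log data without a fixed poll time, by remembering a byte offset. The function name read_new_lines and the idea of storing an offset are assumptions made for illustration.

```python
import os


def read_new_lines(path, offset):
    """Return (new_lines, new_offset) for data appended since `offset`.

    Hypothetical helper: a caller would invoke this whenever it notices
    that the monitored log file has changed.
    """
    size = os.path.getsize(path)
    if size < offset:
        # The file shrank, so it was probably truncated or rotated.
        offset = 0
    if size == offset:
        return [], offset
    # Binary mode keeps the offset arithmetic in bytes.
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(data)
```

A caller keeps the returned offset and passes it back on the next change, so each log entry is interpreted once.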


Sentry State Details

CPU States Sentry

Availability
AIX¹, HPUX, Linux, SCO, Solaris, Tru64, Windows

Constants (AIX only)

Constant Description Value
CPU_BUSY User + System percentage indicating the CPU is busy 90
CPU_OVERLOADED User + System percentage indicating the CPU is overloaded 95

States (AIX only)

State Severity Condition Escalation
OVERLOAD_CPU warning $cpu_user + $cpu_system > $CPU_OVERLOADED severe after 120s
BUSY_CPU normal $cpu_user + $cpu_system > $CPU_BUSY warning after 120s
NOT_BUSY normal

¹ The CPU States sentry only has constants and states defined for AIX.
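
As a minimal sketch of how the AIX conditions above might be read, the following Python function classifies one CPU sample. It assumes the states are tested in the order listed and that the first matching condition wins, which the table layout suggests but this page does not state. The function name and its arguments are illustrative and are not part of Sentinel3G; the thresholds are the constants listed above.

```python
# Thresholds from the AIX constants table.
CPU_BUSY = 90
CPU_OVERLOADED = 95


def cpu_state(cpu_user, cpu_system):
    """Return the AIX CPU_States state name for one sample.

    Assumption: conditions are tested top to bottom and the first
    match wins; NOT_BUSY is the fallback with no condition.
    """
    busy = cpu_user + cpu_system
    if busy > CPU_OVERLOADED:
        return "OVERLOAD_CPU"
    if busy > CPU_BUSY:
        return "BUSY_CPU"
    return "NOT_BUSY"
```

Because the comparisons are strict, a sample of exactly 90% user plus system time is still NOT_BUSY under this reading.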


Run Queue Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64, Windows

Constants

Constant Description Value
RUNQ_WARN Run queue is getting long 3
RUNQ_PROB Run queue is too long 6

States (HPUX, Linux, SCO, Solaris, Tru64, Windows)

State Severity Condition Escalation
Very_Busy alarm $run_queue > $RUNQ_PROB alarm after 210s
Busy normal $run_queue > $RUNQ_WARN warning after 210s
OK normal

States (AIX only)

State Severity Condition Escalation
OVERLOAD warning $run_queue > $RUNQ_PROB severe after 120s
BUSY normal $run_queue > $RUNQ_WARN warning after 120s
NORMAL normal


System Calls Sentry

Availability
AIX, HPUX, SCO, Tru64, Windows

Constants

Constant Description Value
CPU_SYSCALLS Too many system calls per second 10000

States

State Severity Condition Escalation
BUSY normal $sys_per_sec > $CPU_SYSCALLS alarm after 120s
NORMAL normal

Disk Sentry

Availability
AIX¹, HPUX, SCO, Solaris, Tru64, Windows

Constants (Solaris, Tru64)

Constant Description Value
DSK_BUSY_WARN % busy indicating disk is busy 5
DSK_BUSY_PROB % busy indicating disk is very busy 20
DSK_SVCT_WARN Indicates a long service time (ms) 30
DSK_SVCT_PROB Indicates a very long service time (ms) 50

States (Solaris, Tru64)

State Severity Condition Escalation
DSK_VERYBUSY alarm $percent_busy >= $DSK_BUSY_PROB && $service_time >= $DSK_SVCT_PROB
DSK_BUSY warning $percent_busy >= $DSK_BUSY_WARN && $service_time >= $DSK_SVCT_WARN
DSK_NORMAL normal

Constants (AIX, Windows)

Constant Description Value
DSK_BUSY_WARN % busy indicating disk is busy 40
DSK_BUSY_PROB % busy indicating disk is very busy 60

States (AIX only)

State Severity Condition Escalation
DSK_VERYBUSY warning $percent_busy > $DSK_BUSY_PROB severe after 120s
DSK_BUSY normal $percent_busy > $DSK_BUSY_WARN warning after 120s
DSK_NORMAL normal

¹ The service time statistic is not available on AIX. This is unfortunate, because service time is a better indicator of disk I/O performance than percent busy: even if a disk is 100% busy, there is no real problem unless the service time for the disk is also getting high.
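
The role of service time can be seen by sketching the Solaris/Tru64 conditions in Python. This is an illustration only, assuming the states are evaluated in table order with the first match winning; the function name disk_state is hypothetical, while the thresholds are the Solaris/Tru64 constants listed above.

```python
# Solaris/Tru64 thresholds from the constants table.
DSK_BUSY_WARN, DSK_BUSY_PROB = 5, 20   # percent busy
DSK_SVCT_WARN, DSK_SVCT_PROB = 30, 50  # service time in ms


def disk_state(percent_busy, service_time):
    """Classify one disk sample; both statistics must exceed a threshold pair."""
    if percent_busy >= DSK_BUSY_PROB and service_time >= DSK_SVCT_PROB:
        return "DSK_VERYBUSY"
    if percent_busy >= DSK_BUSY_WARN and service_time >= DSK_SVCT_WARN:
        return "DSK_BUSY"
    return "DSK_NORMAL"
```

Under this reading, a disk that is 100% busy with a 10 ms service time stays in DSK_NORMAL, which matches the reasoning in the footnote.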


Error Log Sentry

Availability
AIX only
NOTE
Certain error log entries are ignored by Sentinel3G. The list of error codes can be found in a file called exclude_errors in the distrib.db folder of the Sentinel installation (/usr/lpp/cosmos/sentinel_4.2/distrib.db by default on AIX).

States (AIX only)

State Severity Condition Escalation
UNKNOWN severe $Type == "unknown" acknowledgement
PERMANENT alarm $Type == "permanent" acknowledgement
TEMPORARY warning $Type == "temporary" acknowledgement
INFORMATION info $Type == "informational" acknowledgement
PENDING info $Type == "pending" acknowledgement
PERFORMANCE info $Type == "performance" acknowledgement


Filesystem Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64, Windows

Constants

Constant Description Value
FS_LOW Indicating low free space 10
FS_VERY_LOW Indicating very low free space 5
FS_NEARLY_FULL Indicating the filesystem is nearly full 2
FS_FULL Indicating the filesystem is full 0

States (Linux, Solaris, Tru64)

State Severity Condition Escalation
FULL critical $pct_free == $FS_FULL
NEARLY_FULL severe $pct_free < $FS_NEARLY_FULL
VERY_LOW alarm $pct_free < $FS_VERY_LOW
LOW warning $pct_free < $FS_LOW
SUFFICIENT normal

States (AIX only)

State Severity Condition Escalation
FULL critical $pct_free == $FS_FULL
NO_INODES critical $pct_free_inodes == $FS_FULL
NEARLY_FULL severe $pct_free < $FS_NEARLY_FULL
FEW_INODES severe $pct_free_inodes < $FS_NEARLY_FULL
VERY_LOW alarm $pct_free < $FS_VERY_LOW
VLOW_INODES alarm $pct_free_inodes < $FS_VERY_LOW
LOW warning $pct_free < $FS_LOW
LOW_INODES warning $pct_free_inodes < $FS_LOW
SUFFICIENT normal
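
The AIX table interleaves free-space and free-inode checks, so the order of the rows matters. The Python sketch below walks the rows as an ordered rule list, assuming first-match-wins evaluation (an assumption, not something this page states). The names AIX_RULES and filesystem_state are illustrative only.

```python
FS_LOW, FS_VERY_LOW, FS_NEARLY_FULL, FS_FULL = 10, 5, 2, 0

# (state, severity, condition) in the same order as the AIX table.
AIX_RULES = [
    ("FULL", "critical", lambda free, inodes: free == FS_FULL),
    ("NO_INODES", "critical", lambda free, inodes: inodes == FS_FULL),
    ("NEARLY_FULL", "severe", lambda free, inodes: free < FS_NEARLY_FULL),
    ("FEW_INODES", "severe", lambda free, inodes: inodes < FS_NEARLY_FULL),
    ("VERY_LOW", "alarm", lambda free, inodes: free < FS_VERY_LOW),
    ("VLOW_INODES", "alarm", lambda free, inodes: inodes < FS_VERY_LOW),
    ("LOW", "warning", lambda free, inodes: free < FS_LOW),
    ("LOW_INODES", "warning", lambda free, inodes: inodes < FS_LOW),
]


def filesystem_state(pct_free, pct_free_inodes):
    """Return (state, severity) for the first rule that matches."""
    for state, severity, matches in AIX_RULES:
        if matches(pct_free, pct_free_inodes):
            return state, severity
    return "SUFFICIENT", "normal"
```

With this ordering, a filesystem with 8% free space but only 1% free inodes is reported as FEW_INODES (severe) rather than LOW (warning), because the more severe inode rule is reached first.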


Paging Rate Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64

Constants

Constant Description Value
OVER_PAGING Too many page ins or outs per second 10

States

State Severity Condition Escalation
BUSY normal $pgins_per_sec >= $OVER_PAGING || $pgouts_per_sec >= $OVER_PAGING alarm after 62s
ACCEPTABLE normal


Physical Memory Sentry

Availability
Linux, Solaris, Windows

Constants

Constant Description Value
RESTIME_LONG Very long residency time 600
RESTIME_OK Acceptable residency time (ms) 40
RESTIME_PROB Indicating residency time is too short 20

States

State Severity Condition Escalation
RAM_IDLE normal $residency_time >= $RESTIME_LONG
RAM_OK normal $residency_time > $RESTIME_OK
RAM_WARN warning $residency_time > $RESTIME_PROB
RAM_PROB alarm


Swap Space Sentry

Availability
AIX, HPUX, Linux, Solaris, Tru64, Windows

Constants

Constant Description Value
SWAP_LOW Low percent free swap space 15
SWAP_VERY_LOW Very low percent free swap space 10

States

State Severity Condition Escalation
VERY_LOW alarm $swap_pct_free <= $SWAP_VERY_LOW
LOW warning $swap_pct_free <= $SWAP_LOW
OK normal


Collisions Sentry

Availability
AIX, Linux, SCO, Solaris, Tru64

Constants

Constant Description Value
NET_COLL_WARN Indicating many collisions 15
NET_COLL_PROB Indicating excessive collisions 30
NET_WORKING Fewer transfers than this indicates the network is under-utilised 50

States

State Severity Condition Escalation
VERY_BUSY alarm $collision_pct >= $NET_COLL_PROB && $packets_out > $NET_WORKING
BUSY warning $collision_pct >= $NET_COLL_WARN && $packets_out > $NET_WORKING
OK normal


Drops Sentry

Availability
AIX, Linux, Windows

Constants

Constant Description Value
DROP_PROBLEM Indicating incoming packet drop rate problem 1

States

State Severity Condition Escalation
NO_DROPS normal $drop_in_rate == 0
PROB_DROPS warning $drop_in_rate < $DROP_PROBLEM
EXCESS_DROPS alarm $drop_in_rate >= $DROP_PROBLEM


Errors Sentry

Availability
AIX, Linux, SCO, Solaris, Tru64, Windows

Constants

Constant Description Value
NET_ERROR_OK Acceptable number of errors 0
NET_ERROR_PROB Unacceptable number of errors 0.05

States

State Severity Condition Escalation
EXCES_ERRORS alarm $error_rate >= $NET_ERROR_PROB
PROB_ERRORS warning $error_rate > $NET_ERROR_OK
EXCESS_DROPS alarm


Printers Sentry

Availability
Linux, Solaris, Windows

States (Solaris only)

State Severity Condition Escalation
NO_PAPER alarm $status == "faulted" && $reason == "paper"
FAULTED alarm $status == "faulted"
UNKNOWN warning $status == "unknown"
IDLE normal $status == "idle"
PRINTING normal $status == "printing"
WAITING normal $status == "waiting"
DISABLED disabled $status == "disabled"


CPU_Usage Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64

Constants

Constant Description Value
CPU_HIGH Percentage CPU usage considered high for a process 10
CPU_PROBLEM Unacceptable percentage CPU usage for a process 50

States

State Severity Condition Escalation
VHIGH_CPU info $cpu_percent >= $CPU_PROBLEM warning after 120s, alarm after 180s
HIGH_CPU normal $cpu_percent >= $CPU_HIGH
OK_CPU disabled delete immediately
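
The two-step escalation for VHIGH_CPU can be sketched as a lookup over elapsed time. This Python example assumes that both timers are measured from the moment the state is entered and that "after 120s" includes exactly 120 seconds; this page does not confirm either detail. The names VHIGH_CPU_ESCALATION and escalated_severity are hypothetical.

```python
# (seconds in state, severity) steps for VHIGH_CPU, taken from the table.
VHIGH_CPU_ESCALATION = [(0, "info"), (120, "warning"), (180, "alarm")]


def escalated_severity(seconds_in_state, steps=VHIGH_CPU_ESCALATION):
    """Return the severity in force after `seconds_in_state` seconds.

    Assumption: steps are sorted by time and each one replaces the
    previous severity once its time has been reached.
    """
    severity = steps[0][1]
    for threshold, name in steps:
        if seconds_in_state >= threshold:
            severity = name
    return severity
```

A process that stays at or above CPU_PROBLEM therefore starts at info, and under these assumptions is raised to warning and then alarm if it does not drop back.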


MEM_Usage Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64

Constants

Constant Description Value
MEMORY_HIGH Percentage memory usage considered high for a process 10
MEMORY_PROBLEM Unacceptable percentage memory usage for a process 50

States

State Severity Condition Escalation
VHIGH_MEMORY info $mem_percent >= $MEMORY_PROBLEM warning after 120s, alarm after 180s
HIGH_MEMORY normal $mem_percent >= $MEMORY_HIGH
OK_MEMORY disabled delete immediately


Processes Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64

States (AIX only)

State Severity Condition Escalation
DOWN alarm $count == 0
UP normal


BadSU Sentry

Availability
Linux

States

State Severity Condition Escalation
Violation alarm $Count > 1 acknowledgment
Report info $Count == 1 acknowledgment


Services Sentry

Availability
AIX, HPUX, Linux, SCO, Solaris, Tru64, Windows

States (Linux only)

State Severity Condition Escalation
Confused info $Status == "Off" && $PID != "-1"
Unconfigured info $Status == "Unconfigured"
Off normal $Status == "Off"
Stopped warning $PID == -1
Running normal $PID != -1
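
The Linux states overlap (a service that is "Off" with a live PID also satisfies the plain "Off" condition), so the row order decides the result. The Python sketch below assumes first-match-wins evaluation and treats the PID as an integer, with -1 meaning no process; both are assumptions for illustration. The function name and the "On" status value used in the usage note are hypothetical.

```python
def linux_service_state(status, pid):
    """Return the Linux Services state name for one service.

    Assumption: rows are tested in table order and the first match wins.
    """
    if status == "Off" and pid != -1:
        return "Confused"      # configured off, yet a process is running
    if status == "Unconfigured":
        return "Unconfigured"
    if status == "Off":
        return "Off"
    if pid == -1:
        return "Stopped"       # expected to run, but no process found
    return "Running"
```

For example, linux_service_state("Off", 1234) gives Confused, while a service with a status other than "Off" or "Unconfigured" (written here as "On") and a PID of -1 gives Stopped.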

States (Solaris, Tru64)

State Severity Condition Escalation
DOWN warning $count == 0
UP normal
UNDEFINED¹ normal

¹ This state is not used for monitoring; it exists only as a placeholder for actions.

States (AIX only)

State Severity Condition Escalation
INACTIVE disabled $status == "inoperative"
ACTIVE normal