Sentinel3G Concepts
Revision as of 06:50, 14 June 2013 Mike (Talk | contribs) (→Actions) Next diff → |
Revision as of 06:50, 14 June 2013
This section explains the main concepts underlying Sentinel3G.
How Sentinel3G Works
Sentinel3G has four main functions:
- Collect data
- Detect events
- Respond to events
- Present system overview
Sentries monitor resources or processes such as devices, subsystems or applications.
Agents collect data about sentries on each host on behalf of the host monitor. An event occurs when a sentry being monitored changes state. Typically, events are diagnosed by a host monitor from data supplied by agents. The state change could result from a single raw value crossing some predetermined threshold; or it could be a trend derived from the raw data, such as the rate of change of a value. Sentinel3G includes file-watching agents that detect events in system files and log files.
Host monitors report events to the Event Manager on the central host.
The Host Monitor takes some action based on the severity of the event, such as running a predefined command. Persistent problems that can’t be resolved automatically are escalated and passed to operations staff for action. Staff are notified of events by a console, and if necessary by some other means such as e-mail.
A console is a kind of ‘heads-up’ display that gives a concise hierarchical view of the current state of the sentries being monitored. It is both a means to alert operators of an event and a means for them to monitor and respond to events.
Consoles can present information in customized views, such as by region, by host, or by function. Different classes of user see an appropriate level of detail: from a broad enterprise-wide summary for managers to fine detail for operators and end-user administrators. Console users can select a predefined view or sort and filter sentry data to help diagnose problems. Reports provide more details about the current state of the sentry. Graphs chart the changes in the value of agent variables and provide a recent history of the state of the sentry.
At any time, a user can see from the console what events are happening and how serious they are. If reports have been defined, the operator can choose to view them. If there are predefined actions attached to the sentry the user can choose one to run.
Related information about an application or system component (basically its sentries, agents, events and responses) is grouped into a knowledge base. Sentinel3G includes a UNIX/Linux knowledge base. Other knowledge bases (such as Oracle, Windows NT/2000/XP, Network Services) are available as add-ons.
Because events are detected on each local host, only state transitions and not raw data need to be reported to the central host, so data traffic is minimal. This means that Sentinel3G itself adds little extra load to your system, even when monitoring large networks.
The Console
The console is the primary user interface to Sentinel3G, and is the most common way for Sentinel3G to report events and for users to monitor sentries and respond to events. A console displays a series of sentries as icons, which are grouped hierarchically into folders. Different users may have different console views, and different privileges controlling what they are able to see and do.
Day-to-day console operations are described in Monitoring Sentries From the Console.
Overlays and Indicator Icons
Each sentry and folder is represented on the console by an icon. Overlays are small icons that modify the appearance of the main icon to show:
- the type of the sentry or folder
- an indicator of the current state or severity of the sentry, or the highest severity of a sentry in a folder
- the value of a sentry’s agent variable, represented by a small graph or chart
This topic lists all of the overlay icons by type and gives a brief description.
Overlays that represent the type of a sentry or folder
Icon | Where | Description |
 | Bottom left | This object is a user-defined folder, containing sentries and possibly other sub-folders. |
 | Bottom left | This is a locked or system folder; its contents can be modified but the folder itself can’t be removed as it is required by Sentinel3G. |
 | Top right | This is an information-only sentry. It has no states. It gives information through its console text and property sheet. |
Indicators that represent a sentry’s state
If an indicator is not specified for a state, the default indicator specified in the sentry (thermometer or pie chart) will be used. See Indicators that represent data values from a sentry’s variable.
If no overlay is specified in the sentry or state details, the default overlay icon and color for the current severity is used. See Indicators that represent a sentry’s severity.
Icon | Where | Description |
 | Bottom right | This sentry is requesting acknowledgement from an operator before changing to another state. |
 | Top right | This sentry represents a service that is running. |
 | Top right | This sentry represents a service that is not running. Check the console text, notes, or property sheet to find out why. |
 | Top left | Notification for this sentry has been disabled. |
Indicators that represent a sentry’s severity
If an indicator is not specified for a given sentry, the default indicator representing the current severity is used. The color of the overlay is determined by the severity of the sentry.
The severity of a folder is the maximum severity of all the sentries and sub-folders it contains.
Icon | Color | Severity | Description |
 | | Wait | Sentry is starting |
 | grey | Disabled | Sentry is in a state where data is not being returned. Examples: the Host Monitor is down, or the resource itself is disabled. |
 | grey | Down | Sentry is reporting that a service is not running |
 | | Normal | Sentry is indicating that there are no problems |
 | blue | Information | Sentry is reporting matters of interest |
 | orange | Warning | Sentry is reporting a potential problem |
 | red | Alarm | Sentry has detected a serious problem that should be investigated as soon as possible |
 | red, flashing | Severe | Sentry has detected a very serious problem that must be investigated now |
 | magenta, flashing | Critical | Sentry has detected an extremely serious problem affecting the network or a key application, system, or service. Immediate action is needed. |
Indicators that represent data values from a sentry’s variable
Two types of overlay icon, called indicators, can represent actual data from a variable.
A percentage value is mapped to either a small pie chart or a thermometer, in increments of at least 10 percent. This gives an immediate indication of what the data value is. The type of overlay (pie chart or thermometer) may be specified for each sentry.
The amount of the ‘pie’ that is filled in or the height of the thermometer’s filled-in area gives a rough indication of the quantity being reported. Here are some examples:
Indicator Type | 0% | 30% | 50% | 80% | 100% |
Pie Chart | |||||
Thermometer |
Sentries and States
A sentry is an individual object or resource that is being monitored through Sentinel3G. Some examples:
- CPU usage on host titanic
- free disk space on filesystem /usr2
- run queue length on host lusitania
- network printer lusitania attached to host endurance
Sentries are grouped into classes and are represented on the console as icons. Each sentry has an agent (or possibly more than one agent) that collects data on its behalf about the resource or object being monitored. The data determines what state the sentry is in at any time.
An information-only sentry has no states attached to it, but simply provides useful status information to operators in the form of console text or via its property sheet.
You can maintain most things about a sentry from the console, including its constants, actions, agent, and variables. For example, to configure a sentry’s states, just select the sentry and then select Configure > States.
States
A sentry’s state represents its current operating status or condition. The entry condition for each state is evaluated in turn until one evaluates to true. Most sentries have a normal state, indicating that the sentry is operating satisfactorily and requires no action, and a number of other abnormal states of increasing severity. For example, a simple sentry that monitors a service may have only a couple of states showing whether the service is running or not running. A sentry that monitors a resource such as disk space or memory may have several states whose severity increases as the availability of the resource decreases.
For each state you can:
- modify the appearance of the sentry on the console by changing the main icon or adding an overlay icon
- provide a range of options and information from the console to help operators resolve problems
- specify background actions such as increased data logging and automatic responses.
A sentry does not have to have a state for every severity level. You can define more than one state for the same severity. Although it is possible to define a large number of states, representing small changes in the sentry, it’s better to have a minimum number of states corresponding to real differences in urgency or severity.
Events
An event is an external incident or condition on a particular host or in a particular application or device that is detected by Sentinel3G and passed to the Event Manager for action. In simple terms, an event is a condition that causes a sentry to move from one state to another state.
Entry condition
The entry condition is a TCL expression made up of any combination of agent variables, constants, text strings, numbers, history variables, boolean values, and TCL functions. Typically an entry condition tests the value of an agent variable against a predefined constant or threshold. Some examples:
- comparing a number to an absolute value: $Count > 1
- comparing a string to an absolute value: $Status == "Unconfigured"
- comparing a variable to a constant: $pct_free < $LOW
- compound expression: $Status == "Off" && $PID != -1
The entry conditions should cover all possible values returned by the agent. If none of the entry conditions is true, the sentry is put in undefined state. If a sentry is in Failed state, it indicates a problem with the agent (usually that it failed to start or has never returned any valid data). If a state’s entry condition is left blank it always evaluates to true.
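The first-match rule can be modelled outside Sentinel3G. The following is a minimal shell sketch, not Sentinel3G code: real entry conditions are TCL expressions evaluated by the host monitor. The thresholds LOW and VERY_LOW and the function name evaluate_state are assumptions made for illustration; the state names are borrowed from the disk-space example later on this page.

```sh
#!/bin/sh
# Hypothetical model of first-match state evaluation for a disk-space sentry.
# pct_free stands in for an agent variable; LOW and VERY_LOW for sentry constants.
LOW=20
VERY_LOW=5

# Print the name of the first state whose entry condition holds.
evaluate_state() {
    pct_free=$1
    if [ "$pct_free" -lt "$VERY_LOW" ]; then
        echo very_low
    elif [ "$pct_free" -lt "$LOW" ]; then
        echo low
    elif [ "$pct_free" -le 100 ]; then
        echo sufficient
    else
        echo undefined   # no entry condition matched
    fi
}

evaluate_state 3
evaluate_state 12
evaluate_state 60
evaluate_state 250
```

Because the conditions are tested in order, the most specific one (very_low) has to come before the broader one (low). A value that no condition covers, such as 250, falls through to undefined, which is why the conditions should cover every value the agent can return.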
Copying states
When you add a sentry, you can choose to copy the states of another sentry. If the states for the new sentry need to be similar but not identical, you can first copy then edit them. Changes to the states of the new sentry will not affect the original sentry.
Severity
Each state that a sentry can be in has a severity level, representing how serious the event is. When you define each state, the standard severity levels are listed in order of increasing severity from normal to critical.
The severity determines how the sentry is displayed on the console—its color, and if it has an indicator icon, the color of the indicator and whether it flashes. The severity is also used for notification. A notification message will be sent if the severity of a sentry is greater than or equal to either:
- the global NotifySeverity setting
- the notification level for that sentry
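The threshold comparison can be sketched in shell. This is an illustration only, under two assumptions: the ranking of severities from normal to critical follows the order given above, and a sentry-level setting, when present, is taken to replace the global one (an interpretation of the SentryC case described under Notification). The function names severity_rank and should_notify are hypothetical.

```sh
#!/bin/sh
# Map a severity name to a rank; the ordering normal..critical follows the text.
severity_rank() {
    case $1 in
        normal)      echo 0 ;;
        information) echo 1 ;;
        warning)     echo 2 ;;
        alarm)       echo 3 ;;
        severe)      echo 4 ;;
        critical)    echo 5 ;;
        *)           echo -1 ;;   # wait, disabled, down: not ranked in this sketch
    esac
}

# should_notify SEVERITY GLOBAL_LEVEL [SENTRY_LEVEL]
# Prints "yes" or "no". A sentry-level threshold, if given, is assumed
# to take the place of the global one.
should_notify() {
    threshold=${3:-$2}
    if [ "$(severity_rank "$1")" -ge "$(severity_rank "$threshold")" ]; then
        echo yes
    else
        echo no
    fi
}

should_notify alarm severe            # global threshold only
should_notify alarm severe alarm      # sentry threshold lowered to alarm
should_notify severe severe critical  # sentry threshold raised to critical
```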
Notes about severities
disabled is a special severity that can be used to indicate when a sentry is ‘down’ or otherwise unavailable, but doesn’t require attention. Examples:
- A device that has been taken offline can be put into a state whose severity is disabled, with console text explaining that it is undergoing maintenance.
- When a group of sentries is not working because of a problem with another sentry, there is no need to have all the sentries showing an alarm over the same problem. For example when Apache or Squid is down, the “status” sentry goes into alarm, but the other (mainly informational) sentries go into a disabled state.
information severity shows operators that the sentry has some useful information to report. This can be used as a state above normal state, where there is no problem serious enough to require going into a warning state or higher.
Note that this is different from an ‘information-only’ sentry, which has no states and only exists to provide information.
Expressions
TCL expressions are used when defining state conditions, console text, and variables.
For example, state conditions include an expression which, when evaluated, returns true or false to indicate whether the sentry is currently in that state.
Expressions are written using TCL syntax and can refer to any variables belonging to the sentry’s primary agent or secondary agents. Normally variables are prefixed with "$". However in console text, variables may instead be prefixed with "&" which displays the variable in a formatted form, including any units.
Example:
Disk $disk I/O rate: $io_rate => Disk hd2 I/O rate: 145.7
Disk $disk I/O rate: &io_rate => Disk hd2 I/O rate: 145.7MB/sec
The following tables list other internal variables that are also available for use in expressions.
Variable | Description |
$Sentry | The name of the sentry |
$Class | The name of the sentry's class (aka folder) |
$Host | The sentry's host |
$Instance | The sentry's instance (if any) |
$Group | The name of the instance group (if any) |
$State | The current state that the sentry is in |
$Since | The time when the sentry last changed state |
$Severity | The current severity of the sentry |
$PrevState | The previous state of the sentry |
$Agent | The name of the primary agent |
$PollTime | The polltime of the primary agent in seconds (polled agents only) |
Table 1 — Internal variables available in sentry and state expressions
Variable | Description |
$Agent | The name of the agent |
$PollTime | The polltime of the agent in seconds (polled agents only) |
$Instance | The agent's instance (if any) |
$data | The value of the variable as received from the agent (raw variables only) |
Table 2 — Internal variables available in raw and derived variable expressions
Actions
Actions are predefined responses associated with a sentry that may be invoked by an operator from the console. Each action is a command that is run on the same host as the host monitor. Actions may be associated with a particular state or may be available at any time.
There are two types:
- An action simply runs the command, and is intended to correct a problem. Example: starting a service when it is stopped.
- A report displays the command’s output on the screen, usually in a browser or pager window, and is intended to help the operator diagnose the problem.
You can design a single action to work both on selected instances of a multi-instance sentry and on every instance in a selected parent folder. For example, you can set up an action so that the output for every selected instance is combined into one report.
Tasks that don’t require any action or judgement by an operator and can safely be run automatically are better implemented as responses. Data is passed to an action from the host monitor either by being written to the action’s STDIN, or, if the flag Uses agent data: is set to yes, through the environment variables $Sentry, $Host, and $Action. For multi-instance sentries you can refer to a specific named instance or use $Instance, which contains the instance name of the primary agent.
Note: History data and functions and the & <varname> syntax, which are available in state conditions and console text, cannot be used in an action. To pass the value of a history function, use a derived variable.
If you wish to format a value returned by an agent you must do it manually in the command.
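As a rough illustration of formatting a value by hand, the shell sketch below reproduces the formatted output from the Expressions example using printf. The variable names disk and io_rate and the MB/sec unit are taken from that example; the function name format_io_rate and the default values are assumptions standing in for data that would reach a real action through the environment.

```sh
#!/bin/sh
# Stand-in values; in a real action these would come from the environment
# when "Uses agent data?" is set to yes.
disk=${disk:-hd2}
io_rate=${io_rate:-145.7}

# Format the raw number to one decimal place and append the unit by hand.
format_io_rate() {
    LC_ALL=C printf 'Disk %s I/O rate: %.1fMB/sec\n' "$1" "$2"
}

format_io_rate "$disk" "$io_rate"
```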
Examples: defining reports
Example 1 shows how to define a simple report for a single-instance sentry, without using any agent variables. When the report is run it will display in a browser window the name of this action (‘Sentry Details Report 1’), the date, the name of the sentry, and the host it runs on.
Action | Sentry Details Report 1 |
Type | report |
Command | echo -n "Report '$Action' "; date; echo " Sentry: $Sentry"; echo " Host: $Host" |
Display command | browser |
Uses agent data? | no |
Reads from STDIN? | (N/A) |
Export to parent? | no |
In example 2 the agent variables associated with the sentry are exported to the environment (Uses agent data? yes) so that they can be used in the command.
Action | Sentry Details Report 2 |
Type | report |
Command | echo "Free space on $Filesystem = $pct_free%" |
Display command | browser |
Uses agent data? | yes |
Reads from STDIN? | no |
Export to parent? | no |
When you select a filesystem from the console and run the action, the report will show the free space on that filesystem. If you select multiple filesystems, the command will be run once for each instance, and the output window will show one row for each filesystem.
Example 3 demonstrates another way to make data available to an action, this time by reading from STDIN. This passes any agent data to the action in Functional Database format (a plain-text table, with rows separated by a newline and fields separated by a tab).
Action | Sentry Details Report 3 |
Type | report |
Command | cat - |
Display command | db_scroll |
Uses agent data? | yes |
Reads from STDIN? | yes |
Export to parent? | no |
When you select a filesystem from the console and run the action, the report will show the raw database row containing the filesystem variables. To read this in a script, you would then need to use db_readrow, a Functional Toolset program.
Use this option if you are familiar with the Functional Toolset and wish to use it to manipulate the data. With this method, unlike the previous example, the command is only run once. The database rows are accumulated before piping them to the Command. Try selecting multiple filesystems and running the action. Note that there is one header row and multiple data rows.
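For readers who want a feel for the STDIN route without the Functional Toolset, here is a minimal sketch that parses a tab-separated table with awk. It assumes exactly the layout described above (one header row of field names, then one row per instance); a real Functional Database stream may differ in detail, and db_readrow remains the documented way to read it. The function name report_free_space and the sample rows are invented for illustration.

```sh
#!/bin/sh
# Summarise tab-separated rows read from STDIN.
# The first line is assumed to be the header row of field names.
report_free_space() {
    awk -F '\t' '
        NR == 1 {
            # Locate the columns by name so field order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        { printf "Free space on %s = %s%%\n", $(col["Filesystem"]), $(col["pct_free"]) }
    '
}

# Sample input standing in for what the host monitor would write to STDIN.
printf 'Filesystem\tpct_free\n/usr2\t12\n/home\t47\n' | report_free_space
```

The command runs once and sees every selected instance, so it can produce a combined report rather than one output per instance.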
In Example 4, you export the action to the parent folder (Export to parent? yes). This makes the action available from the context menu when the operator clicks on the folder background (that is, no sentries are selected), or on the parent class folder.
Action | Sentry Details Report 4 |
Type | report |
Command | echo "Free space on $Filesystem = $pct_free%" |
Display command | browser |
Uses agent data? | yes |
Reads from STDIN? | yes |
Export to parent? | yes |
When you run this action by clicking on the background of the folder or on the parent class folder, it is the same as selecting all instances and then running the action. If the action were configured on a single-instance sentry, it is the same as selecting that single sentry and running the action.
Example: defining an action
Example 5 runs a command to stop the service represented by this sentry.
- This is an action and not a report, so there is no Display command.
- The command, called system_service, could cause system problems if run incorrectly, so when run from a shell it requires root privileges. Therefore Run as user is set to root so that when it is run from within Sentinel3G it has the necessary root privileges.
- Access role is set to Manager so that only Sentinel3G users with the Manager role can run the action.
- Authenticate=yes means that the user’s password must be entered. This is a further security measure to ensure that the action cannot be run by an unauthorized person from the Sentinel3G Manager’s workstation.
- This action is only useful if the service is really running, so In state(s) is set so that the action is only presented to the operator if the sentry is in Running state or Confused state (which means the service is turned off but still running).
- Uses agent data? is set to yes so that the command can obtain the variable containing the name of the service.
Action | Stop service |
Type | action |
Command | system_service $Filename stop |
Access role | Manager |
Authenticate | yes |
Run as user | root |
In state(s) | Confused Running |
Uses agent data? | yes |
Responses
Responses are commands that are run automatically by the Host Monitor when a sentry is in a particular state. You can define a series of responses for each state that are tailored to the severity of the problem.
Each response may run immediately, or there may be a waiting period after the sentry first enters this state or after the running of a previous response. Figure 4 shows an example of the full set of responses defined for a sentry while it is in warning state.
Each response period is cumulative. In other words the period for Response #2 is counted from the end of the period for Response #1. Example: Response #1 is defined to go to a new severity of warning after 120 seconds. Response #2 is defined to notify after 60 seconds, which will be 180 seconds after the sentry entered this state.
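The cumulative timing is simple arithmetic, sketched here in shell. The function name response_offsets is hypothetical; its arguments are the individual response periods in the order the responses are defined.

```sh
#!/bin/sh
# Print when each response fires, in seconds after the sentry entered the state.
# Arguments are the individual response periods, in order.
response_offsets() {
    total=0
    n=1
    for period in "$@"; do
        total=$((total + period))
        echo "Response #$n fires at ${total}s"
        n=$((n + 1))
    done
}

response_offsets 120 60
```

With the periods from the example (120 and 60 seconds), the second response fires 180 seconds after the sentry entered the state.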
The response Command can attempt to remedy a situation. If successful it will typically return the sentry to a normal state. If the Command does not succeed, you may choose to leave the sentry in that state, and specify a later response to run another command or to notify someone.
Another possible response is to force an agent to be polled at the end of the response period. This is called ‘firing’ the agent. You can fire the primary agent to refresh the variables used by the sentry, or fire another agent to collect additional data.
Where a sentry experiences occasional temporary situations which usually correct themselves quickly, you may not want to take action or be notified unless the sentry has been in that state for some minimum period.
If a sentry changes state while it is waiting to process a response (that is, before the end of the waiting period), then all responses for this state are cancelled, and any responses for the new state are started.
Example: as free disk space in a filesystem reaches a dangerously low level, Sentinel3G can run a series of commands such as:
- writing to currently logged-in users asking them to remove surplus files
- archiving files to an offline storage device
- removing files deemed expendable, such as files named core and *.o.
Any helpful task that can safely be run without prior checking can be set up as an automatic response to an event. Tasks that require some action or judgement by an operator are better implemented as actions.
Escalation
Another way to respond to an alert is simply to wait for a while to see if the problem corrects itself, then to change to another state at the end of that period.
For example, a sentry may be defined to wait up to 300 seconds in warning state, then to change to alarm state. The change of state may depend on manual confirmation from an operator (Acknowledgement) or it may happen automatically (Escalation).
If the problem is normally transient and self-correcting, you could put the sentry into a warning state for a few minutes. At this point the appearance of the sentry is simply a passive signal that the sentry is not in its normal state. If the sentry is still in warning state at the end of this period, it indicates that the problem is unlikely to resolve itself. In this case you could change the sentry to a more severe state with its own set of responses.
In other cases you might return the sentry to a normal state if no other events have occurred by the end of the period. For example, a warning message appearing in a system log file may indicate a potential performance problem, but if no other messages are logged in the next few minutes it may be safe to return the sentry to normal state.
Another use for escalation is to “chain together” several responses by splitting them over two states. Each state has a maximum of three responses.
Note that it may take several seconds for the escalation to be processed at the end of the waiting period.
Acknowledgement
A sentry may request acknowledgement from an operator before changing to another state. This is usually done to confirm that an operator has been made aware of a probable “one-off” incident before returning the sentry to normal state. For example, if the Bad_SU sentry detects a single failed attempt to gain root privileges, it remains in Report state until:
- It receives acknowledgement from an operator and returns to its normal state
- It detects another failed su attempt and goes to Violation state
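The Bad_SU behaviour amounts to a small state machine, which the shell sketch below models. The state names Report and Violation come from the description above; the name Normal for the normal state, the event labels failed_su and ack, and the function next_state are assumptions made for illustration and are not Sentinel3G identifiers.

```sh
#!/bin/sh
# next_state CURRENT EVENT: print the state that follows.
# Events "failed_su" and "ack" and the state name "Normal" are illustrative labels.
next_state() {
    case "$1:$2" in
        Normal:failed_su) echo Report ;;      # first failed attempt
        Report:ack)       echo Normal ;;      # operator acknowledged
        Report:failed_su) echo Violation ;;   # another failed attempt
        *)                echo "$1" ;;        # anything else: stay in the same state
    esac
}

state=Normal
for event in failed_su ack failed_su failed_su; do
    state=$(next_state "$state" "$event")
    echo "$event -> $state"
done
```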
Prompting for acknowledgement verifies that an operator was made aware of the condition at the time, which can be useful for audit or training purposes. You should provide monitoring notes to help operators understand what their options are when the sentry is in this state, and what will happen next if they acknowledge the alert.
If a sentry is waiting for acknowledgement, an overlay icon appears at the bottom right of its icon.
Notification
Sentinel3G can notify a list of staff by e-mail when an event is detected. This is a useful way to alert staff who do not normally run or are not currently running a console. There are three layers or types of notification:
- Global notification is triggered when any sentry goes into a state at a specified severity level or higher. This would normally be used only for the most serious alerts, to avoid staff being flooded with messages about routine events. An example would be to prompt at least one person in the operations group to check the console to find out more about the problem. Global notification lets you specify a blanket notification policy in one place rather than having to set it for every sentry.
- Sentry-level notification occurs when a particular sentry goes into a state at a specified severity level or higher. For example, you can set global notification to occur when any sentry goes into a state whose severity is severe or critical, but override that for a particular sentry so that notification is also sent when that sentry goes into a state whose severity is alarm. Sentry-level notification lets you supplement the global notification policy by changing the notification level for selected sentries.
Note that operators can disable notification for selected sentries from the console.
- State-level notification can be specified as one of the predefined responses when a sentry goes into a particular state. This can be used to implement a follow-up response where the first response fails to correct the problem.
Figure 5 shows a scheme that combines global and sentry-level notification. The NotifyLevel setting is set to severe, so global notification will normally be triggered by any sentry that goes into a state whose severity is severe or critical. There are two exceptions to this: SentryB will send a notification message (perhaps to a different list of recipients) if it goes into a state whose severity is alarm or higher; SentryC will send a notification message only if it goes into a state whose severity is critical.
Figure 5 — Example of both global and sentry-level notification
Figure 6 shows an example of state-level notification. This sentry waits for 300 seconds after entering low state, then runs a script to try to fix the problem. If the sentry is still in low state after another 120 seconds, a notification message is sent to recipients in opsgroup.
State | Severity | Response 1 | Response 2 |
sufficient | normal | | |
low | warning | After 300 secs: Command: /usr/local/bin/rmtmpfiles | After 120 secs: Notify: opsgroup |
very_low | alarm | | |
Figure 6 — Example of state-level notification
Global notification is the simplest form to implement as it is set in one place and applies to all sentries. In more complex environments where different people should be notified when different events occur, it may be more appropriate to configure notification at the sentry, instance group or state level.
Agents and Variables
Agents collect data on behalf of sentries. A typical agent works by polling, or running a command at regular intervals. Each time the command runs, its output is stored in a number of variables. These variables are passed to the host monitor to be processed on behalf of sentries, for example to evaluate what state the sentry is in and to display data on the console.
Other types of agent don’t poll but simply wait to receive data, for example from:
- the Logfile agent
- an existing application that has been instrumented through the Host Monitor API to send data for monitoring direct to the Host Monitor.
Primary and secondary agents
Each sentry has one agent, called its primary agent, that supplies most or all of its variables. A sentry can also access variables belonging to other agents, which are called its secondary agents. Variables are simply referred to by name: $pct_free, $count. If a primary agent and a secondary agent both have a variable with the same name, the primary agent’s variable is used.
There is an important difference between primary and secondary agents that you should be aware of. A sentry’s state evaluations are normally done when its primary agent returns data, not when the secondary ones do. This can lead to some unexpected behaviour if the secondary agent data is out of date.
For example: there are two agents. The first agent monitors whether the Staff database is up or down. The other agent monitors whether the Payroll application (which happens to use the Staff database) is up or down. There is a sentry for the application. This sentry has different states to distinguish between the Payroll application being down because the Staff database is down, and the application being down for another reason.
You would configure the Staff database agent as a secondary agent so you could use the "is_up" variable that belongs to it. However, if the poll times of the two agents are not the same (and they usually won't be) there is a potential problem. You can have a situation where the application agent reports that the Payroll application is down because it has detected that the Staff database is down, but the database agent hasn’t had its poll yet, and still ‘thinks’ the Staff database is up.
The solution is to ‘trigger’ the sentry, which forces its state to be reevaluated when the secondary agent returns new data. The effect is to force the primary and secondary agents to synchronize their polling.
Discovery program
This is an optional command that is run before the agent starts. Its job is to return an exit status of true or false based on the existence or status of a resource. If the discovery program returns false, this agent and its associated sentries will not be started. This means the same set of KBs can be installed on several servers, and an agent on a particular server can be switched off if it ‘discovers’ that the resource it monitors is not present.
Here are two examples:
- The discovery program for an Oracle agent checks whether Oracle is installed on a server by testing for the existence of a particular executable or directory. If Oracle is not installed on that server, the Oracle monitoring agent is not started.
- A network monitoring agent pings other hosts to check whether the host is up and communications are working. There’s no need to have more than one host sending pings as they should all return the same answer. To make sure only one host tests connectivity, the discovery program tests whether it is running on the Event Manager host with the test [ "$EventHost" = "$HOSTNAME" ]. On every other host the discovery program returns false and the agent doesn’t start.
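The first example can be sketched as a small program. This is only an illustration, assuming that an exit status of 0 means true and anything else means false (the usual shell convention); the function names and the idea of passing the path as an argument are not part of Sentinel3G.

```python
import os


def resource_present(path: str) -> bool:
    """Return True when the executable or directory that marks the resource exists."""
    return os.path.exists(path)


def discovery_exit_status(path: str) -> int:
    """Map the check to a shell-style exit status: 0 = true, 1 = false (assumed convention)."""
    return 0 if resource_present(path) else 1


# A wrapper script would finish with something like:
#     sys.exit(discovery_exit_status("<path that indicates the resource>"))
```

Keeping the check separate from the exit call makes it easy to test the logic without starting an agent.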
Monitoring file updates: the FileInfo agent
Sometimes you may need to monitor when a particular file or files change in some way. For example, you could log when a system file such as the password file has been updated and perhaps generate an alert.
Sentinel3G provides a standard agent that can monitor these events:
- The file has been created or deleted
- Any change to the file (modification time has changed)
- Size has changed
- Ownership has changed
- Access permissions have changed
"Windowing" States Based on Time: the Clock Agent
The Clock agent can be used to stop a sentry from monitoring during particular periods. For example, if you run batch jobs between 11pm and 6am daily that use lots of CPU, you don't want to be notified if the run_queue gets too high as this is expected during these times. So you "window" the monitoring.
Add the Clock agent as a secondary agent to the sentry you wish to window (in our example, Run_Queue).
Add a new state to the sentry, called Not_Monitored, and give it a severity of Disabled. In the Condition field, enter a boolean expression describing the time you want to exclude the sentry from monitoring. In our previous example, where the batch jobs run between 11pm and 6am, this would be:
$Hour >= 23 || $Hour < 6
Make sure that the Not_Monitored state appears at the top of the list of states so that its condition is evaluated first.
If the requirement was to disable monitoring during 11pm - 6am Monday to Friday only, it gets a bit more complicated, because you need to remember that Friday night's batch jobs actually go until 6am on Saturday:
$Hour >= 23 && $DayOfWeek >= 1 && $DayOfWeek < 6 || $Hour < 6 && $DayOfWeek > 1 && $DayOfWeek <= 6
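To check a window condition like this before putting it into a state, it can help to mirror it in a few lines of ordinary code. The sketch below restates the weekday condition above in Python; the function name is made up for illustration, and day_of_week follows the Clock agent convention of Sunday = 0.

```python
def in_batch_window(hour: int, day_of_week: int) -> bool:
    """Mirror of the Monday-to-Friday condition; day_of_week uses Sunday = 0."""
    # 11pm onwards on Monday (1) through Friday (5)
    late_evening = hour >= 23 and 1 <= day_of_week < 6
    # Before 6am on Tuesday (2) through Saturday (6), i.e. the morning after
    early_morning = hour < 6 and 1 < day_of_week <= 6
    return late_evening or early_morning
```

For example, in_batch_window(5, 6) is true (5am Saturday, the tail of Friday night's batch), while in_batch_window(3, 1) is false (3am Monday follows Sunday night, when no batch ran).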
Table 2 lists the variables you can use to window monitoring for a sentry.
Variable | Type | Description |
Day | number | Day of the month (1-31) |
DayName | string | Name of the day of the week, capitalized, e.g. Monday |
DayofWeek | number | Day of the week as a number (Sunday = 0) |
DayofYear | number | Day of the year (1-366) |
Hour | number | Hour of the day (0-23) |
LastDayofMonth | boolean | True when today is the last day of the month |
LastWeekofMonth | boolean | True when within 7 days of the end of the month |
Minute | number | Minute in the hour (0-59) |
Month | number | Month as a number (1-12) |
MonthName | string | Name of the month, capitalized, e.g. June |
Time | clock | Number of seconds since 1st January 1970 GMT |
TimeOfDay | string | Time in the form HH:MM (00:00 - 23:59) |
TimeZone | string | Timezone configured on the system as a string, e.g. GMT |
Week | number | Week of the year (0-52), week begins Sunday |
Year | number | 4 digit year, e.g. 2003 |
Table 2 — Clock agent variables
Process Monitoring: the ProcessInfo Agent
The ProcessInfo agent provides data for a process monitoring console on each host.
The ProcessInfo agent returns data about processes running on the local (Host Monitor) host [see ps(1)]. It is typically used to determine whether a process is running or to monitor its CPU or memory usage. ProcessInfo is a multi-instance agent, whose instances are usually the names of the processes being monitored.
They must be specified in the Instances field of each sentry using this agent.
Processes are matched to instances by doing pattern matches on the Command field (as returned by the ps -efl command). If the Agent data field of an instance is NULL, then an exact string match is performed using the instance name. Otherwise the Agent data is interpreted as an unanchored full regular expression.
Note that if one instance matches more than one process, only details of the first process found are returned. However the count variable is set to the number of matching processes.
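The matching rules can be sketched as follows. This is an approximation only: it uses Python's re module, whose syntax differs in places from the regular expressions the agent uses, and the function name and the list of command strings are invented for the example.

```python
import re
from typing import List, Optional, Tuple


def match_instance(commands: List[str], instance: str,
                   agent_data: Optional[str] = None) -> Tuple[Optional[str], int]:
    """Return (first matching command, number of matching commands)."""
    if agent_data is None:
        # No Agent data: exact string match against the instance name.
        matches = [c for c in commands if c == instance]
    else:
        # Agent data present: treat it as an unanchored regular expression.
        pattern = re.compile(agent_data)
        matches = [c for c in commands if pattern.search(c)]
    first = matches[0] if matches else None
    return first, len(matches)
```

Only the first match supplies the per-process details, while the second element of the result corresponds to the count variable.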
The variables returned by the ProcessInfo agent are:
- command
- The command running (the full name including command line options). Note that it may be truncated if long.
- count
- The number of matching processes found.
- cpu
- The number of CPU seconds used by the process.
- pid
- The numeric process ID.
- ppid
- The numeric parent process ID.
- priority
- The numeric priority at which the process is running.
- size
- The size of the memory image of the process.
- state
- The state of the process (see ps(1)).
- tty
- The controlling terminal of the process.
- user
- The name of the user owning the process.
Agent Classes and Variables
Agents make data available to sentries in the form of variables. The agent class tells Sentinel3G the format and location (e.g. STDOUT, a file name) of the agent data, how to parse it, and how to assign key data to variables.
The format of the agent output, and the way you identify which part of it to assign to a variable, differs depending on the agent class. This topic explains the attributes of each agent class.
API
An external application sends data via the Sentinel3G API. The application must be instrumented to send a string of variable names and their values to the host monitor at certain processing points, such as when a transaction is committed.
The API class is different from other agent classes in that data is ‘pushed’ to the host monitor at intervals decided by the external application, rather than being ‘pulled’ in by Sentinel3G. Therefore you don’t specify a column name when adding a variable. Instead you define one variable for each varname=value pair that is passed in the SENAPIdata command by the external application.
DB
The agent returns data in Functional Database format (a set of one or more records, each containing text fields delimited by tabs and terminating in a newline). Typically the data comprises several fields or one or more whole rows returned as a result of a query on a Functional Database table.
Each column name that you assign to an agent variable is a field name as specified in the Functional Database dictionary entry.
ExitStatus
The agent returns the exit status of the command. This can be used to monitor scheduled processes such as batch jobs and backups where there are a few common exit statuses, each relating to a different error condition. Example: when a backup job fails, the sentry can translate the exit status into a meaningful console message (such as "media change failed" or "error writing to device") and provide appropriate responses and actions.
You don’t need to specify a column name when adding a variable to store the exit status. Instead you define one variable of type raw, leaving the Column field blank. The exit status of the agent command will automatically be assigned to this variable.
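The idea of translating an exit status into a console message can be illustrated outside Sentinel3G. In the sketch below the numeric codes are hypothetical (a real backup tool documents its own), and the function names are invented; only the two message texts come from the example above.

```python
import subprocess

# Hypothetical status codes, for illustration only.
MESSAGES = {
    0: "backup completed",
    2: "media change failed",
    3: "error writing to device",
}


def describe_status(status: int) -> str:
    """Translate an exit status into a message suitable for a console."""
    return MESSAGES.get(status, f"unknown exit status {status}")


def run_agent(argv):
    """Run a command and return (exit status, message)."""
    status = subprocess.run(argv).returncode
    return status, describe_status(status)
```

In Sentinel3G itself the same mapping would be expressed as one state per exit status, each with its own console text.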
LogFile
A convenient way of detecting events in an existing application with minimal intrusion is by monitoring its log file(s) for certain messages. The LogFile agent class allows alarms to be generated based on the contents of log files such as:
- Operating System logs
- Unix/Linux syslog
- Bad login attempts
- COSmanager™ audit trails
- Third-party applications
The agent searches in the file for messages that match a pattern. In the Agent options form you can specify the file name, a select pattern to select records of interest, and an extract pattern for each text string in the record that must be assigned to a variable.
The log file may contain a mixture of messages of different types but typically we are only interested in one type. If you are interested in differently formatted messages you could define one agent per record type.
Agents in the LogFile class generate one or more lines of text output, such as an error message. Table 3 explains how to split the data into patterns or columns. Table 4 explains how to assign each column to a variable.
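The interplay of a select pattern and extract patterns can be sketched like this. It is a rough model, not the agent's implementation: it assumes each extract pattern contributes either its first capture group or, if it has none, the whole match, and the sample log lines in the usage note are invented.

```python
import re
from typing import Dict, Iterable, List


def scan_log(lines: Iterable[str], select: str,
             extracts: List[str]) -> List[Dict[int, str]]:
    """For each line matching `select`, return {column number: extracted text}."""
    select_re = re.compile(select)
    extract_res = [re.compile(p) for p in extracts]
    records = []
    for line in lines:
        if not select_re.search(line):
            continue  # not a record of interest
        record = {}
        for col, rx in enumerate(extract_res, start=1):
            m = rx.search(line)
            if m:
                # Assumption: first group if the pattern has one, else whole match.
                record[col] = m.group(1) if rx.groups else m.group(0)
        records.append(record)
    return records
```

With a select pattern of "Failed password" and extract patterns "for (\S+)" and "from (\S+)", a line such as "sshd[411]: Failed password for mike from 10.0.0.5" would yield column 1 = mike and column 2 = 10.0.0.5.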
SNMPPolled
The agent polls for the results of SNMP ‘Get’ requests. Typically these requests test the current status of a managed object in an SNMP MIB, such as a device or port.
Each column name that you assign to an agent variable must be an object ID as specified in the SNMP MIB.
Text
This is used to filter the output from a command. The agent runs the command, which writes text output to STDOUT. If the output is complex or split over several lines, you can use the Agent options form to filter out extraneous text such as blank lines, header lines, and labels.
Agents in the Text class generate one or more lines of text output, such as a formatted report. Table 3 explains how to split the data into patterns or columns.
Assigning Text and Log File Data to Variables
For agents in the Text and LogFile classes, the data is split into one or more fields, which are identified by number. How the fields are split is determined by the Split data by field in the ‘Agent options’ form, as shown in Table 3:
Split data by | Notes |
column | The line is not split into fields. All variables must be identified by character position on the line. |
whitespace | The line is split into a series of fields separated by whitespace. The first field is column 1, the second is column 2, and so on.
Example: to assign to the variable the characters from the start of the line to the first whitespace character, enter 1 in the Column field. |
tab | The line is split into a series of fields, each separated by a tab. The first field is column 1, the second field is column 2, and so on.
Example: to assign to the variable the characters between the first and second tab, enter 2 in the Column field. |
pattern | Specify in the Pattern line fields of the Agent Options form one or more extract patterns. The first extract pattern is treated as column 1, the second extract pattern column 2, and so on.
Example: to assign to the variable the string that matches the third extract pattern, enter 3 in the Column field. |
Table 3 — How agent data is split into one or more numbered fields
When you add a variable, you specify in the Column field which of these columns to assign to the variable.
For agents where the data is split by pattern, you simply enter the column number <col>. If the data is split by whitespace or tab, you can enter a single column number if the agent returns all data on one line. If the agent returns several lines of data, you can specify a particular line by prefixing the column number with the line number, like this: <line>:<col>.
For agents where the data is split by column, each line is treated as a string of characters. You must specify in the Column field a range of characters in the form c<pos>-<pos>. If the agent returns several lines of data, you can specify a particular line by prefixing the character range with the line number, like this:
<line>:c<pos>-<pos>
Column field | Split data by | Notes |
3 | pattern | the third extract pattern in the output |
2:3 | whitespace | the third field in the second line of output |
c10-40 | column | the tenth to the fortieth characters inclusive |
3:c10-11 | column | the tenth and eleventh characters on the third line of output |
2:c4-end | column | from the fourth to the last character on the second line |
Table 4 — Examples: assigning columns to a variable
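To make the Column notation concrete, here is a small interpreter for the whitespace, tab and column forms shown in Table 4. It is a sketch under assumptions: the function name is invented, a missing line number is taken to mean the first line, and pattern splitting is left out.

```python
import re
from typing import List

# [<line>:]<col>   or   [<line>:]c<pos>-<pos|end>
_SPEC = re.compile(r"^(?:(\d+):)?(?:c(\d+)-(\d+|end)|(\d+))$")


def extract(lines: List[str], spec: str, split_by: str = "whitespace") -> str:
    """Return the part of the agent output selected by a Column value."""
    m = _SPEC.match(spec.replace(" ", ""))
    if m is None:
        raise ValueError(f"unrecognised Column value: {spec!r}")
    line_no, start, end, col = m.groups()
    # Assumption: with no line prefix, the first line of output is used.
    line = lines[int(line_no) - 1] if line_no else lines[0]
    if start is not None:
        # Character range: positions are 1-based and inclusive.
        stop = len(line) if end == "end" else int(end)
        return line[int(start) - 1:stop]
    # Field number: split on tabs, or on any run of whitespace.
    fields = line.split("\t") if split_by == "tab" else line.split()
    return fields[int(col) - 1]
```

For instance, extract(lines, "2:3") returns the third whitespace-separated field on the second line, and extract(lines, "3:c10-11", "column") returns the tenth and eleventh characters of the third line.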
Note: The Agent Options form includes several fields (e.g. Clear pattern, Skip initial lines, Strip initial chars) that allow you to strip unwanted lines and characters from the output before assigning columns to variables. All processing of whitespace, tabs, column numbers, or patterns takes place after these fields have been processed. For example, if Strip initial chars = 6, the column that the variable sees as c1 would actually be the 7th character in the original data (assuming that Skip pattern hasn’t removed even more characters).
Trigger Variables
A trigger variable is used to compare an earlier value of a variable with its current value. Trigger variables ‘remember’ a value from an earlier poll. (The name comes from the way in which the saving of the variable is triggered by a state change.) All other variables are set or recalculated every time the agent returns data, meaning that the previous value is overwritten.
Example: batches of update transactions are added to a spool file once or twice a day. You want to write a sentry that notifies you whenever the spool file’s modification time changes. Using the FileInfo agent, you create a raw variable called mtime to store the current modification time, and a trigger variable called prev_mtime.
The Initial value of prev_mtime is set to $mtime, and the Expression is also set to $mtime.
Next you create a sentry with two states:
- Normal state has prev_mtime in its list of Trigger vars.
- NewTrans state has no Trigger vars. Its Entry condition is "$mtime != $prev_mtime". It has a response to return to normal state after receiving acknowledgement from an operator.
When the file changes, the operating system updates its modification time. The sentry detects that the new value for $mtime is different from $prev_mtime, and changes to NewTrans state. When an operator acknowledges the event, indicating that the new transactions have been noted, the sentry is returned to normal state.
At this point the trigger variable is recomputed, setting $prev_mtime to the new $mtime.
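The sequence above can be traced with a small simulation. The class below is only a model of the behaviour described in this example (the class and method names are invented); it is not how Sentinel3G stores variables internally.

```python
class SpoolFileSentry:
    """Toy model of the Normal / NewTrans sentry with a trigger variable."""

    def __init__(self, mtime: int):
        self.state = "Normal"
        self.mtime = mtime
        self.prev_mtime = mtime  # Initial value of the trigger variable: $mtime

    def poll(self, mtime: int) -> str:
        # Raw variable: overwritten every time the agent returns data.
        self.mtime = mtime
        # Entry condition for NewTrans: $mtime != $prev_mtime
        if self.state == "Normal" and self.mtime != self.prev_mtime:
            self.state = "NewTrans"
        return self.state

    def acknowledge(self) -> None:
        # Operator acknowledgement returns the sentry to Normal, which lists
        # prev_mtime in its Trigger vars, so the value is recomputed here.
        if self.state == "NewTrans":
            self.state = "Normal"
            self.prev_mtime = self.mtime
```

Note that prev_mtime keeps its old value across polls while the sentry sits in NewTrans; it only catches up when the state change back to Normal triggers it.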
History Variables
History variables store the recent values returned by an agent variable. They can be used to generate a realtime graph showing recent changes in the data, and in state conditions, to average out spikes and gaps in the data. You can keep either a set number of values, or keep all values in a set period.
Note that history is suited to fairly short-term analysis. For longer term analysis such as capacity planning, use logged variables.
Using history variables in realtime graphs
You can generate a realtime graph showing recent changes in a variable. If a variable’s history has been saved, and the graph is defined to graph the last N values, it will use the variable’s history to get these values (or as many as there are available).
Using history variables in state conditions
You can use history variables in state conditions to handle exceptions in the data such as ‘spikes’ (high or low values which are transient and do not need to be displayed or acted on), and to calculate an average rate to use where the data is temporarily not being returned from the agent.
Functions for Accessing History Variables
This topic describes the methods or functions that can be performed on history variables. You would typically use these in an expression, either in a state condition or in a derived variable.
History variables can be thought of as an array containing two fields: the value and the time. The array is accessed backwards in time: index 0 is the current value, 1 is the previous value, and so on. It can be written as two arrays: Hval[n] and Htime[n]. You can reference the variable’s history by putting "@" in front of the variable name (normally you refer to a variable by putting a "$" in front of the name, which returns the current value). You access history variables using one of the predefined methods. The TCL syntax is:
[@<hist-var> <method> <optional params>]
Example: [@cpu_idle_hist value 1]
This returns the previous value (index 1) of the history variable cpu_idle_hist. (Note the square brackets around the call to the method). You can use this within an expression or condition:
[@cpu_idle_hist value 1] / 100.0
Table 5 lists the functions available to process history variables.
Function | Description | Examples |
value <index> | Return the value of a particular index of a history variable. Omitting the index will return the most recent element. | [@accesses_kb value] = 0
[@accesses_kb value 14] = 12.5 |
value_at <clock> | The value at a given time (clock format) | [@free_space value_at 1052824642] = 5 |
average | Return the arithmetic mean of a history variable over its entire history | [@response_time average] = 9.5 |
max | Return the maximum history value (of the current values) | [@cpu_usage max] = 17.0 |
min | Return the minimum history value (of the current values) | [@raw_packets_out min] = 0.0 |
earliest_time | The value of Htime[end] where end is the oldest value | [@cpu_usage earliest_time] = 50 |
diff <index> | Return the difference between the most recent element and the element at the specified index. Omitting the index returns the difference between the most and least recent elements. | [@cpu_usage diff 5] = -25.4
[@free_space diff] = 15 |
rate <index> | The elements are averaged over the time between the elements. Whatever the unit of the history variable, the result is always “units per second”. For example, if the history is in “MB”, the result will be in “MB per second”. | [@free_space rate] = 0.6
[@file_size rate 5] = 1.5 |
diff_rate <index> | This is similar to rate, but is used for agents that return a cumulative number. First the difference between the elements is calculated, then this is divided by the time. | [@total_kbs_sent diff_rate] = 1.5
[@cum_cpu_time diff_rate 1] = 0.2 |
count <condition> | Return the number of values that match a simple condition, such as “< 7”. If <condition> is omitted a count of the current number of history values is returned. | [@free_space count "< 150"] = 1
[@cpu_idle_hist count "> 90"] (This returns the number of values whose value is greater than 90.) |
percent <condition> | This is similar to count, but returns the percentage of values meeting the condition. | [@response_time percent "< 2000"] = 9.5 |
Table 5 — Functions for accessing history variables
Notes:
- “end” refers to the oldest index available. You cannot actually use the string “end” in your expressions.
- Generally, if index is not specified, end is assumed.
- Although you can have history on any type of variable, some methods such as average assume that the variable is a number.
- If a function references a variable that does not exist, the Host Monitor log file will display a message and the agent or sentry will not be started.
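The semantics of several of these functions can be modelled with a newest-first list of (value, time) pairs. The class below is an illustrative model only: the names are invented, conditions are passed as Python callables rather than strings such as "< 7", and returning 0.0 when no time has elapsed is an assumption.

```python
from typing import Callable, List, Optional, Tuple


class History:
    """Newest-first (value, time) pairs; index 0 is the current value."""

    def __init__(self, samples: List[Tuple[float, float]]):
        self.samples = list(samples)

    def _index(self, index: Optional[int]) -> int:
        # "end" (the oldest element) is assumed when no index is given.
        return len(self.samples) - 1 if index is None else index

    def value(self, index: int = 0) -> float:
        return self.samples[index][0]

    def average(self) -> float:
        return sum(v for v, _ in self.samples) / len(self.samples)

    def diff(self, index: Optional[int] = None) -> float:
        return self.samples[0][0] - self.samples[self._index(index)][0]

    def diff_rate(self, index: Optional[int] = None) -> float:
        i = self._index(index)
        elapsed = self.samples[0][1] - self.samples[i][1]
        # Assumption: report 0.0 rather than fail when no time has elapsed.
        return self.diff(i) / elapsed if elapsed else 0.0

    def count(self, condition: Optional[Callable[[float], bool]] = None) -> int:
        if condition is None:
            return len(self.samples)
        return sum(1 for v, _ in self.samples if condition(v))

    def percent(self, condition: Callable[[float], bool]) -> float:
        return 100.0 * self.count(condition) / len(self.samples)
```

With samples of 30, 20 and 10 taken at 120, 60 and 0 seconds, diff() gives 20 and diff_rate() gives 20 / 120, i.e. about 0.17 units per second.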
Using Standard TCL Functions
To use a function, enclose it in square brackets. For example, [string range "hello" 0 1] returns “he”. All standard TCL functions are available.
Using variables in functions
Scalar variables are referenced by adding a $ to the front. For example, if you had a string called hostname, you could convert it to uppercase using this command: [string toupper $hostname].
History variables are referenced by adding an @ to the front. History variables cannot be used in standard TCL functions, only in the history functions listed in Table 5. History functions are called slightly differently from scalar functions, in that the variable comes first. For example, if you wanted to calculate the average over a history variable called response_time you would enter:
[@response_time average]
If the function takes parameters, they come after the function name:
[@my_history function <args>]
History variables are indexed in reverse order. The newest element will always be at index 0. Therefore index 1 is the second most recent and index 5 is the sixth most recent element.
Additional scalar functions
Table 6 lists some additional functions. The examples all use numbers, but you can replace any of these with scalar variables (e.g: $my_variable). If you reference a variable that does not exist, the Host Monitor log file will display a message and the agent or sentry will not be started.
Table 6 — Additional TCL functions
Function | Description | Examples |
percent <value> <total> | Divide <value> by <total> and return answer as a percentage.
If <total> is zero, the return value is always zero. | [percent 45 100] = 45.0
[percent 75 150] = 50.0
[percent 100 0] = 0 |
div <v1> <v2> | Safely divide <v1> by <v2>.
If <v2> is zero, the return value is always zero. | [div 5 10] = 0.5
[div 10 0] = 0 |
round <n> <multiple> | Round <n> to the nearest <multiple>. | [round 12345 100] = 12300
[round 1.2345 0.01] = 1.23
[round 1.3 0.5] = 1.5 |
hostname <IP address> | Return the host name for a given IP address | [hostname 99.99.99.99] = www.s3gxyz.com |
clock_to_db <secs> | Convert a time in “clock seconds” to internal date/time (YYYYMMDD.hhmmss). | [clock_to_db 1052824642] = 20030513.121722 |
db_to_clock <date> | Convert internal format date/time to “clock seconds”. | [db_to_clock 20030513.121722] = 1052824642 |
fmt_clock <secs> | Convert “clock seconds” date/time to display format. | [fmt_clock 1052824642] = 13/05/03-12:17
[fmt_clock 0] = 01/01/70-01:00 |
fmt_boolean <val> | Format boolean type for display. | [fmt_boolean 0] = false
[fmt_boolean 1] = true |
fmt_date <date> | Format date type (YYYYMMDD) for display. | [fmt_date 20030513] = 13/05/03 |
fmt_datetime <date> | Format datetime type (YYYYMMDD.hhmmss) for display. | [fmt_datetime 20030513.121722] = 13/05/03-12:17 |
fmt_uptime <secs> | Format uptime (in seconds) for display. | [fmt_uptime 0] = 0 secs
[fmt_uptime 300] = 5.0 mins
[fmt_uptime 4000] = 1.1 hrs
[fmt_uptime 100000] = 1.2 days |
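The arithmetic behind percent, div and round is simple enough to restate. The Python equivalents below are a sketch that reproduces the examples in Table 6; how the real round function treats exact ties is not documented here, so rounding halves upwards is an assumption, and round_to is an invented name.

```python
import math


def percent(value: float, total: float) -> float:
    """value as a percentage of total; zero when total is zero."""
    return 0 if total == 0 else 100.0 * value / total


def div(v1: float, v2: float) -> float:
    """Safe division; zero when the divisor is zero."""
    return 0 if v2 == 0 else v1 / v2


def round_to(n: float, multiple: float) -> float:
    """Round n to the nearest multiple (ties rounded up, by assumption)."""
    steps = math.floor(n / multiple + 0.5)
    return steps * multiple
```

Because these are floating-point operations, a result such as round_to(1.2345, 0.01) may differ from 1.23 in the last few binary digits, so compare with a small tolerance rather than for exact equality.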
Constants and Thresholds
Constants are like variables, but unlike variables are associated with a sentry rather than an agent. They are typically used to set different thresholds for sentries that share the same states. This is useful for monitoring different “sizes” of the same type of resource using different criteria. Constants are by convention in UPPER CASE to differentiate them from variables.
How it works: states refer to constants by name, so the same set of states can be shared between several sentries. Because constants are not shared, you can set each sentry’s constants to different values.
For example, the acceptable minimum amount of free space on a filesystem depends on its size and volatility. 3% may be an acceptable threshold for a fairly static 100 GB filesystem, but dangerously low for a volatile 15 GB filesystem. In this case you could have each sentry sharing the same states, including a state called Very_Low.
The entry condition for this state would test the current value reported by the agent against a constant, called VERY_LOW. The difference is that for the large_filesys sentry, the constant VERY_LOW is set to 3%, while for the small_filesys sentry, the constant VERY_LOW is set to 8%:
sentry | state name (shared) | entry condition for this state | VERY_LOW constant |
small_filesys | Very_Low | $pct_free < $VERY_LOW | 8 |
large_filesys | Very_Low | $pct_free < $VERY_LOW | 3 |
Table 7 — Different thresholds for sentries that share the same states
Constants may also be used as a visual aid on realtime graphs.
Note: The values of constants may also be set in Instance Groups, and these will override the values defined in the sentry but only for sentries in that particular instance group. In fact, this example is probably better implemented using a single sentry with two Instance Groups.
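The relationship between a shared state condition, per-sentry constants and Instance Group overrides can be modelled in a few lines. The names below are illustrative only; the values come from Table 7.

```python
from typing import Dict, Optional

# Per-sentry constants, as in Table 7.
SENTRY_CONSTANTS = {
    "small_filesys": {"VERY_LOW": 8},
    "large_filesys": {"VERY_LOW": 3},
}


def effective_constants(sentry_constants: Dict[str, float],
                        group_overrides: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Instance Group values take precedence over the sentry's own values."""
    merged = dict(sentry_constants)
    merged.update(group_overrides or {})
    return merged


def very_low(pct_free: float, constants: Dict[str, float]) -> bool:
    """The shared entry condition: $pct_free < $VERY_LOW."""
    return pct_free < constants["VERY_LOW"]
```

With 5% free, very_low is true for small_filesys (threshold 8) but false for large_filesys (threshold 3), even though both use the same condition.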
Sentinel Processes and Configuration
Event Manager
The Event Manager is a central process that collects state information from all host monitors and updates the icons and data on the consoles as required.
Host Monitor
The host monitor is the main processing ‘engine’. One host monitor process runs on each Sentinel3G host. The program is responsible for:
- filtering data received from its agents
- detecting when events occur and notifying the Event Manager
- performing logging of variables and state changes
- performing notification and other automatic responses
Knowledge Base
A sentry monitors a particular component or subsystem of your operating system, hardware and applications. A folder is a related set of sentries. Sentries and folders are themselves grouped into knowledge bases.
Several knowledge bases are available for Sentinel3G, including knowledge bases for operating systems, databases, web servers and applications such as COSmanager.
You can add your own custom knowledge bases to hold details of sentries you define yourself.
Host Monitor API
The Host Monitor API can be used to instrument existing applications to send data for monitoring direct to the Host Monitor, rather than having to write a script or program which polls for this data. This is an extremely flexible interface—you just need to tell Sentinel3G what variables to expect, and their types.
Logging
Sentinel3G maintains the following types of log file:
- Each sentry (and optionally each instance) has an associated log file. This records the state changes for the sentry or instance.
- The Event Manager maintains a status log file (EventMgr) and a data log file (EMdata).
- Each Host Monitor maintains separate logs for status changes (HostMon) and for data (HMdata). The data log records all sentry state changes, and the values of agent variables which have been configured to be logged.
All records are time-stamped.
The status logs EventMgr and HostMon can be viewed from the Logs menu on the console.
Note that there is some overlap in the state change logging done by the Host Monitor and the Event Manager. This ensures that state changes are still logged even if the Event Manager is down, and keeps network traffic to a minimum.
Global default settings for logging
To conserve disk space, you can control the amount of data that is logged. Agent variable logging can be varied according to the state of a particular sentry. While a sentry is operating normally we are not interested in the exact values being returned by the agent. Therefore, at lower severity levels there is little need to collect data beyond recording state changes. At higher severity levels such as alarm you may wish to log variable values more often, perhaps once every poll.
In the global Sentinel settings you can specify both a logging frequency (DefLogTime) and the minimum severity level at which it operates (DefLogSeverity).
There is also the option to specify different settings for particular sentries. DefLogTime specifies how often to log data for use in logged data reports. At this interval, the latest data values will be written to disk. DefLogSeverity is the minimum severity at which to start logging data every poll.
For example, if you have specified a log time of 30 minutes and a minimum severity of alarm, under normal conditions Sentinel3G logs a single data point every 30 minutes. When the sentry goes into a state whose severity is alarm or higher, every data point is logged until the severity goes back below the minimum.
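The decision described in this example can be written as a single function. This is a sketch under assumptions: the list of severity names and their order is hypothetical (only alarm is taken from the text), and the parameter names simply echo DefLogTime and DefLogSeverity.

```python
# Hypothetical severity ordering, lowest first; only "alarm" appears in the text above.
SEVERITIES = ["normal", "warning", "alarm", "critical"]


def should_log(severity: str, seconds_since_last_log: float,
               def_log_time: float = 1800,
               def_log_severity: str = "alarm") -> bool:
    """Decide whether the latest data values should be written to the data log."""
    if SEVERITIES.index(severity) >= SEVERITIES.index(def_log_severity):
        return True  # at or above the minimum severity: log every poll
    # Below the minimum severity: log only once per interval.
    return seconds_since_last_log >= def_log_time
```

With the defaults shown (30 minutes, alarm), a sentry in a normal state logs one data point every 1800 seconds, while a sentry in an alarm state logs on every poll.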
Reports on logged data
Reports are provided to extract and summarize data from the data logs, and to graph the value of numerical data. The Service Level Report searches the EMdata log and produces a summary of the amount of time the selected sentries have spent in each state or severity. The Logged Data Report searches the HMdata log for recorded values of particular variables. You choose the variables to display, and a line graph for those variables will be drawn over the chosen period.
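The kind of summary a Service Level Report produces can be illustrated with a short calculation. The format of the EMdata log is not described here, so the sketch below assumes the state changes have already been read into a time-ordered list of (timestamp, state) pairs; the function name is invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def time_in_state(changes: List[Tuple[float, str]], end_time: float) -> Dict[str, float]:
    """Total seconds spent in each state between the first change and end_time."""
    totals: Dict[str, float] = defaultdict(float)
    # Pair each state change with the next one; the last state runs to end_time.
    following = changes[1:] + [(end_time, "")]
    for (start, state), (stop, _) in zip(changes, following):
        totals[state] += stop - start
    return dict(totals)
```

For a sentry that entered Normal at 0, Very_Low at 600 and Normal again at 900, a report ending at 1800 would show 1500 seconds in Normal and 300 seconds in Very_Low.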
Managing log files
The management of logs is integrated with the COSmanager™ audit trail facility, which provides for viewing and cycling (pruning) of log files.
If COSmanager is not installed, an automatically scheduled task such as a cron job should be set up to regularly compress and archive a copy of each log file. Once copies have been archived the original logs can be reset to save disk space.
Access Control via Roles and Capabilities
Each Sentinel3G user has one or more roles. Each role identifies a responsibility or class of users in your organization, such as Manager or Operator. Roles are defined in terms of the access capabilities they grant. In turn, capabilities determine what menu options and actions a user can perform.