
Sentinel3G Concepts

|value <index>
|Return the value of a particular index of a history variable. Omitting the index will return the most recent element.
|[@accesses_kb value] => 0
[@accesses_kb value 14] => 12.5
|-
|value_at <clock>
|The value at a given time (clock format)
|[@free_space value_at 1052824642] => 5
|-
|average
|Return the arithmetic mean of a history variable over its entire history
|[@response_time average] => 9.5
|-
|max
|Return the maximum history value (of the current values)
|[@cpu_usage max] => 97.0
|-
|min
|Return the minimum history value (of the current values)
|[@raw_packets_out min] => 0.0
|-
|earliest_time
|The value of Htime[end] where end is the oldest value
|[@cpu_usage earliest_time] => 1171194067
|-
|earliest_value
|The value of Hval[end] where end is the oldest value
|[@cpu_usage earliest_value] => 51.6
|-
|diff <index>
|Return the difference between the most recent element and the element at the specified index. Omitting the index returns the difference between the most and least recent elements.
|[@cpu_usage diff 5] => -25.4
[@free_space diff] => 15
|-
|rate <index>
|The elements are averaged over the time between the elements. Whatever the unit of the history variable, the result is always “units per second”. For example, if the history is in “MB”, the result will be in “MB per second”.
|[@free_space rate] => 0.6
[@file_size rate 5] => 1.5
|-
|diff_rate <index>
|This is similar to rate, but is used for agents that return a cumulative number. First the difference between the elements is calculated, then this is divided by the time.
|[@total_kbs_sent diff_rate] => 1.5
[@cum_cpu_time diff_rate 1] => 0.2
|-
|count <condition>
|Return the number of values that match a simple condition, such as “< 7”. If <condition> is omitted a count of the current number of history values is returned.
|[@free_space count "< 150"] => 1
[@cpu_idle_hist count "> 90"]
...
|percent <condition>
|This is similar to count, but returns the percentage of values meeting the condition.
|[@response_time percent "< 2000"] => 9.5
|}
Table 5 — Functions for accessing history variables
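
The behaviour of a few of these functions can be illustrated with a short Python sketch. This is not Sentinel3G code: the `History` class, its newest-first list of `(clock, value)` pairs, and the zero-based index are all assumptions made for the example, and only `diff`, `diff_rate`, `count`, and `percent` are modelled.

```python
import operator

# Comparison operators accepted in a simple condition such as "< 7".
_OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
        "!=": operator.ne, "<": operator.lt, ">": operator.gt}


class History:
    """Hypothetical model of a history variable: (clock, value) pairs, newest first."""

    def __init__(self, samples):
        self.samples = list(samples)

    def diff(self, index=None):
        # Most recent value minus the value at `index`;
        # with no index, most recent minus least recent.
        older = self.samples[-1] if index is None else self.samples[index]
        return self.samples[0][1] - older[1]

    def diff_rate(self, index=None):
        # Difference divided by the elapsed time: units per second.
        older = self.samples[-1] if index is None else self.samples[index]
        newest = self.samples[0]
        return (newest[1] - older[1]) / (newest[0] - older[0])

    def count(self, condition=None):
        # Number of values matching the condition, or all values if omitted.
        if condition is None:
            return len(self.samples)
        symbol, threshold = condition.split()
        test = _OPS[symbol]
        return sum(1 for _, value in self.samples if test(value, float(threshold)))

    def percent(self, condition):
        # Like count, but expressed as a percentage of all values.
        return 100.0 * self.count(condition) / len(self.samples)


# A cumulative counter sampled every 60 seconds (newest first).
sent = History([(1180, 400.0), (1120, 310.0), (1060, 220.0), (1000, 130.0)])
print(sent.diff(), sent.diff_rate(), sent.count("< 300"), sent.percent("< 300"))
```

In this model the counter grew by 270 units over 180 seconds, so `diff_rate` reports 1.5 units per second, matching the "units per second" convention described for `rate`.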

Revision as of 07:23, 14 June 2013

This section explains the main concepts underlying Sentinel3G.


How Sentinel3G Works

Sentinel3G has four main functions:

Sentries monitor resources or processes such as devices, subsystems or applications.

Agents collect data about sentries on each host on behalf of the host monitor. An event occurs when a sentry being monitored changes state. Typically, events are diagnosed by a host monitor from data supplied by agents. The state change could result from a single raw value crossing some predetermined threshold; or it could be a trend derived from the raw data, such as the rate of change of a value. Sentinel3G includes file-watching agents that detect events in system files and log files.

Host monitors report events to the Event Manager on the central host.

The Host Monitor takes some action based on the severity of the event, such as running a predefined command. Persistent problems that can’t be resolved automatically are escalated and passed to operations staff for action. Staff are notified of events by a console, and if necessary by some other means such as e-mail.

A console is a kind of ‘heads-up’ display that gives a concise hierarchical view of the current state of the sentries being monitored. It is both a means to alert operators of an event and a means for them to monitor and respond to events.

Consoles can present information in customized views, such as by region, by host, or by function. Different classes of user see an appropriate level of detail: from a broad enterprise-wide summary for managers to fine detail for operators and end-user administrators. Console users can select a predefined view or sort and filter sentry data to help diagnose problems. Reports provide more details about the current state of the sentry. Graphs chart the changes in the value of agent variables and provide a recent history of the state of the sentry.

At any time, a user can see from the console what events are happening and how serious they are. If reports have been defined, the operator can choose to view them. If there are predefined actions attached to the sentry the user can choose one to run.

Related information about an application or system component (basically its sentries, agents, events and responses) is grouped into a knowledge base. Sentinel3G includes a UNIX/Linux knowledge base. Other knowledge bases (such as Oracle, Windows NT/2000/XP, Network Services) are available as add-ons.

Because events are detected on each local host, only state transitions and not raw data need to be reported to the central host, so data traffic is minimal. This means that Sentinel3G itself adds little extra load to your system, even when monitoring large networks.

The Console

The console is the primary user interface to Sentinel3G, and is the most common way for Sentinel3G to report events and for users to monitor sentries and respond to events. A console displays a series of sentries as icons, which are grouped hierarchically into folders. Different users may have different console views, and different privileges controlling what they are able to see and do.

Day-to-day console operations are described in Monitoring Sentries From the Console.

Overlays and Indicator Icons

Each sentry and folder is represented on the console by an icon. Overlays are small icons that modify the appearance of the main icon to show:

This topic lists all of the overlay icons by type and gives a brief description.

Overlays that represent the type of a sentry or folder

Indicators that represent a sentry’s state

If an indicator is not specified for a state, the default indicator specified in the sentry (thermometer or pie chart) will be used. See Indicators that represent data values from a sentry’s variable.

If no overlay is specified in the sentry or state details, the default overlay icon and color for the current severity is used. See Indicators that represent a sentry’s severity.

Icon Where Description
Bottom left This object is a user-defined folder, containing sentries and possibly other sub-folders.
Bottom left This is a locked or system folder: its contents can be modified but the folder itself can’t be removed as it is required by Sentinel3G.
Top right This is an information-only sentry. It has no states. It gives information through its console text and property sheet.


Icon Where Description
Bottom right This sentry is requesting acknowledgement from an operator before changing to another state.
Top right This sentry represents a service that is running.
Top right This sentry represents a service that is not running. Check the console text, notes, or property sheet to find out why.
Top left Notification for this sentry has been disabled.


Indicators that represent a sentry’s severity

If an overlay is not specified for a given sentry, the default indicator representing the current severity is used. The color of the overlay is determined by the severity of the sentry.

The severity of a folder is the maximum severity of all the sentries and sub-folders it contains.

Icon Color Severity Description
Wait Sentry is starting
grey Disabled Sentry is in a state where data is not being returned. Examples: the Host Monitor is down, or the resource itself is disabled.
grey Down Sentry is reporting that a service is not running
Normal Sentry is indicating that there are no problems
blue Information Sentry is reporting matters of interest
orange Warning Sentry is reporting a potential problem
red Alarm Sentry has detected a serious problem that should be investigated as soon as possible
red, flashing Severe Sentry has detected a very serious problem that must be investigated now
magenta, flashing Critical Sentry has detected an extremely serious problem affecting the network or a key application, system, or service. Immediate action is needed.
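
The folder rule stated above (a folder takes the maximum severity of everything it contains) can be sketched in Python. The nested-dict folder layout and the function name are hypothetical. The ordering from normal to critical follows the table, but where wait, disabled, and down rank relative to normal is not stated here, so they are left out of the sketch.

```python
# Assumed ordering, lowest to highest. The wait, disabled, and down
# severities are omitted because their relative rank is not documented here.
SEVERITY_ORDER = ["normal", "information", "warning", "alarm", "severe", "critical"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}


def folder_severity(folder):
    """Return the highest severity among a folder's sentries and sub-folders.

    `folder` is a hypothetical dict with optional "sentries" (a list of
    severity names) and "folders" (a list of nested folder dicts).
    """
    found = list(folder.get("sentries", []))
    found += [folder_severity(sub) for sub in folder.get("folders", [])]
    # Treating an empty folder as normal is an assumption of this sketch.
    return max(found, key=RANK.__getitem__, default="normal")


site = {
    "sentries": ["normal", "information"],
    "folders": [
        {"sentries": ["warning", "normal"]},
        {"sentries": ["normal"], "folders": [{"sentries": ["severe"]}]},
    ],
}
print(folder_severity(site))  # severe
```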

Indicators that represent data values from a sentry’s variable

Two types of overlay icon, called indicators, can represent actual data from a variable.

A percentage value is mapped to either a small pie chart or a thermometer, in increments of at least 10 percent. This gives an immediate indication of the data value. The type of overlay (pie chart or thermometer) may be specified for each sentry.

The amount of the ‘pie’ that is filled in or the height of the thermometer’s filled-in area gives a rough indication of the quantity being reported. Here are some examples:

Indicator Type 0% 30% 50% 80% 100%
Pie Chart
Thermometer

Sentries and States

A sentry is an individual object or resource that is being monitored through Sentinel3G. Some examples:

Sentries are grouped into classes and are represented on the console as icons. Each sentry has an agent (or possibly more than one agent) that collects data on its behalf about the resource or object being monitored. The data determines what state the sentry is in at any time.

An information-only sentry has no states attached to it, but simply provides useful status information to operators in the form of console text or via its property sheet.

You can maintain most things about a sentry from the console, including its constants, actions, agent, and variables. For example, to configure a sentry’s states, just select the sentry and then select Configure > States.

States

A sentry’s state represents its current operating status or condition. The entry condition for each state is evaluated in turn until one evaluates to true. Most sentries have a normal state, indicating that the sentry is operating satisfactorily and requires no action, and a number of abnormal states of increasing severity. For example, a simple sentry that monitors a service may have only a couple of states showing whether the service is running or not running. A sentry that monitors a resource such as disk space or memory may have several states whose severity increases as the availability of the resource decreases.

For each state you can:

A sentry does not have to have a state for every severity level. You can define more than one state for the same severity. Although it is possible to define a large number of states, representing small changes in the sentry, it’s better to have a minimum number of states corresponding to real differences in urgency or severity.

Events

An event is an external incident or condition on a particular host or in a particular application or device that is detected by Sentinel3G and passed to the Event Manager for action. In simple terms, an event is a condition that causes a sentry to move from one state to another state.

Entry condition

The entry condition is a TCL expression made up of any combination of agent variables, constants, text strings, numbers, history variables, boolean values, and TCL functions. Typically an entry condition tests the value of an agent variable against a predefined constant or threshold. Some examples:

The entry conditions should cover all possible values returned by the agent. If none of the entry conditions is true, the sentry is put in undefined state. If a sentry is in Failed state, it indicates a problem with the agent (usually that it failed to start or has never returned any valid data). If a state’s entry condition is left blank it always evaluates to true.
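
The evaluation order described above can be sketched in Python. Real entry conditions are TCL expressions evaluated by the host monitor; here Python callables stand in for them, and the function name, state names, and thresholds are illustrative assumptions only.

```python
def evaluate_state(states, variables):
    """Return the name of the first state whose entry condition is true.

    `states` is an ordered list of (name, condition) pairs. A condition is a
    callable taking the variables dict, or None for a blank condition, which
    always evaluates to true.
    """
    for name, condition in states:
        if condition is None or condition(variables):
            return name
    # No entry condition matched.
    return "undefined"


# Hypothetical disk-space sentry; thresholds are illustrative only.
states = [
    ("very_low", lambda v: v["pct_free"] < 5),
    ("low", lambda v: v["pct_free"] < 20),
    ("sufficient", None),  # blank condition: catches everything else
]

print(evaluate_state(states, {"pct_free": 12}))  # low
```

Placing a state with a blank condition last is one way to make sure the conditions cover every value the agent can return.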

Copying states

When you add a sentry, you can choose to copy the states of another sentry. If the states for the new sentry need to be similar but not identical, you can first copy then edit them. Changes to the states of the new sentry will not affect the original sentry.

Severity

Each state that a sentry can be in has a severity level, representing how serious the event is. When you define each state, the standard severity levels are listed in order of increasing severity from normal to critical.

The severity determines how the sentry is displayed on the console—its color, and if it has an indicator icon, the color of the indicator and whether it flashes. The severity is also used for notification. A notification message will be sent if the severity of a sentry is greater than or equal to either:

Notes about severities

disabled is a special severity that can be used to indicate when a sentry is ‘down’ or otherwise unavailable, but doesn’t require attention. Examples:

information severity shows operators that the sentry has some useful information to report. This can be used as a state above normal state, where there is no problem serious enough to require going into a warning state or higher.

Note that this is different from an ‘information-only’ sentry, which has no states and only exists to provide information.

Expressions

TCL expressions are used when defining state conditions, console text, and variables.

For example, state conditions include an expression which, when evaluated, returns true or false to indicate whether the sentry is currently in that state.

Expressions are written using TCL syntax and can refer to any variables belonging to the sentry’s primary agent or secondary agents. Normally variables are prefixed with "$". However, in console text, variables may instead be prefixed with "&", which displays the variable in a formatted form, including any units. Finally, the history of a variable can be accessed by prefixing the variable with "@". See History Variables for more details.

Example:

Disk $disk I/O rate: $io_rate => Disk hd2 I/O rate: 145.7

Disk $disk I/O rate: &io_rate => Disk hd2 I/O rate: 145.7MB/sec
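
The difference between the two prefixes can be mimicked with a small Python sketch. It is only an illustration of the substitution rule shown in the example; the function name and the separate `units` table are assumptions, not part of Sentinel3G.

```python
import re


def render_console_text(template, values, units):
    """Expand $name to the raw value and &name to the value plus its units."""

    def expand(match):
        prefix, name = match.group(1), match.group(2)
        text = str(values[name])
        if prefix == "&":
            # Formatted form: append the units, if the variable has any.
            text += units.get(name, "")
        return text

    return re.sub(r"([$&])(\w+)", expand, template)


values = {"disk": "hd2", "io_rate": 145.7}
units = {"io_rate": "MB/sec"}
print(render_console_text("Disk $disk I/O rate: $io_rate", values, units))
print(render_console_text("Disk $disk I/O rate: &io_rate", values, units))
```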

The following tables list other internal variables that are also available for use in expressions.


Variable Description
$Sentry The name of the sentry
$Class The name of the sentry's class (aka folder)
$Host The sentry's host
$Instance The sentry's instance (if any)
$Group The name of the instance group (if any)
$State The current state that the sentry is in
$Since The time when the sentry last changed state
$Severity The current severity of the sentry
$PrevState The previous state of the sentry
$Agent The name of the primary agent
$PollTime The polltime of the primary agent in seconds (polled agents only)

Table 1a — Internal variables available in sentry and state expressions


Variable Description
$Agent The name of the agent
$PollTime The polltime of the agent in seconds (polled agents only)
$Instance The agent's instance (if any)
$data The value of the variable as received from the agent (raw variables only)

Table 1b — Internal variables available in raw and derived variable expressions

Actions

Actions are predefined responses associated with a sentry that may be invoked by an operator from the console. Each action is a command that is run on the same host as the host monitor. Actions may be associated with a particular state or may be available at any time.

There are two types:

You can design a single action to work both on selected instances of a multi-instance sentry and on every instance in a selected parent folder. For example, you can set up an action so that the output for every selected instance is combined into one report.

Tasks that don’t require any action or judgement by an operator and can safely be run automatically are better implemented as responses. Data is passed to an action from the host monitor either by being written to the action’s STDIN, or, if the flag Uses agent data? is set to yes, through the environment variables $Sentry, $Host, and $Action. For multi-instance sentries you can refer to a specific named instance or use $Instance, which contains the instance name of the primary agent.

Note: History data and functions and the &<varname> syntax, which are available in state conditions and console text, cannot be used in an action. To pass the value of a history function, use a derived variable.

If you wish to format a value returned by an agent you must do it manually in the command.

Examples: defining reports

Example 1 shows how to define a simple report for a single-instance sentry, without using any agent variables. When the report is run it will display in a browser window the name of this action (‘Sentry Details Report 1’), the date, the name of the sentry, and the host it runs on.

Action Sentry Details Report 1
Type report
Command echo -n "Report '$Action' "; date; echo " Sentry: $Sentry"; echo " Host: $Host"
Display command browser
Uses agent data? no
Reads from STDIN? (N/A)
Export to parent? no

In example 2 the agent variables associated with the sentry are exported to the environment (Uses agent data? yes) so that they can be used in the command.

Action Sentry Details Report 2
Type report
Command echo "Free space on $Filesystem = $pct_free%"
Display command browser
Uses agent data? yes
Reads from STDIN? no
Export to parent? no

When you select a filesystem from the console and run the action, the report will show the free space on that filesystem. If you select multiple filesystems, the command will be run once for each instance, and the output window will show one row for each filesystem.

Example 3 demonstrates another way to make data available to an action, this time by reading from STDIN. The agent data is passed to the action in Functional Database format (a plain-text table, with rows separated by a newline and fields separated by a tab).

Action Sentry Details Report 3
Type report
Command cat -
Display command db_scroll
Uses agent data? yes
Reads from STDIN? yes
Export to parent? no

When you select a filesystem from the console and run the action, the report will show the raw database row containing the filesystem variables. To read this in a script, you would then need to use db_readrow, a Functional Toolset program.

Use this option if you are familiar with the Functional Toolset and wish to use it to manipulate the data. With this method, unlike the previous example, the command is only run once. The database rows are accumulated before piping them to the Command. Try selecting multiple filesystems and running the action. Note that there is one header row and multiple data rows.
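
If the Functional Toolset is not available, the tab-separated layout described above can also be read with general-purpose tools. The following Python sketch relies only on that description (one header row, tab-separated fields, one row per line); the real format may carry more metadata than this, and the sample data and function name are hypothetical.

```python
import csv
import io


def read_rows(stream):
    """Parse tab-separated rows with one header row into a list of dicts."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical STDIN contents for two selected filesystems.
sample = "Filesystem\tpct_free\n/home\t42\n/var\t7\n"
rows = read_rows(io.StringIO(sample))
for row in rows:
    print(f"Free space on {row['Filesystem']} = {row['pct_free']}%")
```

In an actual action command the stream would be `sys.stdin` rather than the in-memory sample used here.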

In Example 4, you export the action to the parent folder (Export to parent? yes). This makes the action available from the context menu when the operator clicks on the folder background (that is, no sentries are selected), or on the parent class folder.

Action Sentry Details Report 4
Type report
Command echo "Free space on $Filesystem = $pct_free%"
Display command browser
Uses agent data? yes
Reads from STDIN? yes
Export to parent? yes

When you run this action by clicking on the background of the folder or on the parent class folder, it is the same as selecting all instances and then running the action. If the action is configured on a single-instance sentry, it is the same as selecting that single sentry and running the action.

Example: defining an action

Example 5 runs a command to stop the service represented by this sentry.

Action Stop service
Type action
Command system_service $Filename stop
Access role Manager
Authenticate yes
Run as user root
In state(s) Confused Running
Uses agent data? yes

Responses

Responses are commands that are run automatically by the Host Monitor when a sentry is in a particular state. You can define a series of responses for each state that are tailored to the severity of the problem.

Each response may run immediately, or there may be a waiting period after the sentry first enters this state or after the running of a previous response. Figure 4 shows an example of the full set of responses defined for a sentry while it is in warning state.

Each response period is cumulative. In other words the period for Response #2 is counted from the end of the period for Response #1. Example: Response #1 is defined to go to a new severity of warning after 120 seconds. Response #2 is defined to notify after 60 seconds, which will be 180 seconds after the sentry entered this state.
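
The arithmetic is a running sum of the waiting periods. A minimal Python sketch (the function name is an assumption):

```python
from itertools import accumulate


def response_times(waits):
    """Convert per-response waiting periods into seconds since state entry."""
    return list(accumulate(waits))


print(response_times([120, 60]))   # [120, 180]
print(response_times([300, 120]))  # [300, 420]
```

The first call reproduces the example above. The second uses the periods from the state-level notification example in Figure 6, where the notification is sent 420 seconds after the sentry enters the low state.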

The response Command can attempt to remedy a situation. If successful it will typically return the sentry to a normal state. If the Command does not succeed, you may choose to leave the sentry in that state, and specify a later response to run another command or to notify someone.

Another possible response is to force an agent to be polled at the end of the response period. This is called ‘firing’ the agent. You can fire the primary agent to refresh the variables used by the sentry, or fire another agent to collect additional data.

Where a sentry experiences occasional temporary situations which usually correct themselves quickly, you may not want to take action or be notified unless the sentry has been in that state for some minimum period.

If a sentry changes state while it is waiting to process a response (that is, before the end of the waiting period), then all responses for this state are cancelled, and any responses for the new state are started.

Example: as free disk space in a filesystem reaches a dangerously low level, Sentinel3G can run a series of commands such as:

Any helpful task that can safely be run without prior checking can be set up as an automatic response to an event. Tasks that require some action or judgement by an operator are better implemented as actions.

Escalation

Another way to respond to an alert is simply to wait for a while to see if the problem corrects itself, then to change to another state at the end of that period.

For example, a sentry may be defined to wait up to 300 seconds in warning state, then to change to alarm state. The change of state may depend on manual confirmation from an operator (Acknowledgement) or it may happen automatically (Escalation).

If the problem is normally transient and self-correcting, you could put the sentry into a warning state for a few minutes. At this point the appearance of the sentry is simply a passive signal that the sentry is not in its normal state. If the sentry is still in warning state at the end of this period, it indicates that the problem is unlikely to resolve itself. In this case you could change the sentry to a more severe state with its own set of responses.

In other cases you might return the sentry to a normal state if no other events have occurred by the end of the period. For example, a warning message appearing in a system log file may indicate a potential performance problem, but if no other messages are logged in the next few minutes it may be safe to return the sentry to normal state.

Another use for escalation is to “chain together” several responses by splitting them over two states. Each state has a maximum of three responses.

Note that it may take several seconds for the escalation to be processed at the end of the waiting period.

Acknowledgement

A sentry may request acknowledgement from an operator before changing to another state. This is usually done to confirm that an operator has been made aware of a probable “one-off” incident before returning the sentry to normal state. For example, if the Bad_SU sentry detects a single failed attempt to gain root privileges, it remains in Report state until:

Prompting for acknowledgement verifies that an operator was made aware of the condition at the time, which can be useful for audit or training purposes. You should provide monitoring notes to help operators understand what their options are when the sentry is in this state, and what will happen next if they acknowledge the alert.

If a sentry is waiting for acknowledgement this overlay icon will appear next to it.

Notification

Sentinel3G can notify a list of staff by e-mail when an event is detected. This is a useful way to alert staff who do not normally run or are not currently running a console. There are three layers or types of notification:

Note that operators can disable notification for selected sentries from the console.

Figure 5 shows a scheme that combines global and sentry-level notification. The NotifyLevel setting is set to severe, so global notification will normally be triggered by any sentry that goes into a state whose severity is severe or critical. There are two exceptions to this: SentryB will send a notification message (perhaps to a different list of recipients) if it goes into a state whose severity is alarm or higher; SentryC will send a notification message only if it goes into a state whose severity is critical.

Figure 5 — Example of both global and sentry-level notification
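
The scheme in Figure 5 can be expressed as a threshold check. The Python sketch below assumes that a sentry-level setting, when present, replaces the global NotifyLevel, which is how the two exceptions are described; the function name, the `SentryA` name, and the dictionary of overrides are hypothetical.

```python
SEVERITY_ORDER = ["normal", "information", "warning", "alarm", "severe", "critical"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}


def should_notify(severity, global_level, sentry_level=None):
    """Return True if a state of this severity triggers notification.

    A sentry-level setting, when present, replaces the global NotifyLevel.
    """
    threshold = sentry_level if sentry_level is not None else global_level
    return RANK[severity] >= RANK[threshold]


NOTIFY_LEVEL = "severe"  # global setting
sentry_levels = {"SentryB": "alarm", "SentryC": "critical"}  # per-sentry overrides

for sentry in ("SentryA", "SentryB", "SentryC"):
    hits = [s for s in SEVERITY_ORDER
            if should_notify(s, NOTIFY_LEVEL, sentry_levels.get(sentry))]
    print(sentry, hits)
```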

Figure 6 shows an example of state-level notification. This sentry waits for 300 seconds after entering low state, then runs a script to try to fix the problem. If the sentry is still in low state after another 120 seconds, a notification message is sent to recipients in opsgroup.

State: sufficient
Severity: normal

State: low
Severity: warning
Response 1: After 300 secs: Command: /usr/local/bin/rmtmpfiles
Response 2: After 120 secs: Notify: opsgroup

State: very_low
Severity: alarm

Figure 6 — Example of state-level notification

Global notification is the simplest form to implement as it is set in one place and applies to all sentries. In more complex environments where different people should be notified when different events occur, it may be more appropriate to configure notification at the sentry, instance group or state level.

Agents and Variables

Agents collect data on behalf of sentries. A typical agent works by polling, or running a command at regular intervals. Each time the command runs, its output is stored in a number of variables. These variables are passed to the host monitor to be processed on behalf of sentries, for example to evaluate what state the sentry is in and to display data on the console.

Other types of agent don’t poll but simply wait to receive data, for example from:

Primary and secondary agents

Each sentry has one agent, called its primary agent, that supplies most or all of its variables. A sentry can also access variables belonging to other agents, which are called its secondary agents. Variables are simply referred to by name: $pct_free, $count. If a primary agent and a secondary agent both have a variable with the same name, the primary agent’s variable is used.
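
The name-resolution rule can be modelled as a layered lookup. This Python sketch uses `collections.ChainMap` purely as an analogy; the variable names and values are made up for the example.

```python
from collections import ChainMap

# Hypothetical variables returned by two agents.
primary = {"is_up": 0, "response_time": 2300}
secondary = {"is_up": 1, "pct_free": 35}

# ChainMap searches its mappings left to right, so the primary agent wins
# whenever both agents define the same name.
variables = ChainMap(primary, secondary)

print(variables["is_up"])     # 0, from the primary agent
print(variables["pct_free"])  # 35, only the secondary agent has it
```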

There is an important difference between primary and secondary agents that you should be aware of. A sentry’s state evaluations are normally done when its primary agent returns data, not when the secondary ones do. This can lead to some unexpected behaviour if the secondary agent data is out of date.

For example: there are two agents. The first agent monitors whether the Staff database is up or down. The other agent monitors whether the Payroll application (which happens to use the Staff database) is up or down. There is a sentry for the application. This sentry has different states to distinguish between the Payroll application being down because the Staff database is down, and the application being down for another reason.

You would configure the Staff database agent as a secondary agent so you could use the "is_up" variable that belongs to it. However, if the poll times of the two agents are not the same (and they usually won't be) there is a potential problem. You can have a situation where the application agent reports that the Payroll application is down because it has detected that the Staff database is down, but the database agent hasn’t had its poll yet, and still ‘thinks’ the Staff database is up.

The solution is to ‘trigger’ the sentry, which forces its state to be reevaluated when the secondary agent returns new data. The effect is to force the primary and secondary agents to synchronize their polling.

Discovery program

This is an optional command that is run before the agent starts. Its job is to return an exit status of true or false based on the existence or status of a resource. If the discovery program returns false, this agent and its associated sentries will not be started. This means the same set of KBs can be installed on several servers, and an agent on a particular server can be switched off if it ‘discovers’ that the resource it monitors is not present.

Here are two examples:

Monitoring file updates: the FileInfo agent

Sometimes you may need to monitor when a particular file or files change in some way. For example, you could log when a system file such as the password file has been updated and perhaps generate an alert.

Sentinel3G provides a standard agent that can monitor these events:

"Windowing" States Based on Time: the Clock Agent

The Clock agent can be used to stop a sentry from monitoring during particular periods. For example, if you run batch jobs between 11pm and 6am daily that use lots of CPU, you don't want to be notified if the run_queue gets too high as this is expected during these times. So you "window" the monitoring.

Add the Clock agent as a secondary agent to the sentry you wish to window (in our example, Run_Queue).

Add a new state to the sentry, called Not_Monitored, and give it a severity of Disabled. In the Condition field, enter a boolean expression describing the time you want to exclude the sentry from monitoring. In our previous example, where the batch jobs run between 11pm and 6am, this would be:

$Hour >= 23 || $Hour < 6

Make sure that the Not_Monitored state appears at the top of the list of states so that its condition is evaluated first.

If the requirement was to disable monitoring during 11pm - 6am Monday to Friday only, it gets a bit more complicated, because you need to remember that Friday night's batch jobs actually go until 6am on Saturday:

$Hour >= 23 && $DayOfWeek >= 1 && $DayOfWeek < 6 || $Hour < 6 && $DayOfWeek > 1 && $DayOfWeek <= 6
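
To check a window like this before deploying it, the same logic can be tried against sample timestamps. The Python sketch below mirrors the condition above; the function name is an assumption, and the day-of-week value is converted so that Sunday is 0, as in the Clock agent.

```python
from datetime import datetime


def not_monitored(now):
    """Python rendering of the Monday-to-Friday window condition."""
    hour = now.hour
    day_of_week = now.isoweekday() % 7  # Sunday = 0, as in the Clock agent
    return (hour >= 23 and 1 <= day_of_week < 6) or \
           (hour < 6 and 1 < day_of_week <= 6)


print(not_monitored(datetime(2013, 6, 14, 23, 30)))  # Friday 11:30pm
print(not_monitored(datetime(2013, 6, 15, 2, 0)))    # Saturday 2am
print(not_monitored(datetime(2013, 6, 15, 23, 30)))  # Saturday 11:30pm
print(not_monitored(datetime(2013, 6, 17, 2, 0)))    # Monday 2am
```

The first two timestamps fall inside the window (Friday night's batch run, which continues into Saturday morning); the last two fall outside it.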

Table 2 lists the variables you can use to window monitoring for a sentry.

Variable Type Description
Day number Day of the month (1-31)
DayName string Name of the day of the week, capitalized, e.g. Monday
DayofWeek number Day of the week as a number (Sunday = 0)
DayofYear number Day of the year (1-366)
Hour number Hour of the day (0-23)
LastDayofMonth boolean True when today is the last day of the month
LastWeekofMonth boolean True when within 7 days of the end of the month
Minute number Minute in the hour (0-59)
Month number Month as a number (1-12)
MonthName string Name of the month, capitalized, e.g. June
Time clock Number of seconds since 1st January 1970 GMT
TimeOfDay string Time in the form HH:MM (00:00 - 23:59)
TimeZone string Timezone configured on the system as a string, e.g. GMT
Week number Week of the year (0-52), week begins Sunday
Year number 4 digit year, e.g. 2003

Table 2 — Clock agent variables

Process Monitoring: the ProcessInfo Agent

The ProcessInfo agent provides data for a process monitoring console on each host.

The ProcessInfo agent returns data about processes running on the local (Host Monitor) host [see ps(1)]. It is typically used to determine whether a process is running or to monitor its CPU or memory usage. ProcessInfo is a multi-instance agent, whose instances are usually the names of the processes being monitored.

They must be specified in the Instances field of each sentry using this agent.

Processes are matched to instances by doing pattern matches on the Command field (as returned by the ps -efl command). If the Agent data field of an instance is NULL, then an exact string match is performed using the instance name. Otherwise the Agent data is interpreted as an unanchored full regular expression.

Note that if one instance matches more than one process, only the details of the first process found are returned. However, the count variable is set to the number of matching processes.
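The matching rules can be sketched in Python as follows. This is an illustration only: the function match_instance and the sample command list are hypothetical, the list stands in for the Command field of ps output, and Python's re syntax is not necessarily the regular expression dialect the agent uses.

```python
import re


def match_instance(commands, instance_name, agent_data=None):
    """Return (first matching command or None, number of matches)."""
    if agent_data is None:
        # No Agent data: exact string match on the instance name.
        matches = [c for c in commands if c == instance_name]
    else:
        # Agent data present: unanchored regular-expression search.
        pattern = re.compile(agent_data)
        matches = [c for c in commands if pattern.search(c)]
    first = matches[0] if matches else None
    return first, len(matches)


commands = [
    "/usr/sbin/sshd",
    "/usr/sbin/httpd -k start",
    "/usr/sbin/httpd -k start",
    "cron",
]
print(match_instance(commands, "cron"))                    # ('cron', 1)
print(match_instance(commands, "web", agent_data="httpd"))  # ('/usr/sbin/httpd -k start', 2)
```

In the second call the instance name "web" is only a label. The match is driven entirely by the Agent data pattern, and count reports both httpd processes even though only the first is returned.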

The variables returned by the ProcessInfo agent are:

command
The command running (the full name including command line options). Note that it may be truncated if long.
count
The number of matching processes found.
cpu
The number of CPU seconds used by the process.
pid
The numeric process ID.
ppid
The numeric parent process ID.
priority
The numeric priority at which the process is running.
size
The size of the memory image of the process.
state
The state of the process (see ps(1)).
tty
The controlling terminal of the process.
user
The name of the user owning the process.

Agent Classes and Variables

Agents make data available to sentries in the form of variables. The agent class tells Sentinel3G the format and location (e.g. STDOUT, a file name) of the agent data, how to parse it, and how to assign key data to variables.

The format of the agent output, and the way you identify which part of it to assign to a variable, differs depending on the agent class. This topic explains the attributes of each agent class.

API

An external application sends data via the Sentinel3G API. The application must be instrumented to send a string of variable names and their values to the host monitor at certain processing points, such as when a transaction is committed.

The API class is different from other agent classes in that data is ‘pushed’ to the host monitor at intervals decided by the external application, rather than being ‘pulled’ in by Sentinel3G. Therefore you don’t specify a column name when adding a variable. Instead you define one variable for each varname=value pair that is passed in the SENAPIdata command by the external application.

DB

The agent returns data in Functional Database format (a set of one or more records, each containing text fields delimited by tabs and terminating in a newline). Typically the data comprises several fields or one or more whole rows returned as a result of a query on a Functional Database table.

Each column name that you assign to an agent variable is a field name as specified in the Functional Database dictionary entry.

ExitStatus

The agent returns the exit status of the command. This can be used to monitor scheduled processes such as batch jobs and backups where there are a few common exit statuses, each relating to a different error condition. Example: when a backup job fails, the sentry can translate the exit status into a meaningful console message (such as "media change failed" or "error writing to device") and provide appropriate responses and actions.

You don’t need to specify a column name when adding a variable to store the exit status. Instead you define one variable of type raw, leaving the Column field blank. The exit status of the agent command will automatically be assigned to this variable.
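As a rough illustration of the idea, the Python sketch below runs a command, captures its exit status, and translates it into a console message. The status numbers and the MESSAGES table are hypothetical (real backup tools define their own codes), and the child process is a stand-in for the agent command.

```python
import subprocess
import sys

# Hypothetical status-to-message table for an imaginary backup job.
MESSAGES = {
    0: "backup completed",
    1: "media change failed",
    2: "error writing to device",
}


def describe_exit_status(status: int) -> str:
    """Translate an exit status into a message suitable for a console."""
    return MESSAGES.get(status, f"unknown exit status {status}")


# Stand-in for the agent command: a child process that exits with status 2.
result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(2)"])
print(describe_exit_status(result.returncode))  # error writing to device
```

In Sentinel3G itself the translation would be done by the sentry's states, each with a condition on the raw exit status variable.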

LogFile

A convenient way of detecting events in an existing application with minimal intrusion is by monitoring its log file(s) for certain messages. The LogFile agent class allows alarms to be generated based on the contents of log files such as:

The agent searches in the file for messages that match a pattern. In the Agent options form you can specify the file name, a select pattern to select records of interest, and an extract pattern for each text string in the record that must be assigned to a variable.

The log file may contain a mixture of messages of different types but typically we are only interested in one type. If you are interested in differently formatted messages you could define one agent per record type.

Agents in the LogFile class generate one or more lines of text output, such as an error message. Table 3 explains how to split the data into patterns or columns. Table 4 explains how to assign each column to a variable.
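The select-then-extract flow can be sketched in Python. The log line format and all three patterns here are hypothetical, and the use of a capture group to mark the text to extract is an assumption; the source does not state how Sentinel3G extract patterns delimit the captured string.

```python
import re

# Hypothetical patterns for lines such as
# "2013-06-14 07:06:01 ERROR disk /dev/sda1 full"
SELECT = re.compile(r"ERROR")        # select pattern: picks records of interest
EXTRACT = [
    re.compile(r"^\S+ (\S+)"),       # extract pattern 1 -> column 1: time of day
    re.compile(r"ERROR (.*)$"),      # extract pattern 2 -> column 2: message text
]


def parse_line(line):
    """Return the extracted columns, or None if the line is not selected."""
    if not SELECT.search(line):
        return None
    columns = []
    for pattern in EXTRACT:
        m = pattern.search(line)
        columns.append(m.group(1) if m else None)
    return columns


print(parse_line("2013-06-14 07:06:01 ERROR disk /dev/sda1 full"))
print(parse_line("2013-06-14 07:06:02 INFO started"))
```

The first line is selected and yields two columns; the second does not match the select pattern and is ignored, which is how a mixed log file can be reduced to one record type.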

SNMPPolled

The agent polls for the results of SNMP ‘Get’ requests. Typically these requests test the current status of a managed object in an SNMP MIB, such as a device or port.

Each column name that you assign to an agent variable must be an object ID as specified in the SNMP MIB.

Note: This agent class is only available if the SNMP KB has been installed.

Text

This is used to filter the output from a command. The agent runs the command, which writes text output to STDOUT. If the output is complex or split over several lines, you can use the Agent options form to filter out extraneous text such as blank lines, header lines, and labels.

Agents in the Text class generate one or more lines of text output, such as a formatted report. Table 3 explains how to split the data into patterns or columns.

Assigning Text and Log File Data to Variables

For agents in the Text and LogFile classes, the data is split into one or more fields, which are identified by number. How the fields are split is determined by the ‘Split data by’ field in the ‘Agent options’ form, as shown in Table 3:

Split data by Notes
column The line is not split into fields. All variables must be identified by character position on the line.
whitespace The line is split into a series of fields separated by whitespace. The first field is column 1, the second is column 2, and so on.

Example: to assign to the variable the characters from the start of the line to the first whitespace character, enter 1 in the Column field.

tab The line is split into a series of fields, each separated by a tab. The first field is column 1, the second field is column 2, and so on.

Example: to assign to the variable the characters between the first and second tab, enter 2 in the Column field.

pattern Specify in the Pattern line fields of the Agent Options form one or more extract patterns. The first extract pattern is treated as column 1, the second extract pattern column 2, and so on.

Example: to assign to the variable the string that matches the third extract pattern, enter 3 in the Column field.

Table 3 — How agent data is split into one or more numbered fields

When you add a variable, you specify in the Column field which of these columns to assign to the variable.

For agents where the data is split by pattern, you simply enter the column number <col>. If the data is split by whitespace or tab, you can enter a single column number if the agent returns all data on one line. If the agent returns several lines of data, you can specify a particular line by prefixing the column number with the line number, like this: <line>:<col>.

For agents where the data is split by column, each line is treated as a string of characters. You must specify in the Column field a range of characters in the form c<pos>-<pos>. If the agent returns several lines of data, you can specify a particular line by prefixing the character range with the line number, like this:

<line>:c<pos>-<pos>
Column field Split data by Notes
3 pattern the third extract pattern in the output
2:3 whitespace the third field in the second line of output
c10-40 column the tenth to the fortieth characters inclusive
3:c10-11 column the tenth and eleventh characters on the third line of output
2:c4-end column from the fourth to the last character on the second line

Table 4 — Examples: assigning columns to a variable

Note: The Agent Options form includes several fields (e.g. Clear pattern, Skip initial lines, Strip initial chars) that allow you to strip unwanted lines and characters from the output before assigning columns to variables. All processing of whitespace, tabs, column numbers, or patterns takes place after these fields have been processed. For example, if Strip initial chars = 6, the column that the variable sees as c1 would actually be the 7th character in the original data (assuming that Skip pattern hasn’t removed even more characters).
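The Column field notation from Table 4 can be modelled with a small Python function. This is a sketch under stated assumptions, not Sentinel3G's parser: the function name extract is hypothetical, a spec without a line prefix is assumed to refer to the first line, and pattern-split data is not handled.

```python
def extract(spec, lines, split_by):
    """Apply a Column-field spec such as '3', '2:3', 'c10-40' or '2:c4-end'."""
    line_no = 1                      # assumption: no prefix means the first line
    if ":" in spec:
        prefix, spec = spec.split(":", 1)
        line_no = int(prefix)
    line = lines[line_no - 1]

    if spec.startswith("c"):
        # Character range, 1-based and inclusive; 'end' means the last character.
        start, stop = spec[1:].split("-")
        end = len(line) if stop == "end" else int(stop)
        return line[int(start) - 1:end]

    # Field number, 1-based; split on tabs or on runs of whitespace.
    fields = line.split("\t") if split_by == "tab" else line.split()
    return fields[int(spec) - 1]


lines = ["Filesystem  Size  Used", "/dev/sda1   100G  42G"]
print(extract("2:3", lines, "whitespace"))   # 42G
print(extract("c1-10", lines, "column"))     # Filesystem
print(extract("2:c13-16", lines, "column"))  # 100G
```

Any stripping done by the Agent Options fields would be applied to the lines before a function like this sees them, which is why character positions are counted from the stripped text.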

Trigger Variables

A trigger variable is used to compare an earlier value of a variable with its current value. Trigger variables ‘remember’ a value from an earlier poll. (The name comes from the way in which the saving of the variable is triggered by a state change.) All other variables are set or recalculated every time the agent returns data, meaning that the previous value is overwritten.

Example: batches of update transactions are added to a spool file once or twice a day. You want to write a sentry that notifies you whenever the spool file’s modification time changes. Using the FileInfo agent, you create a raw variable called mtime to store the current modification time, and a trigger variable called prev_mtime.

The Initial value of prev_mtime is set to $mtime, and the Expression is also set to $mtime.

Next you create a sentry with two states:

When the file changes, the operating system updates its modification time. The sentry detects that the new value for $mtime is different from $prev_mtime, and changes to NewTrans state. When an operator acknowledges the event, indicating that the new transactions have been noted, the sentry is returned to normal state.

At this point the trigger variable is recomputed, setting $prev_mtime to the new $mtime.
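The interplay between the raw variable and the trigger variable can be sketched in Python. The class FileSentry is hypothetical, and it makes one simplifying assumption: the trigger variable is recomputed only when the sentry returns to the normal state, as described in the example above.

```python
class FileSentry:
    """Two-state sentry: 'normal' and 'NewTrans'."""

    def __init__(self, mtime):
        self.state = "normal"
        self.prev_mtime = mtime      # trigger variable, Initial value: $mtime

    def poll(self, mtime):
        # mtime is a raw variable: it is refreshed on every poll.
        if self.state == "normal" and mtime != self.prev_mtime:
            self.state = "NewTrans"
        return self.state

    def acknowledge(self, mtime):
        # The operator acknowledges the event; the state change recomputes
        # the trigger variable from its Expression ($mtime).
        self.state = "normal"
        self.prev_mtime = mtime


sentry = FileSentry(mtime=100)
print(sentry.poll(100))   # normal
print(sentry.poll(200))   # NewTrans
sentry.acknowledge(200)
print(sentry.poll(200))   # normal
```

If prev_mtime were an ordinary variable it would be overwritten on every poll and would always equal mtime, so the change could never be detected.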

History Variables

History variables store the recent values returned by an agent variable. They can be used to generate a realtime graph showing recent changes in the data, and in state conditions, to average out spikes and gaps in the data. You can keep either a set number of values, or keep all values in a set period.

Note that history is suited to fairly short-term analysis. For longer term analysis such as capacity planning, use logged variables.

Using history variables in realtime graphs

You can generate a realtime graph showing recent changes in a variable. If a variable’s history has been saved, and the graph is defined to graph the last N values, it will use the variable’s history to get these values (or as many as there are available).

Using history variables in state conditions

You can use history variables in state conditions to handle exceptions in the data such as ‘spikes’ (high or low values which are transient and do not need to be displayed or acted on), to calculate an average over a number of polls, or to calculate a rate when the agent only returns a raw count.

Functions for Accessing History Variables

This topic describes the methods or functions that can be performed on history variables. You would typically use these in an expression, either in a state condition or in a derived variable.

History variables can be thought of as an array containing two fields: the value and the time. The array is accessed backwards in time: index 0 is the current value, 1 is the previous value, and so on. It can be written as two arrays: Hval[n] and Htime[n]. You can reference the variable’s history by putting "@" in front of the variable name (normally you refer to a variable by putting a "$" in front of the name, which returns the current value). You access history variables using one of the predefined methods. The TCL syntax is:

[@<hist-var> <method> <optional params>]

Example: [@cpu_idle_hist value 1]

This returns the previous value (index 1) of the history variable cpu_idle_hist. (Note the square brackets around the call to the method.) You can use this within an expression or condition:

[@cpu_idle_hist value 1] / 100.0

Table 5 lists the functions available to process history variables.

Function Description Examples
value <index> Return the value of a particular index of a history variable. Omitting the index will return the most recent element. [@accesses_kb value] => 0

[@accesses_kb value 14] => 12.5

value_at <clock> The value at a given time (clock format) [@free_space value_at 1052824642] => 5
average Return the arithmetic mean of a history variable over its entire history [@response_time average] => 9.5
max Return the maximum history value (of the current values) [@cpu_usage max] => 97.0
min Return the minimum history value (of the current values) [@raw_packets_out min] => 0.0
earliest_time The value of Htime[end] where end is the oldest value [@cpu_usage earliest_time] => 1171194067
earliest_value The value of Hval[end] where end is the oldest value [@cpu_usage earliest_value] => 51.6
diff <index> Return the difference between the most recent element and the element at the specified index. Omitting the index returns the difference between the most and least recent elements. [@cpu_usage diff 5] => -25.4

[@free_space diff] => 15

rate <index> The elements are averaged over the time between the elements. Whatever the unit of the history variable, the result is always “units per second”. For example, if the history is in “MB”, the result will be in “MB per second”. [@free_space rate] => 0.6

[@file_size rate 5] => 1.5

diff_rate <index> This is similar to rate, but is used for agents that return a cumulative number. First the difference between the elements is calculated, then this is divided by the time. [@total_kbs_sent diff_rate] => 1.5

[@cum_cpu_time diff_rate 1] => 0.2

count <condition> Return the number of values that match a simple condition, such as “< 7”. If <condition> is omitted a count of the current number of history values is returned. [@free_space count "< 150"] => 1

[@cpu_idle_hist count "> 90"]

(This returns the number of values whose value is greater than 90.)

percent <condition> This is similar to count, but returns the percentage of values meeting the condition. [@response_time percent "< 2000"] => 9.5

Table 5 — Functions for accessing history variables
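To make the indexing and the difference-based functions concrete, here is a Python model of a history variable. It is a sketch of the semantics described in Table 5, not Sentinel3G code: the class History is hypothetical, conditions are passed as Python predicates rather than strings such as "< 150", a zero elapsed time is assumed to yield 0.0, and rate is left out because its exact averaging rule is not spelled out in the table.

```python
class History:
    """Newest-first list of (value, time) pairs: index 0 is the current value."""

    def __init__(self, samples):
        # [(Hval[0], Htime[0]), (Hval[1], Htime[1]), ...]
        self.samples = list(samples)

    def value(self, index=0):
        return self.samples[index][0]

    def average(self):
        return sum(v for v, _ in self.samples) / len(self.samples)

    def diff(self, index=None):
        # Omitted index: compare the newest element with the oldest one.
        index = len(self.samples) - 1 if index is None else index
        return self.value(0) - self.value(index)

    def diff_rate(self, index=None):
        # Difference in value divided by the elapsed time in seconds.
        index = len(self.samples) - 1 if index is None else index
        elapsed = self.samples[0][1] - self.samples[index][1]
        return self.diff(index) / elapsed if elapsed else 0.0

    def count(self, test=None):
        # No condition: the number of values currently held.
        return sum(1 for v, _ in self.samples if test is None or test(v))

    def percent(self, test):
        return 100.0 * self.count(test) / len(self.samples)


# Four polls, one minute apart, newest first.
h = History([(40.0, 1300), (55.0, 1240), (60.0, 1180), (65.0, 1120)])
print(h.value(1))                    # 55.0
print(h.diff())                      # -25.0
print(h.percent(lambda v: v > 50))   # 75.0
```

A state condition that averages out spikes would compare something like h.average() or h.percent(...) against a threshold instead of testing the current value alone.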

Notes:

Using Standard TCL Functions

To use a function, enclose it in square brackets. For example: [string range "hello" 0 1] will return the value of “he”. All standard TCL functions are available.

Using variables in functions

Scalar variables are referenced by adding a $ to the front. For example, if you had a string called hostname, you could convert it to uppercase using this command: [string toupper $hostname].

History variables are referenced by adding an @ to the front. History variables cannot be used in standard TCL functions, only in the history functions listed in Table 5. History functions are called slightly differently from scalar functions in that the variable comes first. For example, if you wanted to calculate the average over a history variable called response_time you would enter:

[@response_time average]

If the function takes parameters, they come after the function name:

[@my_history function <args>]

History variables are indexed in reverse order. The newest element will always be at index 0. Therefore index 1 is the second most recent and index 5 is the sixth most recent element.

Additional scalar functions

Table 6 lists some additional functions. The examples all use numbers, but you can replace any of these with scalar variables (e.g. $my_variable). If you reference a variable that does not exist, the Host Monitor log file will display a message and the agent or sentry will not be started.

Table 6 — Additional TCL functions

Function Description Examples
percent <value> <total> Divide <value> by <total> and return answer as a percentage.

If <total> is zero, the return value is always zero.

[percent 45 100] = 45.0

[percent 75 150] = 50.0

[percent 100 0] = 0

div <v1> <v2> Safely divide <v1> by <v2>.

If <v2> is zero, the return value is always zero.

[div 5 10] = 0.5

[div 10 0] = 0

round <n> <multiple> Round <n> to the nearest <multiple>. [round 12345 100] = 12300

[round 1.2345 0.01] = 1.23

[round 1.3 0.5] = 1.5

hostname <IP address> Return the host name for a given IP address [hostname 99.99.99.99] = www.s3gxyz.com
clock_to_db <secs> Convert a time in “clock seconds” to internal date/time (YYYYMMDD.hhmmss). [clock_to_db 1052824642] = 20030513.121722
db_to_clock <date> Convert internal format date/time to “clock seconds”. [db_to_clock 20030513.121722] = 1052824642
fmt_clock <secs> Convert “clock seconds” date/time to display format. [fmt_clock 1052824642] = 13/05/03-12:17

[fmt_clock 0] = 01/01/70-01:00

fmt_boolean <val> Format boolean type for display. [fmt_boolean 0] = false

[fmt_boolean 1] = true

fmt_date <date> Format date type (YYYYMMDD) for display. [fmt_date 20030513] = 13/05/03
fmt_datetime <date> Format datetime type (YYYYMMDD.hhmmss) for display. [fmt_datetime 20030513.121722] = 13/05/03-12:17
fmt_uptime <secs> Format uptime (in seconds) for display. [fmt_uptime 0] = 0 secs

[fmt_uptime 300] = 5.0 mins

[fmt_uptime 4000] = 1.1 hrs

[fmt_uptime 100000] = 1.2 days
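The arithmetic helpers in Table 6 are simple enough to model in Python, which can be handy for checking a threshold expression by hand. These are approximations of the documented behaviour, not the Sentinel3G implementations; round_to is named to avoid clashing with Python's built-in round, and how Sentinel3G breaks exact ties when rounding is not stated (Python rounds ties to the nearest even integer).

```python
def percent(value, total):
    """<value> as a percentage of <total>; zero when <total> is zero."""
    return 0 if total == 0 else 100.0 * value / total


def div(v1, v2):
    """Safe division: zero when the divisor is zero."""
    return 0 if v2 == 0 else v1 / v2


def round_to(n, multiple):
    """Round n to the nearest multiple (ties follow Python's round())."""
    return round(n / multiple) * multiple


print(percent(75, 150))      # 50.0
print(div(10, 0))            # 0
print(round_to(12345, 100))  # 12300
print(round_to(1.3, 0.5))    # 1.5
```

The zero-divisor guards are the point of percent and div: a condition such as a percentage of a total that is briefly zero still evaluates instead of raising an error.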

Constants and Thresholds

Constants are like variables, but unlike variables are associated with a sentry rather than an agent. They are typically used to set different thresholds for sentries that share the same states. This is useful for monitoring different “sizes” of the same type of resource using different criteria. Constants are by convention written in UPPER CASE to differentiate them from variables.

How it works: States refer to constants by name, so the same set of states can be shared between two or more sentries. Because constants are not shared, you can set the constants of each sentry to different values.

For example, the acceptable minimum amount of free space on a filesystem depends on its size and volatility. 3% may be an acceptable threshold for a fairly static 100 GB filesystem, but dangerously low for a volatile 15 GB filesystem. In this case you could have each sentry sharing the same states, including a state called Very_Low.

The entry condition for this state would test the current value reported by the agent against a constant, called VERY_LOW. The difference is that for the large_filesys sentry, the constant VERY_LOW is set to 3%, while for the small_filesys sentry, the constant VERY_LOW is set to 8%:

sentry state name (shared) entry condition for this state VERY_LOW constant
small_filesys Very_Low $pct_free < $VERY_LOW 8
large_filesys Very_Low $pct_free < $VERY_LOW 3

Table 7 — Different thresholds for sentries that share the same states
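The arrangement in Table 7 can be sketched in Python: one shared entry condition that refers to the constant by name, and a separate set of constant values per sentry. The function very_low and the dictionary SENTRY_CONSTANTS are hypothetical names used only for this illustration.

```python
# Shared state: the entry condition is written once and refers to the
# constant by name ($pct_free < $VERY_LOW).
def very_low(pct_free, constants):
    return pct_free < constants["VERY_LOW"]


# Per-sentry constants: same name, different values.
SENTRY_CONSTANTS = {
    "small_filesys": {"VERY_LOW": 8},
    "large_filesys": {"VERY_LOW": 3},
}

# With 5% free, only the small filesystem enters the Very_Low state.
for sentry, constants in SENTRY_CONSTANTS.items():
    print(sentry, very_low(5, constants))
```

Changing a threshold then means editing one constant on one sentry, without touching the shared state definition.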

Constants may also be used as a visual aid on realtime graphs.

Note: The values of constants may also be set in Instance Groups, and these will override the values defined in the sentry but only for sentries in that particular instance group. In fact, this example is probably better implemented using a single sentry with two Instance Groups.

Sentinel Processes and Configuration

Event Manager

The Event Manager is a central process that collects state information from all host monitors and updates the icons and data on the consoles as required.

Host Monitor

The host monitor is the main processing ‘engine’. One host monitor process runs on each Sentinel3G host. The program is responsible for:

Knowledge Base

A sentry monitors a particular component or subsystem of your operating system, hardware and applications. A folder is a related set of sentries. Sentries and folders are themselves grouped into knowledge bases.

Several knowledge bases are available for Sentinel3G, including knowledge bases for operating systems, databases, web servers and applications such as COSmanager.

You can add your own custom knowledge bases to hold details of sentries you define yourself.

Host Monitor API

The Host Monitor API can be used to instrument existing applications to send data for monitoring direct to the Host Monitor, rather than having to write a script or program which polls for this data. This is an extremely flexible interface—you just need to tell Sentinel3G what variables to expect, and their types.

Logging

Sentinel3G maintains the following types of log file:

All records are time-stamped.

The status logs EventMgr and HostMon can be viewed from the Logs menu on the console.

Note that there is some overlap in the state change logging done by the Host Monitor and the Event Manager. This ensures that state changes are still logged even if the Event Manager is down, and keeps network traffic to a minimum.

Global default settings for logging

To conserve disk space, you can control the amount of data that is logged. Agent variable logging can be varied according to the state of a particular sentry. While a sentry is operating normally we are not interested in the exact values being returned by the agent. Therefore, at lower severity levels there is little need to collect data beyond recording state changes. At higher severity levels such as alarm you may wish to log variable values more often, perhaps once every poll.

In the global Sentinel settings you can specify both a logging frequency (DefLogTime) and the minimum severity level at which it operates (DefLogSeverity).

There is also the option to specify different settings for particular sentries. DefLogTime specifies how often to log data for use in logged data reports. At this interval, the latest data values will be written to disk. DefLogSeverity is the minimum severity at which to start logging data every poll.

For example, if you have specified a log time of 30 minutes and a minimum severity of alarm, under normal conditions Sentinel3G logs a single data point every 30 minutes. When the sentry goes into a state whose severity is alarm or higher, every data point is logged until the severity goes back below the minimum.
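The decision described in this example can be written as a short Python function. It is a sketch only: should_log and its parameters are hypothetical, and the SEVERITIES list is an assumed ordering in which only the name alarm comes from the text above.

```python
# Assumed severity ordering, lowest first; only "alarm" is named in the text.
SEVERITIES = ["normal", "warning", "alarm", "critical"]


def should_log(now, last_logged, severity,
               log_time=30 * 60, min_severity="alarm"):
    """Decide whether the data point polled at 'now' (seconds) is logged."""
    if SEVERITIES.index(severity) >= SEVERITIES.index(min_severity):
        # At or above the minimum severity: log every poll.
        return True
    # Otherwise: log one data point per logging interval.
    return now - last_logged >= log_time


print(should_log(600, 0, "normal"))   # False: only 10 minutes since last log
print(should_log(1800, 0, "normal"))  # True: 30 minutes have elapsed
print(should_log(600, 0, "alarm"))    # True: severity threshold reached
```

Here log_time plays the role of DefLogTime and min_severity the role of DefLogSeverity.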

Reports on logged data

Reports are provided to extract and summarize data from the data logs, and to graph the value of numerical data. The Service Level Report searches the EMdata log and produces a summary of the amount of time the selected sentries have spent in each state or severity. The Logged Data Report searches the HMdata log for recorded values of particular variables. You choose the variables to display, and a line graph for those variables will be drawn over the chosen period.

Managing log files

The management of logs is integrated with the COSmanager™ audit trail facility, which provides for viewing and cycling (pruning) of log files.

If COSmanager is not installed, an automatically scheduled task such as a cron job should be set up to regularly compress and archive a copy of each log file. Once copies have been archived the original logs can be reset to save disk space.

Access Control via Roles and Capabilities

Each Sentinel3G user has one or more roles. Each role identifies a responsibility or class of users in your organization, such as Manager or Operator. Roles are defined in terms of the access capabilities they grant. In turn, capabilities determine what menu options and actions a user can perform.