The Integrity rx1600, rx2600, rx4640, and rx5670 servers employ dynamic processor resiliency (DPR), too. With
DPR, any CPU generating correctable cache errors at a rate deemed unacceptable is de-allocated from use by the
system. This feature helps protect against a CPU degrading to the point where it may cause system crashes.
DPR works as follows. When excessive errors are reported against a CPU, the CPU is deactivated (that is, the
operating system will not schedule any new processes on it). The system firmware records the CPU's serial
number and the time when this action was taken. From then on, at each poll interval, the system monitor compares
serial numbers to determine whether the CPU has been replaced. If the processor has been replaced, its
history is reset.
If the system is rebooted before the offending CPU has been replaced, the monitor generates a warning message
and immediately de-allocates the CPU. (This kind of CPU de-allocation is supported only in the HP-UX operating
system; it is not supported in Windows or Linux.)
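The DPR sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not HP's firmware or monitor code: the class and method names, the error threshold, and the time window are all assumptions, since the text says only that the error rate must be "deemed unacceptable."

```python
from dataclasses import dataclass, field

# Hypothetical policy values; the source does not state the real threshold.
MAX_ERRORS_PER_WINDOW = 3
WINDOW_SECONDS = 3600.0


@dataclass
class Cpu:
    slot: int
    serial: str
    active: bool = True
    error_times: list = field(default_factory=list)


class DprMonitor:
    """Toy model of dynamic processor resiliency (all names are illustrative)."""

    def __init__(self):
        # slot -> (serial, time of de-allocation); stands in for the
        # record that the system firmware keeps.
        self.deallocated = {}

    def report_correctable_error(self, cpu, now):
        # Keep only errors inside the sliding window, then add the new one.
        cpu.error_times = [t for t in cpu.error_times if now - t < WINDOW_SECONDS]
        cpu.error_times.append(now)
        if cpu.active and len(cpu.error_times) > MAX_ERRORS_PER_WINDOW:
            cpu.active = False  # no new processes are scheduled on it
            self.deallocated[cpu.slot] = (cpu.serial, now)

    def poll(self, cpu_in_slot):
        # At each poll interval, compare serial numbers to detect replacement.
        record = self.deallocated.get(cpu_in_slot.slot)
        if record is not None and record[0] != cpu_in_slot.serial:
            del self.deallocated[cpu_in_slot.slot]  # history is reset

    def on_boot(self, cpu):
        # After a reboot, an unreplaced offender is de-allocated again.
        record = self.deallocated.get(cpu.slot)
        if record is not None and record[0] == cpu.serial:
            cpu.active = False
            return f"WARNING: CPU in slot {cpu.slot} (serial {cpu.serial}) de-allocated"
        return None
```

In this model, swapping in a `Cpu` with a different serial number and calling `poll` clears the stored record, which mirrors the history reset described in the text.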
Comprehensive error logs
All system events are stored in the system event log (SEL) in nonvolatile memory. In addition, system firmware
creates activity and forward progress logs (FPLs) in nonvolatile memory. In all but the most extreme situations—that
is, in more than 95 percent of cases—this information will be sufficient to diagnose system failures to a single
replaceable part. The SEL and FPL are available both to the management processor (and therefore remotely)
and to system-level tools, leading to quick and accurate diagnosis.
Fault management throughout the lifecycle
Fault management is HP’s overall strategy and program to provide a complete value chain for detection,
notification, and repair of system problems. Fault management starts in the design phase, when
hardware and OS designers include capabilities and instrumentation points that provide the ability to detect and
isolate system anomalies. Monitors are created to poll for system health information or to asynchronously respond
to instrumentation points that have been designed into the system to report problems or faults.
Fault management also involves implementing several methods for maintaining historical event information,
allowing preservation of information for analysis or trending. Faults that generate errors and warnings are
automatically logged to syslog, while notes and audit information are copied to an event log. Other options are
available for preserving historical information as well.
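The routing rule in the preceding paragraph (errors and warnings to syslog, notes and audit information to an event log) might be modeled as follows. This is a minimal sketch with hypothetical names; plain Python lists stand in for the real syslog and event log, and the severity labels are assumptions.

```python
# Hypothetical severity labels; the source names the categories but not their spelling.
SYSLOG_SEVERITIES = {"error", "warning"}
EVENT_LOG_SEVERITIES = {"note", "audit"}


def route_event(severity, message, syslog, event_log):
    """Append the event to the matching sink and return the sink's name."""
    sev = severity.lower()
    if sev in SYSLOG_SEVERITIES:
        syslog.append((sev, message))
        return "syslog"
    if sev in EVENT_LOG_SEVERITIES:
        event_log.append((sev, message))
        return "event_log"
    raise ValueError(f"unknown severity: {severity!r}")
```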
Fault management provides immediate alerts of problems, and even of potential problems, as soon as they are
detected, so customers can take corrective action. In some cases, fault monitors can repair faults themselves
or prevent them from occurring.
Capabilities of fault monitors
Fault management, coupled with the monitoring capabilities, keeps tabs on the health of system components and
generates events in near real time when problems develop. These events can trigger corrective action to enable
the system to continue functioning, or they can trigger alerts to systems personnel to appropriately handle the
situation before it becomes more severe.
Fault monitors are able to:
• Poll the system for health information
• Handle asynchronous events that have been designed into the hardware or software
• Perform corrective action when possible
• De-allocate failing memory before it fails (dynamic memory resiliency)
• De-allocate failing processors before they fail (dynamic processor resiliency)
• De-configure failed processors from the working set before the next reboot
• Shut down the system when power failure causes a switch to UPS
• Manage events so that system performance is not hindered in the face of errors
• Provide information on problem causes and what actions to take
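The first three capabilities in the list (polling, handling asynchronous events, and taking corrective action where possible) can be illustrated with a small sketch. It is not the EMS or HP monitor API; `FaultMonitor`, `register`, `poll_once`, and `handle_event` are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Optional


class FaultMonitor:
    """Toy monitor: polls health checks, accepts asynchronous events,
    and tries a corrective action before alerting an operator."""

    def __init__(self) -> None:
        self.checks: Dict[str, Callable[[], bool]] = {}   # True means healthy
        self.repairs: Dict[str, Callable[[], bool]] = {}  # True means repaired
        self.alerts: List[str] = []

    def register(self, name: str, check: Callable[[], bool],
                 repair: Optional[Callable[[], bool]] = None) -> None:
        self.checks[name] = check
        if repair is not None:
            self.repairs[name] = repair

    def poll_once(self) -> None:
        # Polling path: ask every registered component for its health.
        for name, check in self.checks.items():
            if not check():
                self.handle_event(name, "poll detected a problem")

    def handle_event(self, name: str, detail: str) -> None:
        # Asynchronous path: instrumentation points call this directly.
        repair = self.repairs.get(name)
        if repair is not None and repair():
            self.alerts.append(f"{name}: {detail}; corrective action succeeded")
        else:
            self.alerts.append(f"{name}: {detail}; operator action required")
```

Both the polling path and the asynchronous path funnel into `handle_event`, so the decision to repair or to alert lives in one place.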
Notification and integrated enterprise management
Fault management currently uses the HP EMS (Event Monitoring Service) infrastructure for its notification
methodology. EMS enables a wide variety of notification methods, including pager, e-mail, SNMP traps,
system console, system log, text log file, TCP/UDP, and HP OpenView Operations center (OPC) messaging. Fault