On an X Architecture compute node, both hardware alerts and UEFI alerts are detected and sent to the IBM Flex System Manager.
The following diagram shows the flow of alerts from an X Architecture compute node to the IBM Flex System Manager:
Alert flow (hardware released 11/2014 or later)
To understand how alerts flow through the Flex System chassis when an error is detected on an X Architecture compute node, consider an example in which the correctable Error Correction Code (ECC) memory error logging threshold for an X Architecture compute node is reached. This event is a predictive failure alert (PFA), which means that the X Architecture compute node continues to function, but a memory failure might occur at some point. When the management processor on the X Architecture compute node detects this problem, the following actions occur:
- The problem is logged by the management processor in the system event log on the X Architecture compute node.
The following example shows how the event might appear in the system event log:
58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number %
at address %. MCx_Status contains % and MCx_Misc contains %.
UEFI also detects this problem and logs the same event in the system event log:
58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number %
at address %. MCx_Status contains % and MCx_Misc contains %.
- An alert is sent from the management processor on the X Architecture compute node to the CMM and posted to the event log on the CMM.
The following example shows the error that might be posted to the CMM event log:
58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number %
at address %. MCx_Status contains % and MCx_Misc contains %.
In addition, the IMM message is listed in the details of this event so that you can correlate the message on the CMM with the message that appears in the system event log for the X Architecture compute node.
- An alert is also sent from the management processor on the X Architecture compute node to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number %
at address %. MCx_Status contains % and MCx_Misc contains %.
Note: If the Electronic Service Agent is enabled on the IBM Flex System Manager, the IBM Flex System Manager performs the following actions:
- Loads Lenovo Dynamic Systems Analysis (DSA) on the X Architecture compute node and runs DSA to collect service data related to the X Architecture compute node.
- Sends a notification that includes the collected service data to the Support team.
- Removes DSA from the X Architecture compute node.
- An alert is sent from the CMM to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number %
at address %. MCx_Status contains % and MCx_Misc contains %.
Note: The event from the CMM is not sent to the Support team. Only the event received from the management processor on the X Architecture compute node is sent.
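The routing described in the preceding list can be sketched as a small model. This is only an illustration of the documented flow, not an interface to the management processor, the CMM, or the IBM Flex System Manager: the names EventLog and route_pfa_alert are hypothetical, and the message text is abbreviated from the example above.

```python
from dataclasses import dataclass, field

# Hypothetical model of the documented alert flow. None of these names
# correspond to a real IMM, CMM, or IBM Flex System Manager API.

# Abbreviated form of the example message shown in this topic.
PFA_EVENT = ("58001 - The PFA Threshold limit (correctable error logging "
             "limit) has been exceeded on DIMM number % at address %.")


@dataclass
class EventLog:
    name: str
    entries: list = field(default_factory=list)

    def post(self, source: str, message: str) -> None:
        # Record which component reported the event alongside the message.
        self.entries.append((source, message))


def route_pfa_alert(message: str, esa_enabled: bool):
    sel = EventLog("system event log")
    cmm = EventLog("CMM event log")
    fsm = EventLog("IBM Flex System Manager event log")

    # The management processor and UEFI each log the event on the node.
    sel.post("management processor", message)
    sel.post("UEFI", message)

    # The management processor alerts the CMM.
    cmm.post("management processor", message)

    # The management processor also alerts the IBM Flex System Manager.
    fsm.post("management processor", message)

    # The CMM forwards its own alert to the IBM Flex System Manager.
    fsm.post("CMM", message)

    # With Electronic Service Agent enabled, only the event received from
    # the management processor is sent to Support; the CMM event is not.
    support = []
    if esa_enabled:
        support = [msg for src, msg in fsm.entries
                   if src == "management processor"]
    return sel, cmm, fsm, support
```

In this model, the IBM Flex System Manager log ends up with two entries for one underlying problem (one from the management processor and one from the CMM), while the Support team receives at most one notification.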
Alert flow (hardware released before 11/2014)
To understand how alerts flow through the Flex System chassis when an error is detected on an X Architecture compute node, consider an example in which the correctable Error Correction Code (ECC) memory error logging threshold for an X Architecture compute node is reached. This event is a predictive failure alert (PFA), which means that the X Architecture compute node continues to function, but a memory failure might occur at some point. When the management processor on the X Architecture compute node detects this problem, the following actions occur:
- The problem is logged by the management processor in the system event log on the X Architecture compute node.
The following example shows how the event might appear in the system event log:
806F050C - 2001xxxx - Memory Logging Limit Reached for DIMM_number on MemoryElementName.
UEFI also detects this problem and logs an event in the system event log. The following example shows how a UEFI diagnostic event might look in the system event log:
E.0058001 - PFA Threshold Exceeded.
- An alert is sent from the management processor on the X Architecture compute node to the CMM and posted to the event log on the CMM.
The following example shows the error that might be posted to the CMM event log:
0x77777773 - Compute_node_bay: The correctable Error Correct Code (ECC) memory logging threshold for the specified blade server was reached. The system will continue to run. Refer to the steps in the user response before replacing a DIMM.
In addition, the IMM message is listed in the details of this event so that you can correlate the message on the CMM with the message that appears in the system event log for the X Architecture compute node.
- An alert is also sent from the management processor on the X Architecture compute node to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
806F050C - 2001xxxx - Memory Logging Limit Reached for DIMM 1 on MemoryElementName.
Note: If the Electronic Service Agent is enabled on the IBM Flex System Manager, the IBM Flex System Manager performs the following actions:
- Loads Lenovo Dynamic Systems Analysis (DSA) on the X Architecture compute node and runs DSA to collect service data related to the X Architecture compute node.
- Sends a notification that includes the collected service data to the Support team.
- Removes DSA from the X Architecture compute node.
- An alert is sent from the CMM to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
0x77777773 - Compute_node_bay: The correctable Error Correct Code (ECC) memory logging threshold for the specified blade server was reached. The system will continue to run. Refer to the steps in the user response before replacing a DIMM.
Note: The event from the CMM is not sent to the Support team. Only the event received from the management processor on the X Architecture compute node is sent.
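The two alert flows differ mainly in how the same problem is identified in each log: on older hardware the management processor, UEFI, and the CMM each use a different identifier format, whereas on newer hardware the examples show the same identifier (58001) in every log. The following sketch separates the identifier from the message text and labels the format. The patterns are assumptions inferred only from the example messages in this topic, and classify_event is a hypothetical helper, not part of any Flex System tool.

```python
import re

# Patterns inferred only from the example messages in this topic.
# They are assumptions for illustration, not a documented event ID scheme.
_PATTERNS = [
    # e.g. "806F050C - 2001xxxx - Memory Logging Limit Reached ..."
    ("management processor (before 11/2014)",
     re.compile(r"^([0-9A-F]{8} - [0-9A-Za-z]{8}) - (.+)$")),
    # e.g. "E.0058001 - PFA Threshold Exceeded."
    ("UEFI (before 11/2014)",
     re.compile(r"^([A-Z]\.[0-9A-F]{7}) - (.+)$")),
    # e.g. "0x77777773 - Compute_node_bay: ..."
    ("CMM (before 11/2014)",
     re.compile(r"^(0x[0-9A-Fa-f]{8}) - (.+)$")),
    # e.g. "58001 - The PFA Threshold limit ..."
    ("common format (11/2014 or later)",
     re.compile(r"^([0-9A-F]{5}) - (.+)$")),
]


def classify_event(line: str):
    """Return (format label, identifier, message text) for one log line."""
    text = line.strip()
    for label, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return label, match.group(1), match.group(2)
    # Anything that fits none of the example formats is left unclassified.
    return "unknown", "", text
```

Because newer hardware uses one identifier across all logs, the identifier alone cannot indicate which component reported the event; on that hardware, the reporting source has to come from the log in which the entry appears.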