Events detected on an X Architecture compute node

On an X Architecture compute node, both hardware alerts and UEFI alerts are detected and sent to the IBM Flex System Manager.

The following diagram shows the flow of alerts from an X Architecture compute node to the IBM Flex System Manager:
[Figure: Alert flow for System x compute nodes]

Alert flow (hardware released 11/2014 or later)

To understand how alerts flow through the Flex System chassis when an error is detected on an X Architecture compute node, consider an example in which the correctable Error Correction Code (ECC) memory error logging threshold for an X Architecture compute node is reached. This event is a predictive failure alert (PFA): the X Architecture compute node continues to function, but the memory might fail at some point in the future. When the management processor on the X Architecture compute node detects this problem, the following actions occur:
  1. The problem is logged by the management processor in the system event log on the X Architecture compute node.
    The following example shows how the event might appear in the system event log:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.

    UEFI also detects this problem and logs the same event in the system event log:

    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.
  2. An alert is sent from the management processor on the X Architecture compute node to the CMM and posted to the event log on the CMM.
    The following example shows the error that might be posted to the CMM event log:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.

    In addition, the IMM message is listed in the details of this event to enable you to correlate the message on the CMM with the message that appears in the system event log for the X Architecture compute node.

  3. An alert is also sent from the management processor on the X Architecture compute node to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
    The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.
    Note: If the Electronic Service Agent is enabled on the IBM Flex System Manager, the IBM Flex System Manager performs the following actions:
    1. Loads Lenovo Dynamic Systems Analysis (DSA) on the X Architecture compute node and runs DSA to collect service data related to the X Architecture compute node.
    2. Sends a notification to the Support team, which includes the collected service data.
    3. Removes DSA from the X Architecture compute node.
  4. An alert is sent from the CMM to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
    The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.
    Note: The event from the CMM is not sent to the Support team. Only the event received from the management processor on the X Architecture compute node is sent.
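
The flow above means that the IBM Flex System Manager receives the same event twice: once directly from the management processor and once by way of the CMM. The following Python sketch illustrates one way to group the two copies by event ID and to select the copy that is eligible for the Support team when the Electronic Service Agent is enabled. It is an illustration only, not an IBM Flex System Manager interface. The `Alert` class, the `"IMM"` and `"CMM"` source labels, the function names, and the assumed line layout (`<event ID> - <message>`) are all hypothetical and are based only on the sample messages shown above.

```python
import re
from dataclasses import dataclass

# Assumed line layout, taken from the sample messages: "<event ID> - <message>".
EVENT_LINE = re.compile(r"^\s*(?P<event_id>[0-9A-Fa-f]+)\s+-\s+(?P<message>.+)$")


@dataclass(frozen=True)
class Alert:
    event_id: str
    source: str  # hypothetical labels: "IMM" (management processor) or "CMM"
    message: str


def parse_event(line: str, source: str) -> Alert:
    """Split one log line into its event ID and message text."""
    match = EVENT_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized event line: {line!r}")
    return Alert(match["event_id"], source, match["message"].strip())


def correlate(alerts):
    """Group alerts by event ID so that copies of one event sit together."""
    grouped = {}
    for alert in alerts:
        grouped.setdefault(alert.event_id, []).append(alert)
    return grouped


def forwarded_to_support(alerts, esa_enabled: bool):
    """Return the alerts eligible for the Support team.

    Only events received from the management processor qualify, and only
    when the Electronic Service Agent is enabled.
    """
    if not esa_enabled:
        return []
    return [alert for alert in alerts if alert.source == "IMM"]


sample = (
    "58001 - The PFA Threshold limit (correctable error logging limit) "
    "has been exceeded on DIMM number % at address %. "
    "MCx_Status contains % and MCx_Misc contains %."
)
received = [parse_event(sample, "IMM"), parse_event(sample, "CMM")]
grouped = correlate(received)
to_support = forwarded_to_support(received, esa_enabled=True)
```

In this sketch, `grouped` holds both copies under the single key `58001`, and `to_support` contains only the copy that came from the management processor.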

Alert flow (hardware released before 11/2014)

To understand how alerts flow through the Flex System chassis when an error is detected on an X Architecture compute node, consider an example in which the correctable Error Correction Code (ECC) memory error logging threshold for an X Architecture compute node is reached. This event is a predictive failure alert (PFA): the X Architecture compute node continues to function, but the memory might fail at some point in the future. When the management processor on the X Architecture compute node detects this problem, the following actions occur:
  1. The problem is logged by the management processor in the system event log on the X Architecture compute node.
    The following example shows how the event might appear in the system event log:
    806F050C - 2001xxxx   - Memory Logging Limit Reached for DIMM_number on MemoryElementName.

    UEFI also detects this problem and logs an event in the system event log. The following example shows how a UEFI diagnostic event might look in the system event log:

    E.0058001   - PFA Threshold Exceeded.
  2. An alert is sent from the management processor on the X Architecture compute node to the CMM and posted to the event log on the CMM.
    The following example shows the error that might be posted to the CMM event log:
    0x77777773 - Compute_node_bay: The correctable Error Correct Code (ECC) memory logging threshold for the specified blade server was reached. The system will continue to run. Refer to the steps in the user response before replacing a DIMM.

    In addition, the IMM message is listed in the details of this event to enable you to correlate the message on the CMM with the message that appears in the system event log for the X Architecture compute node.

  3. An alert is also sent from the management processor on the X Architecture compute node to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
    The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
    806F050C - 2001xxxx   - Memory Logging Limit Reached for DIMM 1 on MemoryElementName.
    Note: If the Electronic Service Agent is enabled on the IBM Flex System Manager, the IBM Flex System Manager performs the following actions:
    1. Loads Lenovo Dynamic Systems Analysis (DSA) on the X Architecture compute node and runs DSA to collect service data related to the X Architecture compute node.
    2. Sends a notification to the Support team, which includes the collected service data.
    3. Removes DSA from the X Architecture compute node.
  4. An alert is sent from the CMM to the IBM Flex System Manager and posted to the event log on the IBM Flex System Manager.
    The following example shows the error that might be posted to the event log on the IBM Flex System Manager:
    0x77777773 - Compute_node_bay: The correctable Error Correct Code (ECC) memory logging threshold for the specified blade server was reached. The system will continue to run. Refer to the steps in the user response before replacing a DIMM.
    Note: The event from the CMM is not sent to the Support team. Only the event received from the management processor on the X Architecture compute node is sent.
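
On hardware released before 11/2014, the same condition appears under a different identifier in each log, so matching on a single event ID does not work. The following Python sketch shows one possible way to recognize that the three sample messages describe the same condition. The lookup table contains only the identifiers that appear in the examples above and is not a complete event catalog. The condition label `ecc-correctable-threshold`, the origin labels, and the function names are hypothetical.

```python
# Hypothetical label for the condition described in this example.
CONDITION = "ecc-correctable-threshold"

# Identifier prefixes copied from the sample messages above, mapped to
# (origin of the message, condition label). Not a complete catalog.
KNOWN_PREFIXES = {
    "806F050C": ("management processor", CONDITION),
    "E.0058001": ("UEFI", CONDITION),
    "0x77777773": ("CMM", CONDITION),
}


def classify(line: str):
    """Return (origin, condition) for a known sample message, else None."""
    text = line.strip()
    for prefix, result in KNOWN_PREFIXES.items():
        if text.startswith(prefix):
            return result
    return None


def same_condition(lines) -> bool:
    """True if every line is recognized and all map to one condition."""
    results = [classify(line) for line in lines]
    if not results or any(result is None for result in results):
        return False
    return len({condition for _, condition in results}) == 1


samples = [
    "806F050C - 2001xxxx   - Memory Logging Limit Reached for DIMM 1 on MemoryElementName.",
    "E.0058001   - PFA Threshold Exceeded.",
    "0x77777773 - Compute_node_bay: The correctable Error Correct Code (ECC) "
    "memory logging threshold for the specified blade server was reached.",
]
origins = [classify(line)[0] for line in samples]
related = same_condition(samples)
```

Matching on a prefix rather than on the whole message keeps the sketch tolerant of the variable parts of each message, such as the DIMM number and the compute node bay.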