Events detected on an X Architecture compute node

On an X Architecture compute node, both hardware alerts and UEFI alerts are detected and sent to the Lenovo xClarity Administrator.

The following diagram shows the flow of alerts from an X Architecture compute node to the Lenovo xClarity Administrator:
Alert flow for system x compute nodes.

Alert flow (hardware released 11/2014 or later)

To understand how alerts flow through the Flex System chassis when an error is detected on an X Architecture compute node, consider an example in which the correctable Error Correction Code (ECC) memory error logging threshold for the compute node is reached. This event is a predictive failure alert (PFA): the compute node continues to function, but a memory failure could occur at some point. When the management processor on the compute node detects this problem, the following actions occur:
  1. The problem is logged by the management processor in the system event log on the X Architecture compute node.
    The following example shows how the event might appear in the system event log:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.

    UEFI also detects this problem and logs the same event in the system event log:

    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.
  2. An alert is sent from the management processor on the X Architecture compute node to the CMM and posted to the event log on the CMM.
    The following example shows the error that might be posted to the CMM event log:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.
  3. An alert is also sent from the management processor on the X Architecture compute node to the Lenovo xClarity Administrator and posted to the event log on the Lenovo xClarity Administrator.
    The following example shows the error that might be posted to the event log on the Lenovo xClarity Administrator:
    58001 - The PFA Threshold limit (correctable error logging limit) has been exceeded on DIMM number % 
    at address %. MCx_Status contains % and MCx_Misc contains %.
    Note: If call home is enabled on the Lenovo xClarity Administrator, the Lenovo xClarity Administrator sends the Support team a notification that includes the collected service data.
  4. An alert is sent from the CMM to the Lenovo xClarity Administrator. In this case, the event is ignored because the same event was sent by the X Architecture compute node.
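
The four steps above can be sketched as a small simulation. This is only an illustration of the described flow, not how the firmware or the management software is implemented: the class names, the bay label "node-bay-1", and the rule of matching duplicates on event ID plus originating node are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    event_id: str  # event identifier, for example "58001"
    source: str    # compute node on which the event was detected (hypothetical label)
    message: str


@dataclass
class Endpoint:
    """Hypothetical stand-in for any component that keeps an event log."""
    name: str
    event_log: list = field(default_factory=list)
    seen: set = field(default_factory=set)

    def receive(self, alert: Alert, dedupe: bool = False) -> bool:
        """Post the alert to the log; with dedupe=True, ignore a repeat of the same event."""
        key = (alert.event_id, alert.source)  # assumed duplicate-matching rule
        if dedupe and key in self.seen:
            return False
        self.seen.add(key)
        self.event_log.append(alert)
        return True


def simulate_alert_flow(alert: Alert):
    node_log = Endpoint("compute node system event log")
    cmm = Endpoint("CMM")
    manager = Endpoint("management software")

    # Step 1: the management processor and UEFI each log the event on the node.
    node_log.receive(alert)
    node_log.receive(alert)
    # Step 2: the management processor sends the alert to the CMM.
    cmm.receive(alert)
    # Step 3: the management processor sends the alert to the management software.
    manager.receive(alert, dedupe=True)
    # Step 4: the CMM sends the alert on; it is ignored as already received.
    forwarded_accepted = manager.receive(alert, dedupe=True)
    return node_log, cmm, manager, forwarded_accepted


alert = Alert("58001", "node-bay-1", "The PFA Threshold limit has been exceeded")
node_log, cmm, manager, forwarded_accepted = simulate_alert_flow(alert)
print(len(node_log.event_log), len(cmm.event_log), len(manager.event_log), forwarded_accepted)
# prints: 2 1 1 False
```

The management software ends up with a single entry even though the alert reached it by two paths, which matches the behavior described in step 4.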

Alert flow (hardware released before 11/2014)

To understand how alerts flow through the Flex System chassis when an error is detected on an X Architecture compute node, consider an example in which the correctable Error Correction Code (ECC) memory error logging threshold for the compute node is reached. This event is a predictive failure alert (PFA): the compute node continues to function, but a memory failure could occur at some point. When the management processor on the compute node detects this problem, the following actions occur:
  1. The problem is logged by the management processor in the system event log on the X Architecture compute node.
    The following example shows how the event might appear in the system event log:
    806F050C - 2001xxxx   - Memory Logging Limit Reached for DIMM_number on MemoryElementName.

    UEFI also detects this problem and logs its own event in the system event log. The following example shows how a UEFI diagnostic event might look in the system event log:

    E.0058001   - PFA Threshold Exceeded.
  2. An alert is sent from the management processor on the X Architecture compute node to the CMM and posted to the event log on the CMM.
    The following example shows the error that might be posted to the CMM event log:
    0x77777773 - Compute_node_bay: The correctable Error Correct Code (ECC) memory logging threshold for the specified blade server was reached. The system will continue to run. Refer to the steps in the user response before replacing a DIMM.

    In addition, the IMM message is listed in the details of this event to enable you to correlate the message on the CMM with the message that appears in the system event log for the X Architecture compute node.

  3. An alert is also sent from the management processor on the X Architecture compute node to the Lenovo xClarity Administrator and posted to the event log on the Lenovo xClarity Administrator.
    The following example shows the error that might be posted to the event log on the Lenovo xClarity Administrator:
    806F050C - 2001xxxx   - Memory Logging Limit Reached for DIMM 1 on MemoryElementName.
    Note: If call home is enabled on the Lenovo xClarity Administrator, the Lenovo xClarity Administrator sends the Support team a notification that includes the collected service data.
  4. An alert is sent from the CMM to the Lenovo xClarity Administrator. In this case, the event is ignored because the same event was sent by the X Architecture compute node.
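
On this older hardware the same condition appears under a different identifier in each log, so the IMM message carried in the CMM event details (step 2) is what links the entries. The following sketch shows one way such a correlation could be done by hand on exported log data. The dictionary layout, the "IMM message:" prefix in the details text, and the function name are assumptions for illustration only; they do not describe the actual log format of any of these components.

```python
def correlate(cmm_event: dict, node_log: list) -> list:
    """Return the node log entries whose message text appears in the CMM event details."""
    details = cmm_event["details"]
    return [entry for entry in node_log if entry["message"] in details]


# Hypothetical exported entries, modeled on the examples in the steps above.
node_log = [
    {"id": "806F050C - 2001xxxx",
     "message": "Memory Logging Limit Reached for DIMM 1 on MemoryElementName."},
    {"id": "E.0058001",
     "message": "PFA Threshold Exceeded."},
]
cmm_event = {
    "id": "0x77777773",
    # Assumed layout: the IMM message is embedded as plain text in the details.
    "details": "IMM message: Memory Logging Limit Reached for DIMM 1 on MemoryElementName.",
}

matches = correlate(cmm_event, node_log)
print([entry["id"] for entry in matches])
# prints: ['806F050C - 2001xxxx']
```

Matching on message text rather than on the event ID is deliberate here, because the CMM and the compute node use different identifiers for the same condition in this hardware generation.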