NEC Express5800/A1040b User Manual

Machine check monitoring service

Express5800/A2040b,A2020b,
A2010b,A1040b
Machine Check Monitoring Service
User's Guide
(Release 1.5)
May 2015
NEC Corporation
© 2015 NEC Corporation
855-900937

Summary of Contents for NEC Express5800/A1040b

  • Page 1 Express5800/A2040b,A2020b, A2010b,A1040b Machine Check Monitoring Service User's Guide (Release 1.5) May 2015 NEC Corporation © 2015 NEC Corporation 855-900937...
  • Page 2 Notes on Using This Manual
      • No part of this manual may be reproduced in any form without the prior written permission of NEC Corporation.
      • The contents of this manual may be revised without prior notice.
      • The contents of this manual shall not be copied or altered without the prior written permission of NEC Corporation.
  • Page 3: Table Of Contents

    Contents
      • Introduction (p. 1)
      • Overview (p. 1)
      • Operating Environment (p. 1)
      • Terminology (p. 2)
      • Access Limitation (p. 2)
      • Features of Machine Check Monitoring Service (p. 3)
      • Features of Machine Check Monitoring Service (p. 3)
      • System Configuration of Machine Check Monitoring Service (p. 3)
      • Functional Drawing of Machine Check Monitoring Service ...
  • Page 4
      • Logging Destination (p. 18)
      • Output Format (p. 18)
      • Command Reference (p. 19)
      • Show CPU / Memory Status (p. 19)
      • Messages (p. 22)
      • On-screen Message (p. 22)
      • 6.1.1 On-screen messages output from mcemonitor (p. 22)
      • 6.1.2 On-screen messages output from capmonitor (p. 24)
      • 6.1.3 On-screen messages output from acpi_call ...
  • Page 5: Introduction

    Note: Refer to "Capacity Optimization (COPT) User's Guide" for details of the Core Online feature. Core Offline, Core Online, and Page Offline are not supported on Express5800/A1040b. Operating Environment: Machine Check Monitoring Service requires the operating environment shown below. Table 1-1 Operating Environment: Hardware...
  • Page 6: Terminology

    Terminology: The terms used in Machine Check Monitoring Service are shown below. Table 1-2 Terminology (Term: Description). mcemonitor: Software that provides enhanced RAS features. mcemonitor receives logs from the mce mechanism of the Linux kernel, analyzes them, and monitors fault occurrence in cooperation with the system. mcemonitor instructs the kernel to perform Core Offline and Page Offline.
  • Page 7: Features Of Machine Check Monitoring Service

    Note: Refer to "Capacity Optimization (COPT) User's Guide" for details of the Core Online feature. Express5800/A1040b does not support Core Offline, Core Online, and Page Offline. System Configuration of Machine Check Monitoring Service: The system configuration of Machine Check Monitoring Service is shown below.
  • Page 8: Functional Drawing Of Machine Check Monitoring Service

    Functional Drawing of Machine Check Monitoring Service: The functional drawing of Machine Check Monitoring Service and its associated components is shown below. [Figure 2-2 Functional drawing. Labels: mcemonitor (log), syslog, capmonitor (log), mcemonitor, capmonitor, acpi_call, kernel, Firmware, Hardware, Fault, Memory...]
  • Page 9: Features Of Machine Check Monitoring Service

    When CPU Offline succeeds, the relevant CPU is disabled for the OS and software, so the number of available CPUs is reduced. Note: Express5800/A1040b does not support the Core Offline feature. mcemonitor notifies the firmware of the result of CPU Offline. When CPU Offline succeeds and the server has a spare CPU, the spare CPU is added automatically (Core Online feature).
  • Page 10: Installation And Configuration

    Log in to the target machine as the root user. The most recent versions of the RPM packages are available for download from the following website. http://www.58support.nec.co.jp/global/download/index.html Install the acpi_call RPM package of Machine Check Monitoring Service using the rpm command. # rpm -ivh mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64.rpm Preparing...
  • Page 11 Configure /etc/sysconfig/kdump. Creation of the initrd file for kdump may fail if an external module unnecessary for dump collection is incorporated. To prevent this, add MKDUMPRD_ARGS="--allow-missing". Sample configuration of /etc/sysconfig/kdump: MKDUMPRD_ARGS="--allow-missing" With this configuration, the following warning may appear when the kdump service is started. This message indicates that the external module was not incorporated, and it is not a problem.
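The kdump setting described on this page can be applied by hand in an editor. As a minimal sketch of the same edit done idempotently, the helper below appends the line only when no MKDUMPRD_ARGS entry exists yet. The function name ensure_allow_missing is hypothetical, and the choice to stop rather than rewrite an existing, different MKDUMPRD_ARGS value is an assumption made for safety, not something the manual specifies.

```shell
# Sketch: make sure a kdump sysconfig file carries MKDUMPRD_ARGS="--allow-missing".
# ensure_allow_missing is a hypothetical helper name, not part of the product.
ensure_allow_missing() {
    conf="$1"   # normally /etc/sysconfig/kdump; pass a copy to try it out
    if grep -q '^MKDUMPRD_ARGS=.*--allow-missing' "$conf" 2>/dev/null; then
        return 0    # already configured, nothing to do
    fi
    if grep -q '^MKDUMPRD_ARGS=' "$conf" 2>/dev/null; then
        # A different value is already set; leave it for manual review.
        echo "MKDUMPRD_ARGS is set without --allow-missing in $conf" >&2
        return 1
    fi
    # No entry yet: append the line given in the sample configuration.
    echo 'MKDUMPRD_ARGS="--allow-missing"' >> "$conf"
}
```

Trying the helper on a copy of the file first (for example a copy under /tmp) keeps the real configuration untouched until the result has been reviewed.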
  • Page 12: Installing Capmonitor

    Check if capmonitor is started normally. If the following is displayed, capmonitor is started normally. # ps aux | grep monitor root 6044 0.0 0.0 4068 324 ? Ss 06:18 0:00 /opt/nec/capmonitor/capmonitor Installation of package may not complete if the following message is displayed. Repeat from Step 3 according to "Solution".
  • Page 13: Installing Mcemonitor

    Check if mcemonitor is started normally. If the following is displayed, mcemonitor is started normally. # ps aux | grep monitor root 6078 0.0 0.0 4076 328 ? Ss 06:19 0:00 /opt/nec/mcemonitor/mcemonitor Installation of package may not complete if the following message is displayed. Repeat from Step 3 according to "Solution".
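The `ps aux | grep monitor` check used for capmonitor and mcemonitor can also list the grep process itself, so the result has to be read by eye. A stricter variant, sketched below, compares the command column of the `ps aux` output with the daemon path exactly. The helper name is_running is hypothetical; the sketch assumes the daemon appears with no arguments, as in the sample output on these pages.

```shell
# is_running: hypothetical helper. Reads `ps aux` output on stdin and succeeds
# only if column 11 (the start of the command) is exactly the given path.
is_running() {
    awk -v path="$1" '$11 == path { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage on the target machine:
#   ps aux | is_running /opt/nec/capmonitor/capmonitor && echo "capmonitor is running"
#   ps aux | is_running /opt/nec/mcemonitor/mcemonitor && echo "mcemonitor is running"
```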
  • Page 14: Upgrade

    Copy RPM to desired directory in target machine. The most recent version of RPM is available for download from the following website. http://www.58support.nec.co.jp/global/download/index.html Upgrade acpi_call RPM package of Machine Check Monitoring Service using rpm command. # rpm -Uvh mcl-acpicall-2.4-3.02.2.6.32.504.23.4.el6.x86_64.rpm Preparing...
  • Page 15: Upgrading Capmonitor

    The following is displayed when upgrade completes successfully. # rpm -qa | grep capmonitor mcl-capmonitor-2.4-2.13.el6.x86_64 Check if capmonitor is started normally. If the following is displayed, capmonitor is started normally. # ps aux | grep monitor root 4141 0.0 0.0 4068 352 ? Ss 13:54 0:00 /opt/nec/capmonitor/capmonitor...
  • Page 16: Upgrading Mcemonitor

    The following is displayed when upgrade completes successfully. # rpm -qa | grep mcemonitor mcl-mcemonitor1-2.4-2.03.el6.x86_64 Check if mcemonitor is started normally. If the following is displayed, mcemonitor is started normally. # ps aux | grep monitor root 4189 0.0 0.0 4076 364 ? Ss 13:56 0:00 /opt/nec/mcemonitor/mcemonitor...
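After an upgrade, the `rpm -qa | grep ...` output has to be compared with the package version that was just installed. The sketch below prints the first installed package whose name starts with `mcl-` followed by the component name, so the result can be compared in a script. The helper name installed_package is hypothetical, and the `mcl-` prefix is taken from the package names shown in this guide.

```shell
# installed_package: hypothetical helper. Reads `rpm -qa` output on stdin and
# prints the first package whose name starts with "mcl-<component>".
installed_package() {
    awk -v prefix="mcl-$1" 'index($0, prefix) == 1 { print; exit }'
}

# Usage on the target machine:
#   rpm -qa | installed_package capmonitor
#   rpm -qa | installed_package mcemonitor
```

An empty result means no matching package is installed.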
  • Page 17: Configuration

      • /opt/nec/capmonitor/conf/capmonitor.conf
      • /opt/nec/mcemonitor/conf/mcemonitor.conf
    3.3.1 capmonitor configuration file: The capmonitor configuration file /opt/nec/capmonitor/conf/capmonitor.conf is used for configuration related to CPU Core Online. For details of the capmonitor configuration file, refer to "Capacity Optimization (COPT) User's Guide".
  • Page 18: Disabling Cmci

    Table 3-1 mcemonitor configuration file (core-ce-action)
      Setting in mcemonitor.conf: Description
      • core-ce-action = soft: Collects log and makes CPU Core Offline if the CPU error count exceeds the threshold value. (Default)
      • core-ce-action = account: Collects log but does not make CPU Core Offline even if the CPU error count exceeds the threshold value.
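To see which core-ce-action value is in effect, the configuration file can be read as sketched below. The helper name core_ce_action is hypothetical. The `key = value` layout and the default of soft come from Table 3-1; treating a missing or unreadable file as the default, and taking the last matching line when the key appears more than once, are assumptions of this sketch.

```shell
# core_ce_action: hypothetical helper. Prints the core-ce-action value found in
# the given configuration file, or "soft" (the documented default) if unset.
core_ce_action() {
    conf="$1"   # normally /opt/nec/mcemonitor/conf/mcemonitor.conf
    [ -r "$conf" ] || { echo soft; return 0; }
    awk -F= '
        /^[ \t]*core-ce-action[ \t]*=/ {
            value = $2
            gsub(/[ \t\r]/, "", value)   # strip whitespace around the value
        }
        END { print (value == "" ? "soft" : value) }
    ' "$conf"
}

# Usage on the target machine:
#   core_ce_action /opt/nec/mcemonitor/conf/mcemonitor.conf
```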
  • Page 19: Script File To Be Executed After Core Offline

    Place the script /opt/nec/capmonitor/script/03kdump.sh under the directory /opt/nec/capmonitor/script/cpu/offline.d to restart kdump as an alternative to the kdump that was disabled in 3.3.4. If you use software that requires a reboot after Core Offline (the number of logical processors is reduced), create a script file containing the necessary processing and store it under the directory /opt/nec/capmonitor/script/cpu/offline.d.
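As a rough illustration of a script that could be stored under offline.d, the sketch below only records how many logical processors remain online. Everything specific in it is an assumption: the file name 50record-cpus.sh, the log path, and the function name record_online_cpus are hypothetical, and the guide excerpt does not say which arguments, if any, capmonitor passes to these scripts. Real processing (for example, restarting an application) would replace the body.

```shell
#!/bin/sh
# Hypothetical file: /opt/nec/capmonitor/script/cpu/offline.d/50record-cpus.sh
# Sketch only: records how many logical processors remain online.
record_online_cpus() {
    log="$1"    # destination file; the path in the usage line is an assumption
    count=$(getconf _NPROCESSORS_ONLN)
    echo "$(date '+%Y-%m-%d %H:%M:%S') online logical processors: $count" >> "$log"
}

# A real script would end with a call such as:
#   record_online_cpus /var/log/core-offline-hook.log
```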
  • Page 20: Uninstallation

    Uninstallation: Use the rpm command to uninstall Machine Check Monitoring Service. Uninstall the packages mcemonitor, capmonitor, and acpi_call, in that order. 3.4.1 Uninstalling acpi_call: Log in to the target machine as the root user. Uninstall the acpi_call RPM package of Machine Check Monitoring Service using the rpm command. # rpm -e mcl-acpicall-2.4-3.01.2.6.32.504.23.4.el6.x86_64 Note: mcemonitor and capmonitor must be uninstalled before uninstalling acpi_call.
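Because the uninstall order matters (mcemonitor, then capmonitor, then acpi_call), it can help to print the commands before running anything. The sketch below is a dry run: it reads the installed package list and prints `rpm -e` lines in the required order without executing them. The helper name uninstall_plan is hypothetical, and matching on the `mcl-` name prefix is an assumption based on the package names shown in this guide.

```shell
# uninstall_plan: hypothetical helper. Reads `rpm -qa` output on stdin and
# prints rpm -e commands in the order mcemonitor, capmonitor, acpi_call.
# Nothing is uninstalled; the commands are only printed for review.
uninstall_plan() {
    pkgs=$(cat)
    for component in mcemonitor capmonitor acpicall; do
        printf '%s\n' "$pkgs" |
            awk -v prefix="mcl-$component" 'index($0, prefix) == 1 { print "rpm -e " $0 }'
    done
}

# Usage on the target machine:
#   rpm -qa | uninstall_plan
```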
  • Page 21: Uninstalling Capmonitor

    Uninstall capmonitor RPM package of Machine Check Monitoring Service using rpm command. # rpm -e mcl-capmonitor-2.4-2.12.el6.x86_64 3834 /opt/nec/capmonitor/capmonitor Stopping capmonitor[ OK ] Confirm that capmonitor RPM package of Machine Check Monitoring Service is uninstalled correctly. Uninstallation completes successfully if "capmonitor" is not displayed as shown below.
  • Page 22: Log

    Machine Check Monitoring Service outputs logs to the following destinations:
      • /var/opt/nec/mcemonitor (Fault monitoring log)
      • /var/opt/nec/capmonitor (Core Offline log, including logs related to Core Online of COPT)
    Output Format: Shown below is an example of the log that Machine Check Monitoring Service outputs.
  • Page 23: Command Reference

    Command Reference Show CPU / Memory Status You can view CPU fault information and offline state of CPU/Memory page by using mcemonitor command. The following shows command options. Name mcemonitor – Outputs state of CPU / Memory page to standard output. Syntax mcemonitor [ --version ] mcemonitor [ --client | --client=core | --client=page ]...
  • Page 24 Display format # /opt/nec/mcemonitor/mcemonitor --client Per page status corrected error over threshold: 100000: offline-failed 10000000: offline 20000000: offline Per page status uncorrected error: 1abc40000 1abc90000 CPU errors CPU1/core2 corrected errors: 1 total uncorrected errors: 0 total CPU4/core1 corrected errors: 10 total...
  • Page 25 Table 5-1 mcemonitor command
      Item name: Description
      • "Per page status corrected error over threshold:": Shows result of Memory Page Offline.
      • "100000: offline-failed": Indicates that offlining failed for 0x10000 page of memory address.
      • "10000000: offline": Indicates 0x10000 page of memory address was offlined...
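The page status lines described in Table 5-1 can be filtered to find pages whose offlining failed. The sketch below scans the command output for an address followed by the state offline-failed and prints only the address. The helper name offline_failed_pages is hypothetical, and the sketch assumes the `address: state` form shown in the display format; the exact line layout of the real output is not visible in this excerpt.

```shell
# offline_failed_pages: hypothetical helper. Reads `mcemonitor --client` output
# on stdin and prints each page address whose state is "offline-failed".
offline_failed_pages() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($(i + 1) == "offline-failed") {
                addr = $i
                sub(/:$/, "", addr)   # drop the trailing colon after the address
                print addr
            }
        }
    }'
}

# Usage on the target machine:
#   /opt/nec/mcemonitor/mcemonitor --client | offline_failed_pages
```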
  • Page 26: Messages

    Table 6-1 On-screen messages output from mcemonitor
      • Message: Cannot open logfile /var/opt/nec/mcemonitor mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
        Meaning: Failed to open log file, and mcemonitor exited. mcemonitor is restarted by cron.
        Action: mcemonitor is restarted automatically by cron.
  • Page 27 Reinstall mcemonitor. md was not found. cmd was not found, thus failed to start mcemonitor. /var/opt/nec was not found. /var/opt/nec was not found, thus Reinstall mcemonitor. failed to start mcemonitor. Unknown mcemonitor mode xx. mcemonitor is not in daemon Specify daemon for Valid daemon mode.
  • Page 28: On-Screen Messages Output From Capmonitor

    Table 6-2 On-screen messages output from capmonitor
      • Message: Cannot open logfile /var/opt/nec/capmonitor capmonitor exited due to a system error. capmonitor will be restarted by cron.
        Meaning: Failed to open log file, and capmonitor exited. capmonitor is restarted by cron.
        Action: capmonitor is restarted automatically by cron.
  • Page 29 Reinstall capmonitor. md was not found. d was not found, thus failed to start capmonitor. /var/opt/nec was not found. /var/opt/nec was not found, thus Reinstall capmonitor. failed to start capmonitor. Unknown capmonitor mode xx. Unknown mode. Only daemon Specify daemon for Valid daemon mode is valid.
  • Page 30: On-Screen Messages Output From Acpi_Call

    The following table shows the on-screen messages that acpi_call outputs. Table 6-3 On-screen messages output from acpi_call
      • Message: insmod: can't read '/opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko': No such file or directory
        Meaning: Failed to load acpi_capcall.ko because /opt/nec/acpicall/proc/acpi/capcall/acpi_capcall.ko was not found.
        Action: Reinstall acpi_call.
      • Message: insmod: can't read ...
        Meaning: Failed to load acpi_clpcall.ko ...
        Action: Reinstall acpi_call.
  • Page 31: Other On-Screen Messages

    6.1.4 Other on-screen messages: The following table shows the on-screen messages related to Machine Check Monitoring Service. Table 6-4 Other on-screen messages
      • Message: Disabling ondemand cpu frequency scaling: /etc/rc0.d/K99cpuspeed: line 288: ...
        Meaning: cpuspeed end processing was not executed to CPU xx because CPUxx is offlined.
        Action: It is not a problem if cpuspeed end processing ...
  • Page 32: Operation Log Messages

    Operation Log Messages 6.3.1 Operation log messages output from mcemonitor The following table shows operation log message (related to fault monitoring) that mcemonitor outputs. Table 6-6 Operation log messages output from mcemonitor Message Meaning Action Operation Log Error: 1003 <error cause> An error occurred on Restart mcemonitor system-related function, and...
  • Page 33 (Message / Meaning / Action)
      • Message: Error: 1033 <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
        Meaning: An error occurred on system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
        Action: mcemonitor is restarted automatically by cron.
      • Message: Error: 1034 ...
        Meaning: An error occurred on system-related function, but ...
        Action: mcemonitor continues ...
  • Page 34 Message Meaning Action Warning: 1046 Failed to read memory-ce-action mcemonitor will run with and core-ce-action of default values of memory-ce-action and mcemonitor.conf. memory-ce-action and core-ce-action values are core-ce-action. unspecified in mcemonitor.conf. mcemonitor will run with default mcemonitor will run with default values of memory-ce-action and Review the setting values of value.
  • Page 35
      • Message: Error: 5036 <error cause> mcemonitor will continue to be run safely. Please retry operation.
        Meaning: An error occurred on system-related function, but mcemonitor continue operation. Run the command again.
        Action: ... /opt/nec/mcemonitor/mcemonitor --client again.
  • Page 36 (Message / Meaning / Action)
      • Message: Cannot open /dev/mcelog. <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
        Meaning: An error occurred on system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
        Action: mcemonitor is restarted automatically by cron.
      • Message: MCE_GET_RECORD_LEN <error ...
        Meaning: An error occurred on ...
        Action: Restart mcemonitor ...
  • Page 37 (Message / Meaning / Action)
      • Message: cannot bind to NETLINK socket <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron...
        Meaning: An error occurred on system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
        Action: mcemonitor is restarted automatically by cron.
  • Page 38 (Message / Meaning / Action)
      • Message: cannot open listening socket <error cause> mcemonitor exited due to a system error. mcemonitor will be restarted by cron.
        Meaning: An error occurred on system-related function, and mcemonitor exited. mcemonitor is restarted by cron.
        Action: mcemonitor is restarted automatically by cron.
      • Message: cannot bind to client unix socket ...
        Meaning: An error occurred on ...
        Action: Restart mcemonitor ...
  • Page 39: Operation Log Messages Output From Capmonitor

    6.3.2 Operation log messages output from capmonitor: The following table shows the operation log messages (Core Offline log, including logs related to Core Online of COPT) that capmonitor outputs. Table 6-7 Operation log messages output from capmonitor (Message / Meaning / Solution). Operation log Error: 1102 <error cause>...
  • Page 40 Message Meaning Solution Warning: 1111 <error type> Failed to read capmonitor will run with cpu-hotadd-timeout of default value of cpu-hotadd-timeout value is capmonitor.conf. cpu-hotadd-timeout. unspecified in capmonitor.conf. capmonitor will run with default capmonitor will run with default Review the setting value of value.
  • Page 41
      • Message: Error: 6100 <error cause> capmonitor will continue to be run safely. Please retry operation.
        Meaning: An error occurred on system-related function but capmonitor continue operation. Run the command again.
        Solution: ... /opt/nec/capmonitor/capmonitor --client=addtime again.
  • Page 42 (Message / Meaning / Solution)
      • Message: Cannot open pidfile <error cause> capmonitor exited due to a system error. capmonitor will be restarted by cron.
        Meaning: An error occurred on system-related function and capmonitor terminated. capmonitor is restarted by cron.
        Solution: capmonitor is restarted automatically by cron...
  • Page 43: Restrictions And Precautions

    Restrictions and Precautions. Manual Onlining of a Core-Offlined CPU: Do not manually online a CPU that was core offlined. When the number of correctable errors exceeds the threshold value, Machine Check Monitoring Service offlines the failed core. The core-offlined CPU cannot be used by the OS. You can online the core from the OS (*); however, the offlined CPU is failing.
  • Page 44 (Release 1.5) NEC Corporation 7-1 Shiba 5-Chome, Minato-Ku Tokyo 108-8001, Japan TEL (03) 3454-1111 (Main phone number) © NEC Corporation 2015 No part of this manual may be reproduced in any form without the prior written permission of NEC Corporation.

This manual is also suitable for:

Express5800/A2010b, Express5800/A2020b, Express5800/A2040b