IBM Power AC922 8335-GTW Handbook

Problem analysis, system parts, and locations

IBM Power Systems: Problem analysis, system parts, and locations for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX

Summary of Contents for IBM Power AC922 8335-GTW

  • Page 1 Power Systems Problem analysis, system parts, and locations for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX...
  • Page 2 Note: Before using this information and the product it supports, read the information in “Safety notices” on page v, “Notices” on page 45, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823...
  • Page 3: Table Of Contents

    Finding parts and locations (25); 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX locations (25); 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX parts (31); Notices (45); Accessibility features for IBM Power Systems servers (46); Privacy policy considerations (47); Trademarks (47); Electronic emission notices (47); Class A Notices...
  • Page 5: Safety Notices

    The product is not suitable for use at visual display workstations as defined in § 2 of the Bildschirmarbeitsverordnung (the German ordinance on work with visual display units). Laser safety information: IBM servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs. Laser compliance: IBM servers may be installed inside or outside of an IT equipment rack.
  • Page 6 – For racks with a DC power distribution panel (PDP), connect the customer’s DC power source to the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power return wiring. • Connect any equipment that will be attached to this product to properly wired outlets. •...
  • Page 7 • Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices. In addition, do not lean on rack mounted devices and do not use them to stabilize your body position (for example, when working from a ladder). •...
  • Page 8 CAUTION: Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building. • Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet.
  • Page 9 DANGER: Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices. In addition, do not lean on rack-mounted devices and do not use them to stabilize your body position (for example, when working from a ladder). Stability hazard: •...
  • Page 10 DANGER: Multiple power cords. The product might be equipped with multiple AC power cords or multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and power cables. (L003) (L007) CAUTION: A hot surface nearby. (L007)
  • Page 11 Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local regulations. In the United States, IBM has a process for the collection of this battery. For information, call 1-800-426-4333. Have the IBM part number for the battery unit available when you call.
  • Page 12 • LIFT TOOL intended for use to assist, lift, install, remove units (load) up into rack elevations. It is not to be used loaded transporting over major ramps nor as a replacement for such designated tools like pallet jacks, walkies, fork trucks and such related relocation practices. When this is not practicable, specially trained persons or services must be used (for instance, riggers or movers).
  • Page 13 Freewheeling will cause uneven cable wrapping around winch drum, damage cable, and may cause serious injury. • This TOOL must be maintained correctly for IBM Service personnel to use it. IBM shall inspect condition and verify maintenance history before operation. Personnel reserve the right not to use TOOL if inadequate.
  • Page 15: Beginning Troubleshooting And Problem Analysis

    Learn how to identify the correct problem analysis procedure to perform. About this task To determine the correct problem analysis procedure to perform, complete the following steps: Procedure 1. After you apply power to the system, are the power supply LEDs green? © Copyright IBM Corp. 2017, 2019...
  • Page 16: Resolving A Bmc Access Problem

    Yes: Continue with the next step. No: Go to “Resolving a power problem” on page 3. 2. Can you access the baseboard management controller (BMC) across the network? Yes: Continue with the next step. No: Go to “Resolving a BMC access problem” on page 2. 3.
  • Page 17: Resolving A Power Problem

    Note: If the IP address setting is incorrect, go to Configuring the BMC IP address. If the MAC address is 00:00:00:00:00:00, go to “Contacting IBM service and support” on page 24. 5. Complete the following actions: a. Power on to the Petitboot menu.
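The MAC address check in the note above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `is_unset_mac` is hypothetical, and how the MAC address is read from the BMC is not shown in this excerpt, so the caller is assumed to already have it as a string.

```python
def is_unset_mac(mac: str) -> bool:
    """Return True when every octet of the MAC address is zero.

    Hypothetical helper: an all-zero address (00:00:00:00:00:00) is the
    condition for which the handbook says to contact IBM service and support.
    """
    octets = mac.strip().replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    # int(..., 16) also rejects octets that are not valid hexadecimal.
    return all(int(octet, 16) == 0 for octet in octets)


if __name__ == "__main__":
    print(is_unset_mac("00:00:00:00:00:00"))
```

A result of `True` corresponds to the case in which the note directs you to “Contacting IBM service and support”.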
  • Page 18: Resolving A System Firmware Boot Failure

    Yes: Continue with the next step. No: Go to “Contacting IBM service and support” on page 24. This ends the procedure. 5. Perform the following actions, one at a time, until the problem is resolved: a. Resolve any serviceable alerts that are in the event log. Go to Resolving a hardware problem.
  • Page 19: Resolving A Vga Monitor Problem

    Yes: Continue with the next step. No: Continue with step 4. 3. Complete the following actions: a. Update the system firmware. For instructions, see Getting fixes. b. Check the system event logs. For instructions, see “Identifying a service action by using system event logs”...
  • Page 20: Resolving An Operating System Boot Failure

    d) Verify that the system is powered on by activating a serial over LAN (SOL) session through the baseboard management controller (BMC). If the system is not active, go to “Resolving a system firmware boot failure” on page 4. e) Replace the system backplane. •...
  • Page 21: Resolving A Hardware Problem

    6. Was a service action identified? Yes: Continue with the next step. No: Go to “Collecting diagnostic data” on page 23. Then, go to “Contacting IBM service and support” on page 24. This ends the procedure. 7. Did the service action fix the problem? Yes: This ends the procedure.
  • Page 22 Then Continue with the next step. 2. To identify the correct service procedure to perform by using operating system log information, complete the following steps: a) Log in as the root user. b) To display the operating system logs, type dmesg and press Enter. 3.
  • Page 23: Resolving A Network Adapter Problem

    Table 1. Resource names, examples, and service procedures for different types of operating system logs (continued). Columns: Resource name, Example of a log requiring a service action, Type of problem, Service procedure. Resource name: PHB#xxx, where xxx is the PHB number. Type of problem: Detected error on PCIe bus or adapter. Service procedure: Resolve any device driver errors that are...
  • Page 24: Resolving A Graphics Processing Unit Problem

    If the network adapter is functioning again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters again one at a time until all adapters function properly.
  • Page 25 Table 3. GPU problems and service actions for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX. Problem: System unable to find GPU. Service action: 1. Verify that the GPU is properly seated. 2. Verify that the drivers for the GPU are installed. 3.
  • Page 26 4. Does NPU chip 1 appear in the fence error log entry? • Yes: Continue with the next step. • No: Go to “Contacting IBM service and support” on page 24. This ends the procedure. 5. Replace the following items, one at a time, until the problem is resolved: Note: Go to “8335-GTC, 8335-GTG, 8335-GTH,...
  • Page 27: Resolving A Storage Device Problem

    Table 3. GPU problems and service actions for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX (continued). Problem: GPU stops working suddenly. Service action: 1. If the system was recently installed, moved, serviced, or upgraded, verify that the GPU is seated properly. 2.
  • Page 28: Resolving An Nvme Flash Adapter Problem

    Table 4. Storage device problems and service actions (continued). Problem: Drive stops working suddenly. Service action: 1. Verify that all internal cables are properly seated and are not physically damaged. 2. Check the system logs to verify whether the system detected a problem. 3.
  • Page 29: Identifying The Location Of The Pcie Adapter By Using The Slot Number

    Flash adapter again. 2. Ensure that the latest I/O adapter firmware is installed. For instructions, see Getting firmware fixes for IBM I/O adapters by using Fix Central. 3. Ensure that you have the latest device driver service updates by installing the latest Linux distribution fixes.
  • Page 30: Identifying The Location Of The Gpu

    Yes: Continue with the next step. No: Go to “Collecting diagnostic data” on page 23. Then, go to “Contacting IBM service and support” on page 24. This ends the procedure. 3. You can determine the GPU slot information by using the lshw command. To determine the GPU slot, complete the following steps: a) Record the PCI bus information that is in the error message.
  • Page 31: Identifying The Location Of The Storage Device

    a) The operating system log contains information about the NVMe Flash adapter in the form of a PCI address. Record the PCI address information for the NVMe Flash adapter that failed. For example, in the operating system log message nvme 0006:01:00.0: Failed status: ffffffff, reset controller, the PCI address of the failing NVMe Flash adapter is 0006:01:00.0.
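Recording the PCI address from an operating system log message, as step a) describes, can be sketched with a regular expression. This is a minimal sketch, not part of the handbook: the function name `nvme_pci_addresses` is hypothetical, and the pattern assumes the `domain:bus:device.function` form shown in the example message (`0006:01:00.0`).

```python
import re

# PCI address in domain:bus:device.function form, for example 0006:01:00.0.
PCI_ADDRESS = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def nvme_pci_addresses(log_text: str) -> list:
    """Return the distinct PCI addresses found on log lines that mention nvme.

    Hypothetical helper: log_text is assumed to be operating system log
    output that the caller has already captured as a string.
    """
    found = []
    for line in log_text.splitlines():
        if "nvme" not in line:
            continue
        match = PCI_ADDRESS.search(line)
        if match and match.group(0) not in found:
            found.append(match.group(0))
    return found


if __name__ == "__main__":
    sample = "nvme 0006:01:00.0: Failed status: ffffffff, reset controller"
    print(nvme_pci_addresses(sample))
```

The sample message is the one quoted in the handbook; the address that is returned is the value to record before you look up the adapter location.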
  • Page 32: Resolving An Over Temperature Problem For A Water-Cooled 8335-Gtw Or 8335-Gtx System

    Table 7. GPU and PCIe adapter user guides (continued). Name: QLogic. User guide: QLogic website (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/IBM_Search.aspx). Resolving an over temperature problem for a water-cooled 8335-GTW or 8335-GTX system: Learn how to identify the service action that is needed to resolve an over temperature problem. Procedure 1.
  • Page 33 8. Replace the cold plates. For instructions about how to replace the cold plates, see Removing and replacing the cold plates in the 8335-GTW or 8335-GTX. Does the problem persist? Yes: Go to “Contacting IBM service and support” on page 24. This ends the procedure. No: This ends the procedure.
  • Page 34: Determining And Setting The Thermal Mode For An 8335-Gtg, 8335-Gth, Or 8335-Gtx System

    Determining and setting the thermal mode for an 8335-GTG, 8335-GTH, or 8335-GTX system Learn how to determine and set the thermal mode that is required for your system. You might be required to set the thermal mode of the system to a setting other than the default setting, depending on your system, adapter, and cable type.
  • Page 35: Identifying A Service Action By Using System Event Logs

    Table 9. Thermal mode setting for the 8335-GTX system (continued). Columns: Adapter feature code, Adapter description, Cable type, Thermal mode. EC6G (PCIe4 x16, 2-port HDR 100 Gb InfiniBand ConnectX-6 adapter): Copper cable, DEFAULT; Optical cable, HEAVY_IO. All other adapters and cable types: DEFAULT. Setting the thermal mode: After you determine the thermal mode based on the system model, adapter, and cable type, choose one of the following options to set the thermal mode.
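The thermal mode lookup for the EC6G row of Table 9 can be sketched as a small dictionary lookup. This is an illustration under stated assumptions: the names `KNOWN_GTX_ROWS` and `gtx_thermal_mode` are hypothetical, and because only the continued part of the table is visible in this excerpt, the sketch returns `None` for any adapter it does not know rather than assuming DEFAULT.

```python
# Rows of Table 9 (8335-GTX) that are visible in this excerpt.
# Earlier rows of the table are not shown here, so they are not included.
KNOWN_GTX_ROWS = {
    ("EC6G", "copper"): "DEFAULT",
    ("EC6G", "optical"): "HEAVY_IO",
}


def gtx_thermal_mode(feature_code: str, cable_type: str):
    """Return the thermal mode for a known row, or None if the row is not listed.

    None means "consult the full table": the "all other adapters and cable
    types" row applies only to adapters that the complete table does not list.
    """
    key = (feature_code.strip().upper(), cable_type.strip().lower())
    return KNOWN_GTX_ROWS.get(key)


if __name__ == "__main__":
    print(gtx_thermal_mode("EC6G", "Optical"))
```

Returning `None` instead of a default keeps the sketch from giving a wrong answer for adapters that appear in the part of the table that is not reproduced here.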
  • Page 36: Verifying A Repair

    Then Note: Alerts with a value of No that are displayed in the Serviceable column do not require service. 4. Starting with the first entry in the Active Alerts section with a value of Yes in the Serviceable column, complete the following steps until all entries are resolved: a.
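The rule for choosing which alerts to work on (only entries with Yes in the Serviceable column, in the order listed) can be sketched as a filter. This is a minimal sketch: the function name `serviceable_alerts` and the dictionary keys `id` and `serviceable` are hypothetical, because the format in which alerts are exported is not shown in this excerpt.

```python
def serviceable_alerts(active_alerts):
    """Return the alerts that require service, preserving their listed order.

    Hypothetical data shape: each alert is a dict with a "serviceable" key
    that holds "Yes" or "No", mirroring the Serviceable column.
    """
    return [
        alert
        for alert in active_alerts
        if str(alert.get("serviceable", "")).strip().lower() == "yes"
    ]


if __name__ == "__main__":
    alerts = [
        {"id": 1, "serviceable": "No"},
        {"id": 2, "serviceable": "Yes"},
        {"id": 3, "serviceable": "Yes"},
    ]
    for alert in serviceable_alerts(alerts):
        print(alert["id"])
```

Because the list order is preserved, the first item that is returned corresponds to the first serviceable entry in the Active Alerts section.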
  • Page 37: Collecting Diagnostic Data

    Perform a ping test to verify the network connectivity. Collecting diagnostic data Learn how to collect diagnostic data to send to IBM service and support. About this task To collect diagnostic data, complete the following steps: Procedure 1.
  • Page 38: Contacting Ibm Service And Support

    3. To collect system event logs, type the following command and press Enter: openbmctool -U <username> -P <password> -H <BMC IP address or BMC host name> collect_service_data 4. Send the data that you collected during this procedure to IBM service and support. This ends the procedure. Contacting IBM service and support You can contact IBM service and support by telephone or through the IBM Support Portal.
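The `collect_service_data` command in step 3 can be assembled programmatically, which avoids quoting mistakes when the user name, password, or host name contains special characters. This is a minimal sketch: the function name is hypothetical, and only the options that appear in the handbook command (`-U`, `-P`, `-H`, `collect_service_data`) are used.

```python
import subprocess


def collect_service_data_command(username: str, password: str, bmc_host: str) -> list:
    """Build the argument list for the openbmctool command shown in step 3."""
    return [
        "openbmctool",
        "-U", username,
        "-P", password,
        "-H", bmc_host,
        "collect_service_data",
    ]


def run_collect_service_data(username: str, password: str, bmc_host: str) -> int:
    """Run the command and return its exit code (assumes openbmctool is installed)."""
    command = collect_service_data_command(username, password, bmc_host)
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(collect_service_data_command("<username>", "<password>", "<BMC host>"))
```

Passing the arguments as a list, rather than as one shell string, means that no shell interprets the password or host name.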
  • Page 39: Finding Parts And Locations

    Note: 8335-GTW and 8335-GTX systems do not support this location. Fan 2. Fan 3. HDD 0 and HDD 1: See Removing and replacing a disk drive in the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX.
  • Page 40 Table 12. Front view locations (continued) Index number FRU description FRU removal and replacement procedures Power switch and cable • Removing and replacing the power switch and cable in the 8335-GTC • Removing and replacing the power switch and cable in the 8335-GTG or 8335-GTH •...
  • Page 41 Table 13. Top view locations Index number FRU description FRU removal and replacement procedures Disk drive and fan card • Removing and replacing the disk drive and fan card in the 8335-GTC • Removing and replacing the disk drive and fan card in the 8335-GTG or 8335-GTH •...
  • Page 42 Table 13. Top view locations (continued) Index number FRU description FRU removal and replacement procedures Baseboard management controller (BMC) card: See Removing and replacing the BMC card in the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX. Trusted platform module See Removing and replacing the trusted platform module in the Note: 8335-GTC, 8335-GTG, and 8335-GTH and 8335-GTX.
  • Page 43 Figure 3. Memory locations The following table provides the memory locations.
  • Page 44 Table 15. Memory locations. Columns: Index number, FRU description, FRU removal and replacement procedures. FRU descriptions: DIMM 0, DIMM 1, DIMM 2, DIMM 3, DIMM 4, DIMM 5... Procedures: See Removing and replacing memory in the 8335-GTC, 8335-GTG, or 8335-GTH or Removing and replacing memory in the 8335-GTW or 8335-GTX.
  • Page 45: 8335-Gtc, 8335-Gtg, 8335-Gth, 8335-Gtw, Or 8335-Gtx Parts

    8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX parts: Use this information to find the field-replaceable unit (FRU) part number. Rack final assembly. Figure 4. Rack final assembly. Table 16. Rack final assembly part numbers. Columns: Index number, Part number, Units per assembly, Description. 45W8836: Fixed rail kit - contains left and right fixed rails and
  • Page 46 Table 16. Rack final assembly part numbers (continued). Columns: Index number, Part number, Units per assembly, Description. 01EM209: Fixed rail kit - contains left and right fixed rails and attaching screws (8335-GTW). 00E4260: Slide rail kit - contains left and right slide rails and attaching screws (8335-GTC or 8335-GTG). 00E7329: Electronic Industries Association (EIA) bracket (right...
  • Page 47 System parts Figure 5. System parts
  • Page 48 Table 17. System parts. Columns: Index number, Part number, Units per assembly, Description. Top access cover assembly. PCI adapters: Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX.
  • Page 49 Additional system parts (8335-GTC, 8335-GTG, or 8335-GTH air-cooled system) Figure 6. Additional system parts (8335-GTC, 8335-GTG, or 8335-GTH air-cooled system)
  • Page 50 Table 18. Additional system parts (8335-GTC, 8335-GTG, or 8335-GTH air-cooled system). Columns: Index number, Part number, Units per assembly, Description. 01KL842: GPU kit (includes 16 GB GPU card, heat sink, and thermal interface material (TIM)) (8335-GTC). 01KL843: GPU kit (includes 16 GB GPU card, heat sink, and thermal interface material (TIM)) (8335-GTG or 8335-GTH). 02CL680...
  • Page 51 *8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, and 8335-GTX systems do not support mixing system processors with different DDx.y levels, different speeds, or differing numbers of cores. To determine the DDx.y level, type the following command and press Enter: openbmctool -U <username> -P <password> -H <BMC IP address or BMC host name> fru print The DDx.y level is the CPU version number in the format xy.
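The processor mixing rule above (same DDx.y level, same speed, same number of cores) can be sketched as a consistency check. This is a sketch under stated assumptions: the names `Processor`, `dd_level`, and `is_supported_mix` are hypothetical, and the caller is assumed to have already read the two-character CPU version from the `fru print` output, whose exact layout is not shown in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    cpu_version: str  # two-character version "xy", for example "22"
    speed_ghz: float
    cores: int


def dd_level(cpu_version: str) -> str:
    """Convert a CPU version in the format xy into the DDx.y notation."""
    if len(cpu_version) != 2 or not cpu_version.isdigit():
        raise ValueError(f"expected a two-digit CPU version, got {cpu_version!r}")
    return f"DD{cpu_version[0]}.{cpu_version[1]}"


def is_supported_mix(processors) -> bool:
    """Return True when all processors share DD level, speed, and core count."""
    signatures = {
        (dd_level(p.cpu_version), p.speed_ghz, p.cores) for p in processors
    }
    return len(signatures) <= 1


if __name__ == "__main__":
    installed = [Processor("22", 3.15, 18), Processor("22", 3.15, 18)]
    print(dd_level(installed[0].cpu_version), is_supported_mix(installed))
```

The example values follow the DD2.2 18 core 3.15 GHz module kit that is listed in the parts tables; a system that pairs it with a DD2.1 module would fail the check.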
  • Page 52 Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs) Figure 7. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs)
  • Page 53 Table 19. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs). Columns: Index number, Part number, Units per assembly, Description. 01EM314: GPU kit (includes spreader assembly, 16 GB GPU card, air baffle, and TIM) (8335-GTW). 02CL351: GPU kit (includes spreader assembly, 16 GB GPU card, air baffle, and TIM) (8335-GTX). 02CL683: GPU kit (includes spreader assembly, 32 GB GPU card,...
  • Page 54 Table 19. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 4 GPUs) (continued). Columns: Index number, Part number, Units per assembly, Description. 01EM321: DD2.1 18 core 2.900 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump) (8335-GTW)*. 02CL566: DD2.2 18 core 3.15 GHz system processor module kit...
  • Page 55 Additional system parts (8335-GTW or 8335-GTX water-cooled system with 6 GPUs) Figure 8. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 6 GPUs)
  • Page 56 Table 20. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 6 GPUs). Columns: Index number, Part number, Units per assembly, Description. 01EM314: GPU kit (includes spreader assembly, 16 GB GPU card, air baffle, and TIM) (8335-GTW). 02CL351: GPU kit (includes spreader assembly, 16 GB GPU card, air baffle, and TIM) (8335-GTX). 02CL683: GPU kit (includes spreader assembly, 32 GB GPU card,...
  • Page 57 Table 20. Additional system parts (8335-GTW or 8335-GTX water-cooled system with 6 GPUs) (continued). Columns: Index number, Part number, Units per assembly, Description. 01EM321: DD2.1 18 core 2.900 GHz system processor module kit (includes system processor module, processor tray, 4 mm hex driver, module replacement tool, and air pump) (8335-GTW)*. 02CL566: DD2.2 18 core 3.15 GHz system processor module kit...
  • Page 58 Table 21. Miscellaneous system parts (continued). Columns: Description, Part number, Units per assembly. Time-of-day battery (8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX): 00RY543. Baffle screw kit (8335-GTC, 8335-GTG, 8335-GTH, 8335-GTW, or 8335-GTX): 01EM312. Note: The screw kit includes one screw for each of the power supply air baffles and one screw for the empty GPU slot.
  • Page 59: Notices

    Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead.
  • Page 60: Accessibility Features For Ibm Power Systems Servers

    This product uses standard navigation keys. Interface information The IBM Power Systems servers user interfaces do not have content that flashes 2 - 55 times per second. The IBM Power Systems servers web user interface relies on cascading style sheets to render content properly and to provide a usable experience.
  • Page 61: Privacy Policy Considerations

    Contact the vendor for accessibility information about its products. Related accessibility information In addition to standard IBM help desk and support websites, IBM has a TTY telephone service for use by deaf or hard of hearing customers to access sales and support services:...
  • Page 62: Class A Notices

    Class A Notices The following Class A statements apply to the IBM servers that contain the POWER9 processor and its features unless designated as electromagnetic compatibility (EMC) Class B in the feature information. Federal Communications Commission (FCC) Statement Note: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules.
  • Page 63 Japan Electronics and Information Technology Industries Association Statement This statement explains the Japan JIS C 61000-3-2 product wattage compliance. This statement explains the Japan Electronics and Information Technology Industries Association (JEITA) statement for products less than or equal to 20 A per phase. This statement explains the JEITA statement for products greater than 20 A, single phase.
  • Page 64 To ensure this, the devices must be installed and operated as described in the manuals. Furthermore, only cables recommended by IBM may be connected. IBM assumes no responsibility for compliance with the protection requirements if the product is modified without the consent of IBM or...
  • Page 65: Class B Notices

    Properly shielded and grounded cables and connectors must be used in order to meet FCC emission limits. Proper cables and connectors are available from IBM-authorized dealers. IBM is not responsible for any radio or television interference caused by unauthorized changes or modifications to this equipment.
  • Page 66 Tel: +49 800 225 5426 email: halloibm@de.ibm.com VCCI Statement - Japan Japan Electronics and Information Technology Industries Association Statement This statement explains the Japan JIS C 61000-3-2 product wattage compliance. This statement explains the Japan Electronics and Information Technology Industries Association (JEITA) statement for products less than or equal to 20 A per phase.
  • Page 67: Terms And Conditions

    Permissions for the use of these publications are granted subject to the following terms and conditions. Applicability: These terms and conditions are in addition to any terms of use for the IBM website. Personal Use: You may reproduce these publications for your personal, noncommercial use provided that all proprietary notices are preserved.
  • Page 68 IBM reserves the right to withdraw the permissions granted herein whenever, in its discretion, the use of the publications is detrimental to its interest or, as determined by IBM, the above instructions are not being properly followed.
  • Page 70 IBM®...
