Nvidia DGX A100 User Manual

DGX A100 System
DU-09821-001_v01 | May 2020
User Guide

Summary of Contents for Nvidia DGX A100

  • Page 1 DGX A100 System DU-09821-001_v01 | May 2020 User Guide...
  • Page 2: Table Of Contents

    1.5 Customer Support; 1.5.1 NVIDIA Enterprise Support Portal; 1.5.2 NVIDIA Enterprise Support Email; 1.5.3 NVIDIA Enterprise Support - Local Time Zone Phone Numbers; Chapter 2. Connecting to the DGX A100; 2.1 Connecting to the Console; 2.1.1 Direct Connection...
  • Page 3 5.1 Managing the DGX Crash Dump Feature; 5.1.1 Using the Script; 5.1.2 Connecting to Serial Over LAN; Chapter 6. Managing the DGX A100 Self-Encrypting Drives; 6.1 Overview; 6.2 Installing the Software; 6.3 Initializing the System for Drive Encryption...
  • Page 4 9.1.2 Update Instructions; 9.2 Restoring the DGX A100 Software Image; 9.2.1 Obtaining the DGX A100 Software ISO Image and Checksum File; 9.2.2 Re-Imaging the System Remotely; 9.2.3 Creating a Bootable Installation Medium; 9.2.4 Creating a Bootable USB Flash Drive by Using the dd Command...
  • Page 5 Appendix A. Installing Software on Air-gapped DGX A100 Systems; A.1 Installing NVIDIA DGX A100 Software; A.2 Re-Imaging the System; A.3 Creating a Local Mirror of the NVIDIA and Canonical Repositories; A.3.1 Create Mirrors; A.3.2 Configure the Target System; A.4 Installing Docker Containers...
  • Page 6 C.1 United States; C.2 United States / Canada; C.3 Canada; C.4 CE; C.5 Australia and New Zealand; C.6 Brazil; C.7 Japan; C.8 South Korea; C.9 China; C.10 Taiwan; C.11 Russia/Kazakhstan/Belarus...
  • Page 7: Chapter 1 Introduction

    The NVIDIA DGX™ A100 system is the universal system purpose-built for all AI infrastructure and workloads, from analytics to training to inference. The system is built on eight NVIDIA A100 Tensor Core GPUs. This document is for users and administrators of the DGX A100 system.
  • Page 8: Hardware Overview

    1 GbE RJ45 interface management
    Power Supply    3 kW
    1.1.2 Mechanical Specifications
    Feature         Description
    Form Factor     6U Rackmount
    Height          10.39" (264 mm)
    Width           19" (482.3 mm)
    Depth           35.32" (897.2 mm)
    System Weight   271 lbs (123 kg)
  • Page 9: Power Specifications

    Note: The DGX A100 will not operate with fewer than three PSUs. DGX A100 Locking Power Cords The DGX A100 is shipped with a set of six (6) locking power cords that have been qualified for use with the DGX A100 to ensure regulatory compliance.
  • Page 10 To UNLOCK the power cord, move the switch to the unlocked position (indicator will show GREEN). To LOCK the power cord, move the switch to the locked position (indicator should show only RED).
  • Page 11: Environmental Specifications

    1.1.5.1 With Bezel
    Control         Description
    Power Button    Press to turn the DGX A100 system On or Off.
                    Green flashing (1 Hz): Standby (BMC booted)
                    Green flashing (4 Hz): POST in progress
                    Green solid On: Power On
    ID Button       Press to cause the button blue LED to turn On or blink (configurable through the BMC) as an identifier during servicing.
  • Page 12: Rear Panel Modules

    1.1.5.2 With Bezel Removed IMPORTANT: See the section “Turning DGX A100 On and Off” for instructions on how to properly turn the system on or off. 1.1.6 Rear Panel Modules
  • Page 13: Motherboard Connections And Controls

    Blinks when ID button is pressed from the front of the unit as an aid in identifying the unit needing servicing
    BMC Reset button: Press to manually reset the BMC
    “Configuring Network Proxies” for details on the network connections.
    1.1.8 Motherboard Tray Components
  • Page 14: Gpu Tray Components

    1.1.9 GPU Tray Components
  • Page 15: Network Connections, Cables, And Adaptors

    4    port 1 (bottom)   e1:00.1   enp225s0f1 (See note)
    5    port 0 (left)     61:00.0   enp97s0f0 (See note)
    5    port 1 (right)    61:00.1   enp97s0f1 (See note)
                           0c:00.0   enp12s0
                           12:00.0   enp18s0
                           8d:00.1   enp141s0
                           94:00.0   enp148s0
                           e2:00.0   enp226s0
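    The interface names in the table above appear to follow the predictable naming pattern enp<bus>s<slot>[f<function>], where the PCI bus number is written in decimal (for example, bus e1 is 225). The following is a minimal sketch of that arithmetic. The function name pci_to_ifname and its "multi" argument are hypothetical and are not part of the DGX OS; the function suffix is only appended when the caller states that the device is multi-function.

```shell
#!/bin/sh
# pci_to_ifname BUS:SLOT.FUNC [multi]
# Hypothetical helper (not a DGX OS tool): derives an interface name of the
# form enp<bus>s<slot>[f<func>] from a PCI address such as e1:00.1.
pci_to_ifname() {
    addr=$1
    bus=${addr%%:*}     # "e1" from "e1:00.1"
    rest=${addr#*:}     # "00.1"
    slot=${rest%%.*}    # "00"
    func=${rest#*.}     # "1"
    # printf converts the hexadecimal bus and slot numbers to decimal.
    name=$(printf 'enp%ds%d' "0x$bus" "0x$slot")
    # Assumption: the caller passes "multi" for multi-function devices,
    # which carry the f<func> suffix.
    if [ "${2:-}" = "multi" ]; then
        name="${name}f${func}"
    fi
    printf '%s\n' "$name"
}
```

    For example, pci_to_ifname e1:00.1 multi yields the name listed for slot 4, port 1, and pci_to_ifname 0c:00.0 yields the single-function name enp12s0.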
  • Page 16: Supported Network Cables

    DGX A100. The ConnectX-6 firmware determines which cables are supported. 1.2.3 Supported Network Adaptors To connect the DGX A100 system to an existing 10 or 25 GbE network, you can purchase the following adaptors from NVIDIA. Component Mellanox MPN...
  • Page 17: Dgx Os Software

    DGX OS SOFTWARE The DGX A100 system comes pre-installed with a DGX software stack incorporating
    • An Ubuntu server distribution with supporting packages
    • The NVIDIA GPU driver
    • Docker Engine
    • NVIDIA Container Toolkit
    ...
  • Page 18: Additional Documentation

    How to access the NGC container registry for using containerized deep learning GPU-accelerated applications on your DGX A100 system.
    • NVSM Software User Guide: Contains instructions for using the NVIDIA System Management software.
    • DCGM Software User Guide: Contains instructions for using the Data Center GPU Manager software.
  • Page 19: Customer Support

    Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX A100 system. You can contact NVIDIA Enterprise Support in the following ways.
  • Page 20: Chapter 2. Connecting To The Dgx A100

    2.1.1 Direct Connection At either the front or the back of the DGX A100 system, connect a display to the VGA connector, and a keyboard to any of the USB ports. Note: The display resolution must be 1440x900 or lower.
  • Page 21 [Figures: DGX A100 Server Front; DGX A100 Server Rear]
  • Page 22: Remote Connection Through The Bmc

    BMC. The BMC username is the same as the administrator username.
    Username: <administrator-username>
    Password: <bmc-password>
    Make sure you have connected the BMC port on the DGX A100 system to your LAN. Open a browser within your LAN and go to: https://<bmc-ip-address>/ Make sure popups are allowed for the BMC address.
  • Page 23 From the left-side navigation menu, click Remote Control. The Remote Control page allows you to open a virtual Keyboard/Video/Mouse (KVM) on the DGX A100 system, as if you were using a physical monitor and keyboard connected to the front of the system.
  • Page 24: Ssh Connection To The Os

    SSH CONNECTION TO THE OS You can also establish an SSH connection to the DGX A100 OS through the network port. See the section “Network Ports” to identify the port to use, and the section “Configuring...
  • Page 25: Chapter 3. First-Boot Setup

    While NVIDIA partner network personnel or NVIDIA field service engineers will install the DGX A100 system at the site and perform the first boot setup, the first boot setup instructions are provided here for reference and to support any re-imaging of the server.
  • Page 26 • Using the Remote BMC The system will take a few minutes to boot. You are presented with end user license agreements (EULAs) for the NVIDIA software. Accept the EULA to proceed with the installation. Perform the steps to configure the DGX A100 software.
  • Page 27 • Choose a primary network interface for the DGX A100 system; for example, enp226s0. This should typically be the interface that you will use for subsequent system configuration or in-band management. After you select the primary network interface, the system attempts to configure the interface for DHCP and then asks you to enter the name server addresses.
  • Page 28: Chapter 4 Quick Start And Basic Operation

    Installation Partner. REGISTRATION To obtain support for your DGX A100 system, follow the instructions for registration in the Entitlement Certification email that was sent as part of the purchase. Registration allows you access to the NVIDIA Enterprise Support Portal, technical...
  • Page 29: Obtaining An Ngc Account

    Observe the following startup and shutdown instructions. 4.4.1 Startup Considerations In order to keep your DGX A100 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components are able to complete their initialization. 4.4.2...
  • Page 30: Verifying Functionality

    VERIFYING FUNCTIONALITY 4.5.1 Quick Health Check This section walks you through the steps of performing a health check on the DGX A100 System, and verifying the Docker and NVIDIA driver installation. Establish an SSH connection to the DGX A100 System.
  • Page 31: Running Ngc Containers With Gpu Support

    RUNNING NGC CONTAINERS WITH GPU SUPPORT To obtain the best performance when running NGC containers on DGX A100 systems, two methods of providing GPU support for Docker containers have been developed:...
  • Page 32: Using The Nvidia Container Runtime For Docker

    • Use docker run with nvidia as the default runtime. You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry: "default-runtime": "nvidia", The following is an example of how the added line appears in the JSON file.
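    After editing the file, it can be useful to confirm that the entry is present. The following is a minimal sketch, not an NVIDIA-provided tool: the function name has_nvidia_default_runtime is hypothetical, and the check is a plain text match on the "default-runtime" entry rather than a full JSON parse.

```shell
#!/bin/sh
# has_nvidia_default_runtime: reads a daemon.json document on stdin and
# exits 0 if it contains a "default-runtime": "nvidia" entry.
# Hypothetical helper; a text match only, it does not validate the JSON.
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"'
}
```

    Example use: has_nvidia_default_runtime < /etc/docker/daemon.json && echo "nvidia is the default runtime"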
  • Page 33 $ docker run ... CAUTION: If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so may result in the container being optimized only for the GPU architecture on which it was built.
  • Page 34: Managing Cpu Mitigations

    CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable. Example
    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
  • Page 35: Disabling Cpu Mitigations

    $ sudo apt purge nv-mitigations-off
    Reboot the system.
    Verify CPU mitigations are enabled.
    $ cat /sys/devices/system/cpu/vulnerabilities/*
    The output should include several Mitigation lines. See “Determining the CPU Mitigation State of the DGX System” for example output.
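    The check described above can be scripted by counting line prefixes. The following is a minimal sketch under stated assumptions: the function name mitigation_state is hypothetical, and the thresholds (two or more lines beginning with Vulnerable means disabled, otherwise two or more lines beginning with Mitigation means enabled) are one interpretation of "multiple" and "several" in this guide.

```shell
#!/bin/sh
# mitigation_state: reads the concatenated contents of
# /sys/devices/system/cpu/vulnerabilities/* on stdin and prints
# "disabled", "enabled", or "unknown".
# Hypothetical helper; the thresholds below are assumptions.
mitigation_state() {
    awk '
        /^Vulnerable/ { vuln++ }
        /^Mitigation/ { mit++ }
        END {
            if (vuln >= 2)      print "disabled"
            else if (mit >= 2)  print "enabled"
            else                print "unknown"
        }'
}
```

    Example use: cat /sys/devices/system/cpu/vulnerabilities/* | mitigation_state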
  • Page 36: Chapter 5 Additional Features And Instructions

    CHAPTER 5 ADDITIONAL FEATURES AND INSTRUCTIONS This chapter describes specific features of the DGX A100 server to consider during setup and operation. MANAGING THE DGX CRASH DUMP FEATURE The DGX OS includes a script to manage this feature. 5.1.1 Using the Script...
  • Page 37: Connecting To Serial Over Lan

    To view the console output during the crash dump, connect to serial over LAN as follows:
    $ ipmitool -I lanplus -H <bmc-ip-address> -U <BMC-USERNAME> -P <BMC-PASSWORD> sol activate
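    Passing the password with -P leaves it visible in the shell history and process list. As an alternative, ipmitool accepts -E, which reads the password from the IPMI_PASSWORD environment variable. The following is a minimal sketch that only prints the resulting command; the function name sol_command is hypothetical, and the address and user name in the usage line are placeholders.

```shell
#!/bin/sh
# sol_command BMC_IP BMC_USER
# Hypothetical helper: prints an ipmitool serial-over-LAN command that
# uses -E (password taken from $IPMI_PASSWORD) instead of -P.
# It prints the command rather than running it.
sol_command() {
    printf 'ipmitool -I lanplus -H %s -U %s -E sol activate\n' "$1" "$2"
}
```

    Example use: export IPMI_PASSWORD, then run the command printed by sol_command <bmc-ip-address> <BMC-USERNAME>.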
  • Page 38: Chapter 6. Managing The Dgx A100 Self-Encrypting Drives

    The NVIDIA DGX™ OS software supports the ability to manage self-encrypting drives (SEDs), including setting an Authentication Key for locking and unlocking the drives on NVIDIA DGX™ A100 systems. You can manage only the SED data drives. The software cannot be used to manage OS drives even if they are SED-capable.
  • Page 39: Installing The Software

    • Once initialized, SEDs are locked upon power loss, such as a system shutdown or drive removal. Locked drives get unlocked after power is restored and the root file system is mounted.
  • Page 40: Enabling Drive Locking

    ENABLING DRIVE LOCKING After initializing the system for SED management, use the nv-disk-encrypt command to enable drive locking by issuing the following.
    $ sudo nv-disk-encrypt lock
    After initializing the system and enabling drive locking, the drives will become locked when they lose power.
  • Page 41 Disk(s) that cannot be used for encryption
    +--------------+---------+--------------------------------------------------------------------------+
    | Name         | Serial  | Status                                                                   |
    +--------------+---------+--------------------------------------------------------------------------+
    | /dev/nvme0n1 | xxxxx1 | SED capable = Y, Boot disk = Y, Locked = N, Lock Enabled = N, MBR done = N |...
  • Page 42: Example 2: Generating Random Passwords

    $ sudo nv-disk-encrypt init -f /tmp/<your-file>.json -g
    $ sudo nv-disk-encrypt lock
    Provide a password for the vault when prompted. Passwords must consist of only upper-case letters, lower-case letters, digits, and/or the following special characters: ~ : @ % ^ + = _ ,
    Delete the JSON file in the temporary location for security.
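    The character rule above can be checked before a password is supplied. The following is a minimal sketch, not part of nv-disk-encrypt: the function name valid_vault_password is hypothetical, and rejecting an empty password is an assumption.

```shell
#!/bin/sh
# valid_vault_password PASSWORD
# Hypothetical helper: exits 0 if PASSWORD contains only upper-case letters,
# lower-case letters, digits, and the characters ~ : @ % ^ + = _ ,
# Assumption: an empty password is rejected.
# Note: letter ranges depend on the locale; run with LC_ALL=C for strict ASCII.
valid_vault_password() {
    case $1 in
        '' | *[!A-Za-z0-9~:@%^+=_,]*) return 1 ;;
    esac
    return 0
}
```

    Example use: valid_vault_password "$candidate" || echo "password contains unsupported characters"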
  • Page 43: Disabling Drive Locking

    ERASING YOUR DATA CAUTION: Be aware when executing this that all data will be lost. On DGX A100 systems, these drives generally form a RAID 0 array - this will also be destroyed when performing an erase. After initializing the system for SED management, use the nv-disk-encrypt command to erase data on your drives by issuing the following.
  • Page 44: Using The Trusted Platform Module

    You need to access the system BIOS to enable or disable the TPM. 6.9.1 Enabling the TPM The DGX A100 system is shipped with the TPM disabled. To enable the TPM, do the following. Reboot the DGX A100, then press [Del] or [F2] at the NVIDIA splash screen to enter the BIOS Setup.
  • Page 45: Changing Disk Passwords, Adding Disks, Or Replacing Disks

    Be sure to rebuild the RAID array after unlocking the drive. 6.12 RECOVERING FROM LOST KEYS NVIDIA recommends backing up your keys and storing them in a secure location. If you’ve lost the key used to initialize and lock your drives, you will not be able to unlock the drive again.
  • Page 46 Specify the PSID to reset the drive using the following sedutil-cli command:
    $ sudo sedutil-cli --yesIreallywanttoERASEALLmydatausingthePSID <your-drive-PSID> /dev/nvme3n1
  • Page 47: Chapter 7 Network Configuration

    CHAPTER 7 NETWORK CONFIGURATION This chapter describes key network considerations and instructions for the DGX A100 System. CONFIGURING NETWORK PROXIES If your network requires use of a proxy server, you will need to set up configuration files to ensure the DGX A100 System communicates through the proxy.
  • Page 48: For Apt

    Docker IP address range, then no changes are needed and you can skip this section. However, if your network uses the addresses within this range for the DGX A100 System, you should change the default Docker network addresses.
  • Page 49 ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --bip=192.168.127.1/24 --fixed-cidr=192.168.127.128/25
    LimitMEMLOCK=infinity
    LimitSTACK=67108864
    Save and close the /etc/systemd/system/docker.service.d/docker-override.conf file when done.
    Reload the systemctl daemon.
    $ sudo systemctl daemon-reload
    Restart Docker.
    $ sudo systemctl restart docker
  • Page 50: Opening Ports

    If port 443 is proxied through a corporate firewall, then WebSocket protocol traffic must be supported. CONNECTIVITY REQUIREMENTS FOR NGC CONTAINERS To run NVIDIA NGC containers from the NGC container registry, your network must be able to access the following URLs:
    • http://archive.ubuntu.com/ubuntu/
    ...
  • Page 51: Configuring Static Ip Address For The Bmc

    This section describes how to set a static IP address for the BMC from the Ubuntu command line. Note: If you cannot access the DGX A100 System remotely, then connect a display (1440x900 or lower resolution) and keyboard directly to the DGX A100 system. To view the current settings, enter the following command.
  • Page 52: Configuring A Bmc Static Ip Address Using The System Bios

    DGX A100 System remotely. This process involves setting the BMC IP address during system boot. Connect a keyboard and display (1440x900 maximum resolution) to the DGX A100 System, then turn on the DGX A100 System.
  • Page 53: Configuring Static Ip Addresses For The Network Ports

    CONFIGURING STATIC IP ADDRESSES FOR THE NETWORK PORTS During the initial boot setup process for the DGX A100 System, you had an opportunity to configure static IP addresses for a single network interface. If you did not set this up at that time, you can configure the static IP addresses from the Ubuntu command line using the following instructions.
  • Page 54: Switching Between Infiniband And Ethernet

    SWITCHING BETWEEN INFINIBAND AND ETHERNET The NVIDIA DGX A100 System is equipped with eight QSFP56 network ports on the I/O board, typically used for cluster communications. By default these are configured as InfiniBand ports, but you have the option to convert these to Ethernet ports.
  • Page 55: Starting The Mellanox Software Tools

    ConnectX6(rev:0)  /dev/mst/mt4123_pciconf3  54:00.0  mlx5_3  net-ib3
    ConnectX6(rev:0)  /dev/mst/mt4123_pciconf2  4b:00.0  mlx5_2  net-ib2
    ConnectX6(rev:0)  /dev/mst/mt4123_pciconf1  14:00.0  mlx5_1  net-ib1
    ConnectX6(rev:0)  /dev/mst/mt4123_pciconf0  0d:00.0  mlx5_0  net-ib0
    If necessary, start the mst driver.
    $ sudo mst start
  • Page 56: Determining The Current Port Configuration

    <device-path> corresponds to the port you want to configure
    <config-number> is ‘1’ for InfiniBand and ‘2’ for Ethernet.
    Example setting slot 0 to Ethernet
    $ sudo mlxconfig -y -d /dev/mst/mt4123_pciconf2 set LINK_TYPE_P1=2
    Example setting slot 1 to InfiniBand
  • Page 57 $ sudo mlxconfig -y -d /dev/mst/mt4123_pciconf3 set LINK_TYPE_P1=1
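    Because the link type is selected by a bare number, a small wrapper that maps a readable name to the number can reduce mistakes. The following is a minimal sketch that only prints the mlxconfig command shown above; the function name link_type_command and the ib/eth keywords are hypothetical.

```shell
#!/bin/sh
# link_type_command DEVICE_PATH ib|eth
# Hypothetical helper: prints the mlxconfig command that sets port 1 of the
# given device to InfiniBand (1) or Ethernet (2). It does not run the command.
link_type_command() {
    case $2 in
        ib)  cfg=1 ;;
        eth) cfg=2 ;;
        *)   echo "usage: link_type_command <device-path> ib|eth" >&2
             return 1 ;;
    esac
    printf 'sudo mlxconfig -y -d %s set LINK_TYPE_P1=%s\n' "$1" "$cfg"
}
```

    Example use: link_type_command /dev/mst/mt4123_pciconf2 eth prints the Ethernet example from this section, which can then be reviewed and run.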
  • Page 58: Chapter 8 Configuring Storage

    SSDs are intended for application caching, so you must set up your own NFS storage for long term data storage. The following instructions describe how to mount the NFS onto the DGX A100 System, and how to cache the NFS using the DGX A100 SSDs for improved performance.
  • Page 59 /mnt is an example mount point.
    Verify caching is enabled.
    $ cat /proc/fs/nfsfs/volumes
    Look for the text FSC=yes in the output. The NFS will be mounted and cached on the DGX A100 System automatically upon subsequent reboot cycles.
  • Page 60: Chapter 9 Updating And Restoring The Software

    NVIDIA public repository. The process updates a DGX A100 system image to the latest QA’d versions of the entire DGX A100 software stack, including the drivers, for the latest update within a specific release; for example, to update to the latest Release 4.5 update from an earlier Release 4.5 version.
  • Page 61: Update Instructions

    For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki. Perform the updates using commands on the DGX A100 console. Run the package manager. $ sudo apt update Check to see which software will get updated.
  • Page 62: Restoring The Dgx A100 Software Image

    Obtaining the DGX A100 Software ISO Image and Checksum File To ensure that you restore the latest available version of the DGX A100 software image, obtain the current ISO image file from NVIDIA Enterprise Support. A checksum file is provided for the image to enable you to verify the bootable installation medium that you create from the image file.
  • Page 63: Re-Imaging The System Remotely

    Click OK at the Power Control dialogs, then wait for the system to power down and then come back online. As the system boots, press [F11] when the NVIDIA logo appears to get to the boot menu. Browse to locate the Virtual CD that corresponds to the inserted ISO, then boot the system from it.
  • Page 64: Creating A Bootable Installation Medium

    Ensure that the following prerequisites are met:
    • The correct DGX A100 software image is saved to your local disk. For more information, see “Obtaining the DGX A100 Software ISO Image and Checksum File”...
  • Page 65: Creating A Bootable Usb Flash Drive By Using Akeo Rufus

    On a Windows system, you can use the Akeo Reliable USB Formatting Utility (Rufus) to create a bootable USB flash drive that contains the DGX A100 software image. Ensure that the following prerequisites are met:
    • The correct DGX A100 software image is saved to your local disk. For more information, see “Obtaining the DGX A100 Software ISO Image and Checksum...
  • Page 66: Re-Imaging The System From A Usb Flash Drive

    Connect a monitor and keyboard directly to the DGX A100 system. Boot the system and press F11 when the NVIDIA logo appears to get to the boot menu. Select the USB volume name that corresponds to the inserted USB flash drive, and boot the system from it.
  • Page 67: Retaining The Raid Partition While Installing The Os

    OS disk as well as the RAID disks. Since the RAID array on the DGX A100 system is intended to be used as a cache and not for long-term data storage, this should not be disruptive. However, if you are an...
  • Page 68 $ sudo mount /raid
    Start the cache daemon.
    $ systemctl start cachefilesd
    These changes are preserved across system reboots.
  • Page 69: Chapter 10 Using The Bmc

    It monitors system sensors and other parameters. 10.1 CONNECTING TO THE BMC Make sure you have connected the BMC port on the DGX A100 system to your LAN. Open a browser within your LAN and go to: https://<bmc-ip-address>/ The BMC is supported on the following browsers: •...
  • Page 70 The BMC dashboard opens. [Figure: BMC dashboard, showing the Toolbar, Display Area, and Menu Bar]
  • Page 71: Overview Of Bmc Controls

    Displays inventory information of system modules: System, Processor, Memory Controller, BaseBoard, Power, Thermal, PCIE Device, PCIE Function, Storage. FRU Information Provides chassis, board, and product information for each field-replaceable unit (FRU) device.
  • Page 72 Settings, Media Redirection Settings, Network Settings, PAM Order Settings, Platform Event Filter, Services, SMTP Settings, SSL Settings, System Firewall, User Management, Video Recording Remote Control Opens the KVM Launch page for accessing the DGX A100 console remotely. Power Control Perform the following power actions:...
  • Page 73: Common Bmc Tasks

    Select Settings from the left-side navigation menu. Select the User Management card. Click the Help icon (?) for information about configuring users and creating a password. Log out and then log back in with the new credentials.
  • Page 74: Using The Remote Console

    10.3.2 Using the Remote Console Click Remote Control from the left-side navigation menu. Click Launch KVM to start the remote KVM and access the DGX A100 console. 10.3.3 Setting Up Active Directory or LDAP/E-Directory From the side navigation menu, click Settings and then click External User Services.
  • Page 75: Configuring Platform Event Filters

    • To view available configured slots, click Configured in the upper-left corner of the page.
    • To view available unconfigured slots, click UnConfigured in the upper-left corner of the page.
    • To delete an event filter from the list, click the x icon.
  • Page 76: Uploading Or Generating Ssl Certificates

    From the side navigation menu, click Settings and then click External User Services. Refer to the following sections for instructions on
    • Viewing the SSL Certificate
    • Generating an SSL Certificate
    • Uploading an SSL Certificate
  • Page 77 The View SSL Certificate page displays the basic information about the uploaded SSL certificate.
    • Certificate Version, Serial Number, Algorithm, and Public Key
    • Issuer information
    • Valid Date range
    • Issued to information
  • Page 78 Validity of the certificate. Enter a range from 1 to 3650 (days) Key Length The key length bit value of the certificate (Ex. 2048 bits) Click Save to generate the new certificate.
  • Page 79 Click the New Certificate folder icon, then browse to locate the appropriate file and select it. Click the New Private Key folder icon, then browse and locate the appropriate file and select it. Click Save.
  • Page 80: Chapter 11 Multi-Instance Gpu

    CHAPTER 11 MULTI-INSTANCE GPU Multi-Instance GPU (MIG) is a new capability of the NVIDIA A100 GPU. MIG uses spatial partitioning to carve the physical resources of a single A100 GPU into as many as seven independent GPU instances. These instances run simultaneously, each with its own memory, cache, and compute streaming multiprocessors.
  • Page 81: Viewing Available Profiles

    11.2.1 Viewing GPU Profiles You will need to know the profile ID when setting up MIG on a specific GPU. To view the available profiles (configurations), issue the following. $ nvidia-smi mig -i 0 -lgip Example output: +---------------------------------------------------------------------- | GPU instance profiles:...
  • Page 82: Viewing Compute Profiles

    <gpu-id> is the GPU device ID (0,1,2,3,4,5,6,7) <gpu-instance-id> is the GPU instance ID Example: The following example lists the available MIG compute profiles on GPU0/GPU instance ID 0: root# nvidia-smi mig -i 0 -gi 0 -lcip +----------------------------------------------------------------------------+ | Compute instance profiles: | GPU Profile...
  • Page 83: Creating A Gpu Instance

    GPU compute instances to further split its compute resources. 11.3.1 Creating a GPU instance Syntax nvidia-smi mig -i <gpu-id> -cgi <profile>[,<profile>...] where <gpu-id> is the GPU device ID (0, 1, 2, 3, 4, 5, 6, 7) <profile> is the MIG profile ID Example The following example creates a 7-slice GPU instance on GPU 0.
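    The syntax above can be wrapped in a small helper that validates its arguments before anything is sent to nvidia-smi. The following is a minimal sketch that only prints the command; the function name cgi_command is hypothetical, and restricting profiles to numeric IDs is an assumption based on the "MIG profile ID" wording above. Any profile values passed in the usage line are placeholders, not recommended settings.

```shell
#!/bin/sh
# cgi_command GPU_ID PROFILE[,PROFILE...]
# Hypothetical helper: validates its arguments and prints the
# "nvidia-smi mig -cgi" command. It does not run the command.
cgi_command() {
    # The GPU device ID must be a single digit from 0 to 7.
    case $1 in
        [0-7]) ;;
        *) echo "gpu-id must be 0-7" >&2; return 1 ;;
    esac
    # Assumption: profiles are given as a comma-separated list of numeric IDs.
    case $2 in
        '' | *[!0-9,]* | ,* | *, | *,,*)
            echo "profile list must be numeric IDs separated by commas" >&2
            return 1 ;;
    esac
    printf 'nvidia-smi mig -i %s -cgi %s\n' "$1" "$2"
}
```

    Example use: cgi_command 0 <profile>,<profile> prints the command for GPU 0 so that it can be reviewed before it is run as root.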
  • Page 84: Using Mig With Docker Containers

    To run containers, use the native GPU support provided with Docker 19.03 and later (included in the DGX OS installed on the DGX A100 system). Specify the GPU instance or compute instance using the “device=” parameter of the --gpus option as shown in the syntax below.
  • Page 85: Deleting Mig Instances

    (0 and 1) were created from it.
    Delete compute instance 1 on GPU instance 0.
    root# nvidia-smi mig -i 0 -gi 0 -ci 1 -dci
    Successfully destroyed compute instance ID 1 from GPU 0 GPU instance ID 0
    Delete compute instance 0 on GPU instance 0.
  • Page 86 root# nvidia-smi mig -i 0 -gi 0 -ci 0 -dci
    Successfully destroyed compute instance ID 0 from GPU 0 GPU instance ID 0
    Delete GPU instance 0.
    root# nvidia-smi mig -i 0 -gi 0 -dgi...
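    The example above deletes the compute instances first and the GPU instance last. The following is a minimal sketch that prints the commands in that order for review; the function name mig_teardown_commands is hypothetical.

```shell
#!/bin/sh
# mig_teardown_commands GPU_ID GPU_INSTANCE_ID COMPUTE_INSTANCE_ID...
# Hypothetical helper: prints one -dci command per compute instance,
# followed by the -dgi command for the GPU instance.
# It prints the commands rather than running them.
mig_teardown_commands() {
    gpu=$1
    gi=$2
    shift 2
    for ci in "$@"; do
        printf 'nvidia-smi mig -i %s -gi %s -ci %s -dci\n' "$gpu" "$gi" "$ci"
    done
    printf 'nvidia-smi mig -i %s -gi %s -dgi\n' "$gpu" "$gi"
}
```

    Example use: mig_teardown_commands 0 0 1 0 prints the three commands used in this section, in the same order.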
  • Page 87: Chapter 12 Security

    12.1 USER SECURITY MEASURES The NVIDIA DGX A100 system is a specialized server designed to be deployed in a data center. It must be configured to protect the hardware from unauthorized access and unapproved use. The DGX A100 system is designed with a dedicated BMC Management Port and multiple Ethernet network ports.
  • Page 88: Secure Flash Of Dgx A100 Firmware

    12.3 SECURE DATA DELETION This section explains how to securely delete data from the NVIDIA DGX A100 system SSDs to permanently destroy all the data that was stored there. This performs a more secure SSD data deletion than merely deleting files or reformatting the SSDs.
  • Page 89: Instructions

    $ sudo nvme list
    Run nvme format -s1 on all storage devices listed.
    Syntax:
    $ sudo nvme format -s1 <device-path>
    where <device-path> is the specific storage node as listed in the previous step.
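    When several devices are listed, the format commands can be generated from the nvme list output and reviewed before anything is erased. The following is a minimal sketch that only prints commands; the function name secure_erase_commands is hypothetical, and it assumes that each device row of the listing begins with a /dev/nvme node path.

```shell
#!/bin/sh
# secure_erase_commands: reads "nvme list" output on stdin and prints one
# "nvme format -s1" command per device row.
# Hypothetical helper; it prints the commands and erases nothing.
# Assumption: device rows start with a /dev/nvme... node path.
secure_erase_commands() {
    awk '$1 ~ /^\/dev\/nvme/ { print "sudo nvme format -s1 " $1 }'
}
```

    Example use: sudo nvme list | secure_erase_commands. Review the printed commands carefully, because every listed device is included.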
  • Page 91: Appendix A Installing Software On Air-Gapped Dgx A100 Systems

    Docker containers as well. INSTALLING NVIDIA DGX A100 SOFTWARE One method for updating DGX A100 software on an air-gapped DGX A100 system is to download the ISO image, copy it to removable media and then re-image the DGX A100 System from the media.
  • Page 92: Re-Imaging The System

    CAUTION: This process destroys all data and software customizations that you have made on the DGX A100 System. Be sure to back up any data that you want to preserve and push any Docker images that you want to keep to a trusted registry.
  • Page 93 $ sudo chown apt-mirror:apt-mirror /media/usb/repository Configure the path of the destination directory in /etc/apt/mirror.list and use the included list of repositories below to retrieve the packages for both Ubuntu base OS as well as the NVIDIA DGX OS packages: ############# config ################## set base_path...
  • Page 94: Configure The Target System

    The instructions in this section are to be performed on the target system. Prerequisites
    • The target DGX A100 system is installed, has gone through the first boot process, and is ready to be updated with the latest packages.
    ...
  • Page 95 If present, remove the file /etc/apt/sources.list.d/docker.list as it is no longer needed and it will eliminate error messages during the update process. Configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx-bionic-r418-cuda10-1-repo.list
    deb file:///media/usb/repository/mirror/international.download.nvidia.com/dgx/repos/bionic/ bionic-...
  • Page 96: Installing Docker Containers

    $ sudo apt full-upgrade INSTALLING DOCKER CONTAINERS This method applies to Docker containers hosted on the NVIDIA NGC Container Registry, and requires that you have an active NGC account. On a system with internet access, log in to the NGC Container Registry by entering the following command and credentials.
  • Page 97 Load the NVIDIA Docker image.
    $ docker load -i framework.tar
    Verify the image is on your system.
    $ docker images
  • Page 98: Appendix B Safety

    I components will void the UL Listing and other regulatory approvals of the product, and may result in noncompliance with product regulations in the region(s) in which the product is sold.
  • Page 99: Safety Warnings And Cautions

    Recycle the battery. The rail racks are designed to carry only the weight of the server system. Do not use rail-mounted equipment as a workspace. Do not place additional load onto any rail-mounted equipment.
  • Page 100: Intended Application Uses

    To remove power from the system, you must unplug the AC power cord from the wall outlet. Make sure all AC power cords are unplugged before you open the chassis, or add or remove any non-hot-plug components.
  • Page 101: System Access Warnings

    Caution: To avoid personal injury or property damage, the following safety instructions apply whenever accessing the inside of the product: Turn off all peripheral devices connected to this product.
  • Page 102: Rack Mount Warnings

    Install equipment in the rack from the bottom up with the heaviest equipment at the bottom of the rack. Extend only one piece of equipment from the rack at a time.
  • Page 103: Electrostatic Discharge (Esd)

    Use a conductive foam pad if available, but not the board wrapper. Do not slide the board over any surface.
  • Page 104: Other Hazards

    NICKEL NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal foam is not intended for direct and prolonged skin contact. Please use the handles to remove, attach or carry the bezel. While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you’re susceptible to nickel-related reactions.
  • Page 105 Access is through the use of a TOOL or lock and key, or other means of security, and is controlled by the authority responsible for the location.
  • Page 106: Appendix C Compliance

    APPENDIX C COMPLIANCE The NVIDIA Luna Server is compliant with the regulations listed in this section. UNITED STATES Federal Communications Commission (FCC) FCC Marking (Class A) This device complies with part 15 of the FCC Rules. Operation is subject to the following...
  • Page 107: United States / Canada

    The Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulation. Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada.
  • Page 108 The full text of the EU declaration of conformity is available at the following internet address: www.nvidia.com/support A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA GmbH (Bavaria Towers – Blue Tower, Einsteinstrasse 172, D-81677 Munich, Germany).
  • Page 109: Australia And New Zealand

    AUSTRALIA AND NEW ZEALAND Australian Communications and Media Authority This product meets the applicable EMC requirements for Class A, I.T.E equipment. BRAZIL JAPAN Voluntary Control Council for Interference (VCCI)
  • Page 110 ハードディスクドライブ 除外項目 除外項目 機械部品 ( ファン、ヒートシンク、ベゼル ケーブル / コネクター 除外項目 はんだ付け材料 フラックス、クリームはんだ、ラベル、その他消耗品 注: 1.「0」は、特定化学物質の含有率が日本工業規格 JIS C 0950:2008 に記載されている含有率基準値より低いことを示します。 2.「除外項目」は、特定化学物質が含有マークの除外項目に該当するため、特定化学物質について、日本工業規格 JIS C 0950:2008 に基づく含有マークの表示が不要であることを示します。 3.「0.1wt% 超」または「0.01wt% 超」は、特定化学物質の含有率が日本工業規格 JIS C 0950:2008 に記載されている含有率基準値を超えていることを示します。
  • Page 111 JIS C 0950: 2008. 3. “Exceeding 0.1wt%” or “Exceeding 0.01wt%” is entered in the table if the level of the specified chemical substance exceeds the threshold level specified in the standard, JIS C 0950: 2008.
  • Page 112: South Korea

    This is industrial (Class A) electromagnetic wave suitability equipment. The seller or user should take note of this, and the equipment is intended for use in places other than the home. Korea RoHS Material Content Declaration
  • Page 115: China

    CHINA China Compulsory Certificate No certification is needed for China. The NVIDIA DGX A100 is a server with power consumption greater than 1.3 kW. China RoHS Material Content Declaration 产品中有害物质的名称及含量 The Table of Hazardous Substances and their Content 根据中国 《电器电子产品有害物质限制使用管理办法》...
  • Page 116 All parts named in this table with an “X” are in compliance with the European Union’s RoHS Legislation. 注:环保使用期限的参考标识取决于产品正常工作的温度和湿度等条件 Note: The referenced Environmental Protection Use Period Marking was determined according to normal operating use conditions of the product such as temperature and humidity.
  • Page 117: Taiwan

    C.10 TAIWAN Bureau of Standards, Metrology & Inspection (BSMI) 報驗義務人 : 香港商輝達香港控股有限公司台灣分公司 統一編號:80022300 臺北市內湖區基湖路8號.
  • Page 118 Taiwan RoHS Material Content Declaration
  • Page 119: Russia/Kazakhstan/Belarus

    Federal Agency of Communication (FAC) This device complies with the rules set forth by the Federal Agency of Communications and the Ministry of Communications and Mass Media. Federal Security Service notification has been filed.
  • Page 120 LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.
