Express5800/1000 series

VLC Architecture
High-speed / low latency Intra-Cell cache-to-cache data transfer
The Express5800/1000 series server implements the VLC architecture, which allows low-latency cache-to-cache data transfer between multiple CPUs within a Cell.

In a split BUS architecture, a cache-to-cache data transfer must pass through the chipset. In the VLC architecture, however, each CPU can access the cache memory of the other CPUs directly, bypassing the chipset. This reduces the latency between cache memories and results in faster data transfers.
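To make the path difference concrete, the minimal Python sketch below sums per-hop latencies along the two transfer paths. The hop names and nanosecond values are illustrative assumptions for the example, not NEC specifications.

# Illustrative model of intra-Cell cache-to-cache transfer latency.
# All hop latencies are placeholder values, not measured figures.
SPLIT_BUS_HOPS = ["source cache", "FSB", "chipset", "FSB", "requesting CPU"]
VLC_HOPS = ["source cache", "requesting CPU"]  # direct cache-to-cache path

HOP_LATENCY_NS = {          # assumed costs, for illustration only
    "source cache": 20,
    "FSB": 40,
    "chipset": 60,
    "requesting CPU": 0,
}

def transfer_latency(hops):
    """Sum the assumed latency of every hop on the transfer path."""
    return sum(HOP_LATENCY_NS[h] for h in hops)

print("split BUS:", transfer_latency(SPLIT_BUS_HOPS), "ns")  # passes through the chipset
print("VLC      :", transfer_latency(VLC_HOPS), "ns")        # bypasses the chipset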
Dedicated Cache Coherency Interface (CCI)
High-speed / low latency Inter-Cell cache-to-cache data transfer
Another technology implemented in the Express5800/1000 series server to improve cache-to-cache data transfer is the Cache Coherency Interface (CCI). The CCI, the inter-Cell counterpart of the VLC architecture, allows lower-latency cache-to-cache data transfer between Cells.

To access data stored in a cache memory, a CPU needs information about the location and state of the cached data. By consulting this information, the CPU can retrieve the desired data from the appropriate cache.

Two main mechanisms exist for cache-to-cache data transfer between Cells: directory-based and TAG-based cache coherency. The cache information described above is stored in external memory (DIR memory) in the directory-based mechanism and within the chipset in the TAG-based mechanism.

In a directory-based system, the requesting CPU first accesses the external memory to confirm the location of the cached data and then accesses the appropriate cache memory. In a TAG-based system, by contrast, the requesting CPU broadcasts a request to all other caches simultaneously via the TAG.
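As a rough illustration of the two lookup flows, the following Python sketch models a hypothetical two-Cell system. The Cell names, data structures, and line addresses are assumptions made for the example, not details of the actual hardware.

# Conceptual sketch of the two inter-Cell coherency lookups described above.
def directory_lookup(line, dir_memory, cells):
    """Directory based: consult DIR memory first, then fetch from the one
    cache that the directory says holds the line."""
    owner = dir_memory.get(line)             # step 1: external DIR memory access
    if owner is None:
        return None                          # line is not cached on any Cell
    return cells[owner]["cache"].get(line)   # step 2: targeted cache access

def tag_broadcast_lookup(line, cells, requester):
    """TAG based: broadcast the request to every other Cell at once and
    take the reply from whichever cache holds the line."""
    for name, cell in cells.items():
        if name == requester:
            continue
        if line in cell["tag"]:              # TAG hit identifies the cache to read
            return cell["cache"][line]
    return None

# Hypothetical two-Cell system for illustration.
cells = {
    "cell0": {"tag": {"0x40"}, "cache": {"0x40": "data-A"}},
    "cell1": {"tag": set(),    "cache": {}},
}
dir_memory = {"0x40": "cell0"}

print(directory_lookup("0x40", dir_memory, cells))
print(tag_broadcast_lookup("0x40", cells, requester="cell1"))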
Crossbar-less configuration
Improved data transfer latency through a direct-attached Cell configuration
Within the Express5800/1000 series server lineup, the 1080Rf lowers data transfer latency by removing the crossbar and directly connecting Cell to Cell and Cell to PCI box.
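A minimal sketch of the idea, assuming a single fixed cost per link traversed; the topologies and the per-hop cost are illustrative assumptions, not 1080Rf measurements.

# Illustrative hop counting for Cell-to-Cell transfers with and without a crossbar.
def path_latency(path, hop_cost_ns=50):
    """Latency grows with the number of links traversed (assumed uniform cost)."""
    return (len(path) - 1) * hop_cost_ns

crossbar_path = ["cell0", "crossbar", "cell1"]  # traditional configuration
direct_path   = ["cell0", "cell1"]              # crossbar-less, direct attach

print("via crossbar :", path_latency(crossbar_path), "ns")
print("direct attach:", path_latency(direct_path), "ns")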
[Figure: Very Large Cache (VLC) Architecture. Increased enterprise application performance through reduced cache memory access latency, with direct CPU-to-CPU cache-to-cache transfers within a Cell. Latency vs. data size charts for the Intel® Itanium® 2 processor (Madison: L3 9MB) and the Dual-Core Intel® Itanium® processor (Montvale: L3 24MB).]
The benefit of the TAG-based mechanism, which is the one implemented in the Express5800/1000 series server, is that by consulting the TAG, unnecessary inquiries to the cache memories are filtered out, allowing a smoother transfer of data. Furthermore, the Express5800/1000 series server includes a dedicated high-speed Cache Coherency Interface (CCI) that connects the Cells directly to one another without using a crossbar. This interface carries broadcasts and other cache coherency transactions, allowing even faster cache-to-cache data transfer.
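The filtering role of the TAG can be sketched as a simple presence-tracking structure. The class and method names below are hypothetical; this is only a conceptual model of how a TAG lookup avoids probing caches that cannot hold the requested line.

# Minimal sketch of TAG-style filtering of cache inquiries (assumed structures).
from collections import defaultdict

class TagFilter:
    def __init__(self):
        self._presence = defaultdict(set)  # line address -> Cells that cached it

    def record_fill(self, line, cell):
        """Note that a Cell has brought a line into its cache."""
        self._presence[line].add(cell)

    def cells_to_probe(self, line, requester):
        """Return only the Cells whose caches need to be asked; all other
        inquiries are filtered out."""
        return {c for c in self._presence[line] if c != requester}

tag = TagFilter()
tag.record_fill("0x80", "cell0")
print(tag.cells_to_probe("0x80", requester="cell1"))  # {'cell0'}
print(tag.cells_to_probe("0xC0", requester="cell1"))  # set(): no probe needed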
[Figure: TAG-based cache coherency: the request is broadcast to all CPUs simultaneously via the TAG; the Express5800/1000 series server implements a dedicated connection (CCI) for snooping. Directory-based cache coherency: the directory is accessed first to confirm the location of the data, then the appropriate cache memory is accessed.]
Even with the crossbar-less configuration, virtualization of the Cell card and I/O box has been retained so as not to diminish computing and I/O resources.
[Figure: Split BUS Architecture. Cache-to-cache transfers pass through the chipset over the FSB, resulting in higher cache memory access latency, non-uniform cache-to-cache data transfer, inconsistent performance, and overhead from transferring data through the chipset; latency to the L3 of a CPU on a different FSB degrades by approximately 3x, an effect that grows as cache size increases. Latency vs. data size charts for the Intel® Itanium® 2 processor (Madison: L3 9MB) and the Dual-Core Intel® Itanium® processor (Montvale: L3 24MB); the charts do not depict actual numbers. Additional diagrams compare TAG-based and directory-based cache coherency and show the performance increase with the A3 chipset. Legend: CPU requesting the information; CPU storing the newest information; memory storing the location information; TAG memory (manages cache line information for all of the CPUs loaded on a Cell card); DIR memory (manages cache line information for all of the memory loaded on a Cell card).]