MCSDK HPC 3.x MPI over Hyperlink
Open MPI over Hyperlink
Version 1.0.0.21
User Guide
Last updated: 09/18/2015
Introduction
Hyperlink is a TI-proprietary high-speed, point-to-point interface with 4 lanes running at up to 12.5 Gbps per lane (maximum transfer of roughly 4-4.5 GB/s). More information can be found in the Hyperlink User Guide. At a lower Hyperlink speed (4 lanes @ 3.125 Gbps), MPI benchmarks show a bandwidth of ~500 MB/s (application-level throughput).
Terminology
The following terminology is used on this page:
- Node: A K2H device.
- Cartridge: A group of interconnected nodes.
- Topology: A group of K2H nodes connected via Hyperlink in a particular fashion.
Topology
There are two Hyperlink ports per K2H node, which allows each node to connect to two adjacent nodes via Hyperlink. Please note that it is not necessary to use the same port number (0 or 1) on both ends of the connection; e.g., it is possible to have port 0 of the first K2H connected to port 1 of the second K2H. Here are a few possible topologies based on the Hyperlink transport interface:
- Two nodes connected using Hyperlink port 0
- Three nodes using both Hyperlink ports (0 and 1) between all 3 nodes, and allowing any-to-any connectivity.
- Four nodes using both Hyperlink ports (0 and 1), based on ring topology.
Communication between nodes
In the 2-node and 3-node topologies, the communication between neighboring nodes is straightforward (the neighboring node's memory is mapped into the local address space).
In the case of the ring topology with 4 nodes, it is possible to support communication between non-adjacent ("diagonal") nodes by virtue of overlapping remotely mapped memory regions. The interim node (e.g., node 2 between nodes 1 and 3) provides memory for the data exchange without any software dedicated to this purpose running on node 2. So the Hyperlink transport can be used for MPI communication between 4 nodes as well.
Communication between diagonal (non-neighboring) nodes is possible only if the neighboring node is also running the same application (the same "communication world"). This type of communication is achieved by mapping a block of memory in the neighboring node over both Hyperlink ports (no SW assistance is needed in the neighboring node). In order to keep the communication balanced, message traffic is split: e.g., messages from node 1 to node 3 are sent through memory in node 2, whereas messages from node 3 to node 1 are sent through memory in node 4. Likewise, messages from node 2 to node 4 are sent through memory in node 3, whereas messages from node 4 to node 2 go through memory in node 1.
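The balancing rule above can be summarized by a small relay-selection function. The C sketch below is purely illustrative (the node numbering scheme and the function are not part of the BTL code):

#include <stdio.h>

/* Illustrative diagonal-routing rule for a 4-node ring (nodes numbered 1..4):
 * a message to the non-adjacent node is staged in the memory of the sender's
 * clockwise neighbor. */
static int relay_for_diagonal(int src, int dst)
{
    /* Only the pairs (1,3), (3,1), (2,4), (4,2) are diagonal in this ring. */
    if ((src + dst) % 2 != 0 || src == dst)
        return -1;              /* adjacent or identical: no relay needed */
    return (src % 4) + 1;       /* clockwise neighbor of the sender */
}

int main(void)
{
    printf("1 -> 3 via node %d\n", relay_for_diagonal(1, 3)); /* node 2 */
    printf("3 -> 1 via node %d\n", relay_for_diagonal(3, 1)); /* node 4 */
    printf("2 -> 4 via node %d\n", relay_for_diagonal(2, 4)); /* node 3 */
    printf("4 -> 2 via node %d\n", relay_for_diagonal(4, 2)); /* node 1 */
    return 0;
}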
Open MPI over Hyperlink
A new BTL module has been added to ti-openmpi (based on Open MPI 1.7.1) to support transport over Hyperlink (via mpm-transport). MPI Hyperlink communication is driven by the A15 only and uses the abstraction of low-level transport operations provided by the mpm-transport library (MCSDK 3.0.4 and later). Hyperlink BTL support is seamlessly integrated into the Open MPI run-time and is available after MCSDK-HPC installation.
Testing MPI over Hyperlink
Build the nbody example
The nbody example's source code is available as part of the MPI installation, in /usr/share/ti/examples/openmpi/nbody. On one of the nodes, copy this source code into the home directory and compile it:
cd ~
cp -r /usr/share/ti/examples/openmpi/nbody .
cd ~/nbody
make
Copy the resulting nbody executable to all the nodes participating in the test (in the command below, $1 and $2 stand for the remote node's hostname and username, respectively):
scp -r ~/nbody $2@$1:~/
Change directory to the location of the nbody executable
cd ~/nbody/
Running the nbody example
Running on two nodes
A command line configuring the nbody example for data exchange over Hyperlink on 2 nodes, with increased verbosity:
/opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 --mca btl self,hlink -np 2 -host k2hnode1,k2hnode2 ./nbody 1000
For another pair of nodes (on a platform with 4 nodes and ring topology), with increased verbosity:
/opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 --mca btl self,hlink -np 2 -host k2hnode1,k2hnode4 ./nbody 1000
For two nodes, performance testing can be done by running the mpptest application as shown below.
/opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 2 -host c1n1,c1n2 ./mpptest -size 0 1048576 128 -sync -quick -logscale
Running on four nodes
For testing with 4 nodes, run the following:
/opt/ti-openmpi/bin/mpirun --mca btl self,hlink -np 4 -host k2hnode1,k2hnode2,k2hnode3,k2hnode4 ./nbody 1000
Please note that the optional MCA parameters "--mca orte_base_help_aggregate N1 --mca btl_base_verbose N2" can be appended to the mpirun command above to tune the verbosity.
MPI BTL over Hyperlink Implementation Details
MPI BTL HLINK is built on top of the mpm-transport library provided in MCSDK (as part of the PDK component inside MCSDK).
A K2H device has 2 Hyperlink ports (0 and 1), allowing one SoC to connect directly to two neighboring SoCs. Daisy chaining is not supported by the HW, but additional connectivity can be obtained by mapping a common memory region in an intermediate node (exposed to the adjacent devices using both Hyperlink ports of that node). This method is used to create the ring connection described in the topologies above (3 or 4 nodes).
The Hyperlink transport interface enables access (optionally driven by EDMA) to the connected SoC through memory-mapped windows, allowing native RDMA (data exchange initiated from one SoC without any SW activity on the connected device).
One device can create multiple windows (up to 64) of various sizes (64KB-4MB) into the address space of the remote device.
During initialization (MPI_Init()), mca_btl_hlink_component_init() is invoked, which probes the available Hyperlink devices: it tries to open both Hyperlink interfaces with a timeout value of 2 seconds.
If the remote node is also attempting to open the Hyperlink interface (using mpm-transport-open), the connection is established and several static memory-mapped windows are opened, pointing to the memory of the remote device (a conceptual sketch of this initialization follows the list below):
- The remote MSMC region (mapped over Hyperlink) is 256KB in size.
- The remote DDR3 region (mapped over Hyperlink) is 16MB in size (up to 32 fragments, each 64KB in size, can be stored in the remote node without additional synchronization between the nodes).
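The C sketch below outlines this initialization only conceptually and is not the actual BTL source; the helpers hlink_try_open() and hlink_map_remote_window() are hypothetical stand-ins for the mpm-transport calls, and the window sizes are taken from the list above.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PROBE_TIMEOUT_MS  2000                 /* 2-second probe per Hyperlink port */
#define MSMC_WINDOW_SIZE  (256u * 1024u)       /* remote MSMC mailbox window        */
#define DDR3_WINDOW_SIZE  (16u * 1024u * 1024u) /* remote DDR3 fragment window      */

/* Hypothetical stand-ins for the mpm-transport open/map calls, stubbed here so
 * the sketch is self-contained (the real BTL goes through mpm-transport). */
static bool hlink_try_open(int port, int timeout_ms) { (void)port; (void)timeout_ms; return true; }
static void *hlink_map_remote_window(int port, uint32_t size) { (void)port; return malloc(size); }

struct hlink_port_state {
    bool  up;
    void *remote_msmc;   /* mailbox/FIFO region in the remote node   */
    void *remote_ddr3;   /* fragment landing area in the remote node */
};

/* Conceptual outline of mca_btl_hlink_component_init(): probe both ports and,
 * for each port that comes up, map the two static windows into remote memory. */
static void probe_hyperlink_ports(struct hlink_port_state state[2])
{
    for (int port = 0; port < 2; port++) {
        state[port].up = hlink_try_open(port, PROBE_TIMEOUT_MS);
        if (!state[port].up)
            continue;
        state[port].remote_msmc = hlink_map_remote_window(port, MSMC_WINDOW_SIZE);
        state[port].remote_ddr3 = hlink_map_remote_window(port, DDR3_WINDOW_SIZE);
    }
}

int main(void)
{
    struct hlink_port_state state[2] = { { false, NULL, NULL }, { false, NULL, NULL } };
    probe_hyperlink_ports(state);
    return 0;
}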
The MSMC region is used as a mailbox (or FIFO) to indicate the arrival of the next fragment. These mailboxes are constantly polled (local reads) in the mca_btl_hlink_component_progress() function, waiting for new arrivals.
MPI BTL HLINK mainly "pushes" data (writes) to the remote SoC. Any reads are local (except for rare accesses to the remote MSMC mailbox control structure to maintain back-pressure by checking the number of posted but unconsumed mails).
A fragment (MPI splits a longer buffer into multiple fragments, e.g. 64KB or 128KB) is first "pushed" to the remote SoC's DDR, and then a short mail is posted to the remote SoC's MSMC.
Up to 32 mails can be posted before data "pushing" is paused (the producer first checks whether there is room in the remote mailbox/FIFO).
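A minimal C sketch of this mailbox/back-pressure scheme, assuming a simple head/tail control structure (the structure layout and field names are illustrative assumptions, not the actual BTL data structures):

#include <stdint.h>
#include <stdbool.h>

#define HLINK_MAX_MAILS  32   /* FIFO depth: posted but unconsumed mails */

/* Assumed layout of the mailbox control structure kept in remote MSMC.
 * The producer writes it over Hyperlink; the consumer polls it locally. */
struct hlink_mailbox {
    volatile uint32_t head;                      /* next slot the producer fills  */
    volatile uint32_t tail;                      /* next slot the consumer drains */
    volatile uint32_t frag_len[HLINK_MAX_MAILS]; /* length of each posted fragment */
};

/* Producer side: check for room before pushing another fragment (back-pressure). */
static bool mailbox_has_room(const struct hlink_mailbox *mb)
{
    return (mb->head - mb->tail) < HLINK_MAX_MAILS;
}

/* Producer side: post a mail after the fragment data has been written to the
 * remote DDR3 window (in the real BTL this write goes out over Hyperlink). */
static void mailbox_post(struct hlink_mailbox *mb, uint32_t len)
{
    mb->frag_len[mb->head % HLINK_MAX_MAILS] = len;
    mb->head++;                                  /* publish the new fragment */
}

/* Consumer side: the kind of local polling done in
 * mca_btl_hlink_component_progress(); returns true if a fragment arrived. */
static bool mailbox_poll(struct hlink_mailbox *mb, uint32_t *len)
{
    if (mb->tail == mb->head)
        return false;                            /* nothing new */
    *len = mb->frag_len[mb->tail % HLINK_MAX_MAILS];
    mb->tail++;                                  /* consume the mail */
    return true;
}

int main(void)
{
    struct hlink_mailbox mb = { 0 };
    uint32_t len;
    if (mailbox_has_room(&mb))
        mailbox_post(&mb, 64 * 1024);
    return mailbox_poll(&mb, &len) ? 0 : 1;
}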
Data transfers are done by first memcpy-ing the fragment data from the Linux user-space buffer to a physically contiguous memory block (a reserved MSMC memory scratch area), followed by an A15 (or EDMA) transfer to the remote SoC's DDR3 memory. An A15 memcpy() is used for transfers smaller than 16 KB, and EDMA for bigger transfers. This is done for both direct and diagonal connections.
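A hedged C sketch of this staging and copy-engine selection (edma_copy() and the buffer names are placeholders; the real BTL drives EDMA through mpm-transport):

#include <string.h>
#include <stddef.h>

#define HLINK_EDMA_THRESHOLD  (16 * 1024)   /* A15 memcpy below 16KB, EDMA above */

/* Placeholder for an EDMA-driven copy into the remote DDR3 window; stubbed
 * with memcpy so this sketch is self-contained. */
static void edma_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Stage the fragment in contiguous MSMC scratch first, then push it to the
 * remote SoC's DDR3 window, choosing the copy engine by fragment size. */
static void push_fragment(void *remote_ddr3_window, void *msmc_scratch,
                          const void *user_buf, size_t len)
{
    memcpy(msmc_scratch, user_buf, len);              /* user space -> contiguous scratch */
    if (len < HLINK_EDMA_THRESHOLD)
        memcpy(remote_ddr3_window, msmc_scratch, len);    /* small: A15 copy  */
    else
        edma_copy(remote_ddr3_window, msmc_scratch, len); /* large: EDMA copy */
}

int main(void)
{
    static char remote[64 * 1024], scratch[64 * 1024], user[64 * 1024];
    push_fragment(remote, scratch, user, sizeof(user));
    return 0;
}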
On the RX side, an independent receiver thread is started to 'drain' incoming messages and move them from the CMEM memory region (which is limited) to Linux user-space buffers. This is an important feature for handling bursts of non-blocking messages. This additional thread can be disabled with the optional MCA parameter "--mca btl_hlink_rx_thread <0|1>" (the default setting is '1'). By disabling this thread (setting it to 0), BW can be improved by ~10%, but the maximum size of a non-blocking message should then be limited to 1MB. Disabling the RX thread is also necessary if EDMA is used for diagonal connections.
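A rough C sketch of such a drain thread, assuming a trivial 'fragment ready' flag in place of the real CMEM bookkeeping (the structure and buffer sizes are illustrative only):

#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical state shared between the BTL progress path and the RX thread;
 * in the real BTL the staging area is the limited CMEM region. */
struct rx_state {
    volatile bool  stop;
    volatile bool  frag_ready;
    char           cmem_staging[64 * 1024];   /* stand-in for the CMEM region */
    char           user_buf[64 * 1024];       /* Linux user-space destination */
};

/* RX drain thread: keep moving arrived fragments out of the (small) CMEM
 * staging area into user-space buffers so bursts of non-blocking messages
 * do not exhaust it. Can be disabled via --mca btl_hlink_rx_thread 0. */
static void *rx_drain_thread(void *arg)
{
    struct rx_state *rx = arg;
    while (!rx->stop) {
        if (rx->frag_ready) {
            memcpy(rx->user_buf, rx->cmem_staging, sizeof(rx->cmem_staging));
            rx->frag_ready = false;
        } else {
            usleep(10);   /* nothing pending; avoid burning the A15 */
        }
    }
    return NULL;
}

int main(void)
{
    struct rx_state rx = { .stop = false, .frag_ready = false };
    pthread_t tid;
    pthread_create(&tid, NULL, rx_drain_thread, &rx);
    rx.stop = true;
    pthread_join(tid, NULL);
    return 0;
}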
Known issues
- Only one MPI rank per K2H is allowed, i.e. multi-process/thread operation in this release is not supported.
- Hyperlink 12.5gbps operation is not reliable on some PCB designs. Instead, 6.25gbps/"full" or 6.25gbps/"half" should be used (if failures are observed, please try reducing the speed first).
- On an EVM BoC setup, the default /etc/mpm/mpm_config.json needs to be modified for the lower speed (from "12p5" to "6p25" and/or from "full" to "half" or "quarter"):
... { "name": "hyplnk0-remote", "transporttype": "hyperlink", "direction": "remote", "hyplnkinterface": "hyplnk0", "txprivid": 12, "rxprivid": 12, "rxsegsel": 6, "rxlenval": 21, "refclkMHz": "312p5", "serialrate": "6p25", "lanerate": "quarter", "numlanes": "4" }, { "name": "hyplnk1-loopback", "transporttype": "hyperlink", ... }, { "name": "hyplnk1-remote", "transporttype": "hyperlink", "direction": "remote", "hyplnkinterface": "hyplnk1", "txprivid": 12, "rxprivid": 12, "rxsegsel": 6, "rxlenval": 21, "refclkMHz": "312p5", "serialrate": "6p25", "lanerate": "quarter", "numlanes": "4" } ...
- If the Hyperlink SerDes is enabled, the SRIO connection may have failures (on certain PCB designs).