
MCSDK HPC 3.x OpenMPI



Open MPI Runtime

Version 1.0.0.21

User Guide

Last updated: 09/18/2015


Introduction

Open MPI is an open-source, high-performance implementation of MPI (Message Passing Interface), a standardized API for parallel and distributed computing. The current release is based on Open MPI 1.7.1 (www.open-mpi.org). An MPI program runs as multiple concurrent instances of the same executable on all nodes within an "MPI Communication World"; the instances communicate with each other using the Message Passing Interface APIs.


MPI_Intro.png
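As a minimal illustration of this programming model (a sketch only, not one of the TI-provided examples), every instance runs the same code and branches on its rank; here rank 0 sends a short message that rank 1 receives:

 /* Minimal MPI sketch (illustration only): every node runs this same binary;
  * rank 0 sends a greeting and rank 1 receives and prints it. */
 #include <mpi.h>
 #include <stdio.h>
 #include <string.h>

 int main(int argc, char *argv[])
 {
     char msg[64];
     int  rank, size;

     MPI_Init(&argc, &argv);                  /* join the MPI Communication World */
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id (rank) of this instance       */
     MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of instances        */

     if (rank == 0 && size > 1) {
         strcpy(msg, "hello from rank 0");
         MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
     } else if (rank == 1) {
         MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         printf("rank 1 received: \"%s\"\n", msg);
     }

     MPI_Finalize();
     return 0;
 }

Such a program is compiled with the mpicc wrapper and launched with mpirun, as described in the sections below.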

Documentation & Tutorials

Good documentation on Open MPI can be found at [1]. Video tutorials for Open MPI are available at [2], and a lot of additional information can be found at [3].

TI's Open MPI enhancements

The TI version of Open MPI extends Open MPI with two additional hardware transport layers, SRIO and Hyperlink. This was achieved by enhancing the BTL (Byte Transfer Layer) of Open MPI with support for SRIO and Hyperlink.
TI_Additions_Intro.png

More details on TI's Open MPI over SRIO can be found here [4].

More details on TI's Open MPI over Hyperlink can be found here [5].

Cluster definition constraints

A cluster is specified as a collection of multiple K2H nodes, but certain constraints apply. As a rule of thumb, it is highly advisable to populate complete cartridges (i.e. use all four nodes in a cartridge) and to select neighboring cartridges with many interconnect links.

  • TCP: Any combination of nodes can be used, since each K2H node is connected to the router (star topology). If HLINK or SRIO connections are used, the following constraints apply:
  • HLINK: Clusters with two or four nodes are allowed. If a three-node cluster is used, the TCP BTL needs to be specified as well, to maintain a link between the edge nodes (those that are not immediate neighbors). If a bigger cluster is used, please make sure that fully populated cartridges (all four nodes in a cartridge) are specified. In that case HLINK is always selected, as it is the highest-performing transport interface.
  • SRIO: Disjoint nodes in a cluster are not allowed; there has to be a physical SRIO connection between at least two nodes in the cluster. SRIO traffic is not routed via nodes that are not in the cluster. Please refer to the picture below for the 2D-torus SRIO interconnect topology:

Moonshot-2D-torus-topology.png
Examples of good SRIO/HLINK clusters:

  • c1n1 c1n2 c1n3 c1n4 c2n1 c2n2 c2n3 c2n4
  • c1n1 c1n2 c1n3 c1n4 c4n1 c4n2 c4n3 c4n4

Example of a bad SRIO/HLINK cluster:

  • c1n1 c1n2 c1n3 c1n4 c15n1 c15n2 c15n3 c15n4 (c1n nodes and c15n nodes cannot communicate over SRIO or Hyperlink)


HPC programming paradigms

Here is an overview of how Open MPI works across multiple TI Keystone devices:


Open MPI_TI_Implementation details.png


Open MPI Run-time Parameters

This section details Open MPI's run-time environment and the parameters that can be useful for configuration and debugging.

MCA Parameters

MCA parameters are the basic unit of run-time tuning for Open MPI. Access to these parameters allows users to change internal Open MPI values at run time. If a task can be implemented in multiple, user-discernible ways, it is optimal to implement as many as possible and let an MCA parameter choose between them. This facility is provided by the MCA base, but it is not restricted to the MCA components of frameworks: the OPAL, ORTE, and OMPI projects all have "base" parameters. The MCA base allows users to be proactive and tweak Open MPI's behavior for their environment, and to experiment with the parameter space to find the best configuration for their specific system.

MCA Parameters Example and Usage

Here are two examples of MCA parameter usage.

Get the MCA information

The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters.

Show all the MCA parameters for all components that ompi_info finds:

/opt/ti-openmpi/bin/ompi_info --all

Show all the MCA parameters for all BTL components:

/opt/ti-openmpi/bin/ompi_info --param btl all

Show all the MCA parameters for the TCP BTL component:

/opt/ti-openmpi/bin/ompi_info --param btl tcp

MCA Usage

The mpirun command executes serial and parallel jobs in Open MPI:

/opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2node1,k2node2 /home/mpiuser/nbody 1000

where /home/mpiuser/nbody is the job that will be run in parallel on the two nodes k2node1 and k2node2.

Some useful TI-specific MCA parameters

  • Select a BTL subset:
 --mca btl self,srio,hlink ... list of the BTLs you want to use
  • Disable the Hyperlink RX thread:
 --mca btl_hlink_rx_thread 0 ... by default the thread is enabled (1), which allows larger non-blocking message sizes.
 But for non-blocking messages up to 2 MB, and for 10-15% improved performance, it is better to disable this thread. It significantly (20x) improves bandwidth over diagonal links.
  • Increase the SRIO PDSP CREDIT period:
 --mca btl_srio_pdsp_credit_period 4 ... by default it is set to 1, which means one credit packet is sent after each data message.
 Increasing this period (2..4) increases SRIO bandwidth for multi-hop paths (it allows message pipelining).


An MPI Example

One of the simplest Open MPI demos that runs on a cluster of A15s is testmpi. This example collects and distributes the hostnames of the nodes in the MPI Communication World (the group of nodes executing a common program).

Building the testmpi example

The source code of testmpi can be obtained by fetching the ti-openmpi source package:

 apt-get source ti-openmpi

The testmpi source file (testmpi.c) and its Makefile will be present at ti-openmpi-1.0.0.21/ti-examples/testmpi.

Now go to that directory and build the testmpi application:

 cd ti-openmpi-1.0.0.21/ti-examples/testmpi
 make

Make sure that the testmpi application is built and present at the same location on all participating nodes before attempting to run it.


Running the testmpi example

The testmpi application uses MPI APIs to initialize and set up MPI, print a Hello World message from the A15 cluster, and finalize MPI. The testmpi demo is executed using the standard MPI runtime commands. MPI offers many command-line arguments to tune run-time behavior, such as selecting verbosity or transport interfaces.

     ~/ti-openmpi-1.0.0.21/ti-examples/testmpi# /opt/ti-openmpi/bin/mpirun --mca orte_base_help_aggregate 0 --mca btl self,tcp -np 2 -host k2hnode1,k2hnode2 ./testmpi

The command above runs the testmpi application over two K2H nodes, k2hnode1 and k2hnode2.

The output would look like this:

Hello world from processor k2hnode1, rank 1 out of 2 processors
locally obtained hostname k2hnode1
Hello world from processor k2hnode2, rank 0 out of 2 processors
locally obtained hostname k2hnode2


Selecting the BTL

The majority of the MPI framework is agnostic of transport-interface details. BTL (Byte Transfer Layer) modules provide the implementations of the various transport interfaces. The argument "--mca btl XXX,YYY,ZZZ" defines the list of transport interfaces that will be used for data exchange during execution. The current implementation offers five BTL transport interfaces: loopback (self), shared memory (sm), TCP/IP (tcp), Serial RapidIO (srio), and Hyperlink (hlink).
In the example above, only the self and TCP BTLs are used. The MPI programming paradigm assumes that any node in the communication world can communicate with any other node in the communication world. This is obvious for "star-like" topologies, e.g. nodes connected to a switch over Ethernet, or processes running on the same node using the shared-memory transport interface. In general, unconstrained communication capability requires external switching resources. The ti-openmpi solution (based on Open MPI 1.7.1) provides two new BTLs, srio and hlink, for the K2H-specific transport interfaces.

  • The Serial RapidIO BTL (srio) allows MPI communication between nodes using the on-chip (Navigator/PDSP) router [6]. More details on MPI over SRIO are provided here [7].
  • The Hyperlink BTL (hlink) allows MPI communication between nodes over the Hyperlink interface. More details are provided here [8].

Transport interfaces are always selected (if a physical connection exists) in the following way:

  • HLINK is preferred over SRIO and TCP
  • SRIO or HLINK is preferred over TCP.

In some (rare) cases, if the cluster traffic involves a high number of hops, TCP might be preferred over SRIO. The number of hops and the SRIO routing can be checked using the routingTableGenTest utility (installed with mcsdk-hpc); details are provided in [9].
Only one transport interface is selected for a given pair of nodes, but one node can use all three interfaces towards different nodes.
If the SRIO or HLINK BTL is used, only one MPI rank per K2H node is allowed.
If TCP and SM are used, it is possible to have more than one rank per K2H node.

Note: Launching and initial interfacing (e.g. the exchange of TCP ports) of all instances is handled by the ORTED process (specific to Open MPI), which is typically started over SSH. Properly configured SSH is therefore necessary (TCP/IP connectivity is needed independent of the other available transport interfaces). Please note that the IP address and the hostname of the same SoC are treated as separate SSH entries (with keys typically copied using ssh-copy-id).


Code Walkthrough of the testmpi example

Here are some details regarding the MPI example code and how it is built.

Building testmpi

A closer look at the Makefile (present at ti-openmpi-1.0.0.21/ti-examples/testmpi) reveals that it exports the environment variables and defines the rules for building, installing, and cleaning the testmpi executable.

include ../../make.inc
export OPAL_PREFIX=${TARGET_ROOTDIR}/opt/ti-openmpi
export PATH:=${TARGET_ROOTDIR}/opt/ti-openmpi/bin:$(PATH)
export LD_LIBRARY_PATH=${TARGET_ROOTDIR}/opt/ti-openmpi/lib:${TARGET_ROOTDIR}/lib
export C_INCLUDE_PATH=${TARGET_ROOTDIR}/opt/ti-openmpi/include
EXECS=testmpi
MPICC=${TARGET_ROOTDIR}/opt/ti-openmpi/bin/mpicc
all: ${EXECS}
testmpi: testmpi.c
	${MPICC} -o testmpi testmpi.c
clean:
	rm ${EXECS}
$(TARGET_ROOTDIR)$(INSTALL_DIR)/testmpi:
	mkdir -p $(TARGET_ROOTDIR)$(INSTALL_DIR)/testmpi
install: $(TARGET_ROOTDIR)$(INSTALL_DIR)/testmpi
	cp ${EXECS} README $(TARGET_ROOTDIR)$(INSTALL_DIR)/testmpi


mpicc is a wrapper compiler, built on top of gcc, for compiling MPI programs.

Source Code Analysis

The source code for this testmpi example is testmpi.c. A closer look reveals the following:

 MPI_Init (&argc, &argv);                /* Startup: start MPI */
 MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* Who am I? Get the current process id (rank) */
 MPI_Comm_size (MPI_COMM_WORLD, &size);  /* How many peers do I have? Get the number of processes */
 {
    /* Get the name of the processor */
    char processor_name[320];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n", processor_name, rank, size);
    gethostname(processor_name, 320);
    printf("locally obtained hostname %s\n", processor_name);
 }
 MPI_Finalize(); /* Finish the MPI application and release resources */


As you can see, MPI_* routines are used to run the example over multiple nodes (k2hnode1 and k2hnode2 in our example). More details about the MPI APIs can be found at [10].
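The collection and distribution of hostnames mentioned earlier can be done with a single collective call. As a hedged sketch (the shipped testmpi.c may implement this differently), MPI_Allgather gives every rank the processor names of all ranks:

 /* Sketch only (the shipped testmpi.c may differ): gather every rank's
  * processor name on all ranks. Assumes <stdlib.h> and <string.h> are
  * included and that rank and size are set as in the fragment above. */
 char  name[MPI_MAX_PROCESSOR_NAME];
 char *all_names;
 int   name_len, i;

 memset(name, 0, sizeof(name));
 MPI_Get_processor_name(name, &name_len);
 all_names = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
 MPI_Allgather(name,      MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all_names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);
 for (i = 0; i < size; i++)
     printf("rank %d runs on %s\n", i, &all_names[i * MPI_MAX_PROCESSOR_NAME]);
 free(all_names);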

Demos included in the MCSDK-HPC distribution

MCSDK-HPC provides various Open MPI demos that run on an A15 cluster, as listed below.

  • testmpi: Basic MPI loopback test from node A to node B (good for verifying A->B connectivity); reports the host names of the connected nodes.
  • nbody: Simplified 3D n-body example, up to 1000 particles.
  • openmpi_examples: Open MPI examples, including connectivity_c, hello_c, and ring_c.


The MPI paradigm can be combined with the OpenCL or OpenMPAcc paradigm on the same node. MPI is used for sharing the workload between multiple SoCs (more precisely, their A15s), and within the same SoC each workload item is dispatched from the A15 to the DSP. An example employing both paradigms is the opencl+openmpi demo available as part of MCSDK-HPC. There are also several examples explaining how to use openmpi+openmpacc (/usr/share/ti/examples/openmpi+openmpacc).
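Below is a hedged, MPI-only sketch of this hybrid pattern; the DSP dispatch itself (OpenCL or OpenMPAcc) is omitted, and total_work and process_on_dsp are hypothetical placeholders, not part of any MCSDK-HPC API:

 /* Sketch of the hybrid pattern (illustration only). Each rank takes a slice
  * of the workload; the slice would be dispatched from the A15 to the DSP via
  * OpenCL or OpenMPAcc (not shown), and the partial results are combined. */
 int    chunk = total_work / size;            /* total_work: hypothetical, application-defined */
 int    first = rank * chunk;                 /* this rank's slice of the work                 */
 int    last  = (rank == size - 1) ? total_work : first + chunk;
 double local_result, result;

 /* dispatch items [first, last) to the DSP here, e.g. an OpenCL kernel enqueue (not shown) */
 local_result = process_on_dsp(first, last);  /* hypothetical helper */

 MPI_Reduce(&local_result, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
 if (rank == 0)
     printf("combined result: %f\n", result);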

References

Troubleshooting Guidelines

Dependencies

This version of Open MPI depends on other packages for proper operation. If it was installed as part of the mcsdk-hpc product, these dependencies are most likely already resolved. However, they are listed here for reference and as a first step in any troubleshooting.

  • MPM-transport module: this package allows the A15 to read/write the shared memory from user space. "ls -l /dev/dsp*" should list read/write permissions for user/group/other. The dynamic shared library objects should be installed in /usr/lib ("ls -l /usr/lib/libmpmtransport.*so").
  • UIO kernel module: this package allows access to certain SoC configuration registers without "root" privileges. "ls -l /dev/uio*" should list read/write permissions for the devices /dev/uio8 and /dev/uio9, and "lsmod" should show the "uio_module_drv" module.
  • Hyperlink LLD (low-level driver) dynamic shared objects in /usr/lib of the target file system ("ls -l /usr/lib/libhyplnk.*so").
  • Password-less SSH communication between the master node (the node that creates all slave processes) and the slave nodes. This is typically achieved by copying keys to the slave nodes, e.g. using ssh-copy-id SLAVE_NODE_ID.
  • A list of the participating nodes in /etc/hosts.

