
OpenMP Accelerator Model 0.3.3



OpenMP™ Accelerator Model Version 0.3.3


User's Guide




TI defines the OpenMP Accelerator Model as the subset of the OpenMP 4.0 specification that enables execution on heterogeneous Systems-on-Chip (SoCs) with host CPUs and one or more on-chip target accelerators. The host CPUs can offload computation (code and data) to target accelerators. On the TI 66AK2H SoC, the host is a quad-core ARM Cortex-A15 cluster running SMP Linux. The target accelerator is a single cluster consisting of 8 C66x DSP cores. The OpenMP Accelerator Model host runtime implementation uses the TI OpenCL Runtime as a back-end.


  • OpenMP 4.0 Constructs supported:
    • #pragma omp target
    • #pragma omp declare target
    • #pragma omp target data
    • #pragma omp target update
  • TI Extensions:
    • __TI_omp_device_alloc, __TI_omp_device_free
    • local annotation for map() clause


Supported OpenMP 4.0 Constructs

#pragma omp target

The target construct specifies a region of code that is executed on the target accelerator device. It also specifies, via map() clauses, data movement from host to device before execution of the offloaded region and from device to host after execution. The map clause requires a data transfer direction or data allocation parameter to be specified using 'to', 'from', 'tofrom', 'alloc' or 'local'.

<source lang="cpp" strict enclose="div">
float a[1024];
float b[1024];
float c[1024];
int size;

void vadd_openmp(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to: a[0:size], b[0:size], size) map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }
}
</source>

  • Variables a, b, c and size initially reside in host (ARM Linux) memory
  • On encountering a target construct:
    • Space is allocated in device memory for variables a[0:size], b[0:size], c[0:size] and size
    • Any variables annotated ‘to’ are copied from host memory -> device memory
    • The target region is executed on the device
    • Any variables annotated ‘from’ are copied from device memory -> host memory
Figure: Mapping variables from host memory to device memory


#pragma omp declare target

The ‘declare target’ construct specifies variables and functions that are mapped to the device and can therefore be referenced in offloaded target regions. The OpenMP 4.0 specification mandates that each function specified in a declare target region must have definitions for both the host and the target accelerator device. In the code snippet below, the function DSPF_sp_fftSPxSP is called from within a target region. Its prototype must be placed in a declare target region.

<source lang="cpp" strict enclose="div" header="" footer="">
#pragma omp declare target
/* There must be a host and an accelerator target definition for this function */
void DSPF_sp_fftSPxSP(int N,
                      float *x, float *w, float *y,
                      unsigned char *brev,
                      int n_min, int offset, int n_max);
#pragma omp end declare target

void dsplib_fft(int N, int bufsize,
                float *x, float *w, float *y,
                int n_min, int offset, int n_max)
{
    #pragma omp target map(to: N, x[0:bufsize], w[0:bufsize], \
                               n_min, offset, n_max) \
                       map(from: y[0:bufsize])
    {
        DSPF_sp_fftSPxSP(N, x, w, y, 0, n_min, offset, n_max);
    }
}
</source>

#pragma omp target data

The ‘target data’ construct can be used to create a device data environment by mapping host buffers to the target device before execution of an enclosed target region. The mapped buffers can be re-used by consecutive enclosed target regions and are valid for the entire scope of the 'target data' region. Nested 'target data' regions are also possible.

<source lang="cpp" strict enclose="div" header="" footer="">
/* Create device buffers for a, b, c and transfer data from Host -> Device for a, b */
#pragma omp target data map(to: a[0:size], b[0:size]) map(from: c[0:size])
{
    /* Existing device buffers are used and no data is transferred here */
    #pragma omp target
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] += a[i] + b[i];
    }
} /* Device -> Host data transfer of buffer c is done here */
</source>

#pragma omp target update

The ‘target update’ construct can be used to synchronize host or device buffers within a target data region as required. In the following example, buffer c is synchronized to the host in the first target update call and to the device in the second target update call. The update direction is specified using to and from clauses.

<source lang="cpp" strict enclose="div" header="" footer="">
void operate_on_host(int* buffer);

#pragma omp declare target
void operate_on_device_initial(int* srcA, int* srcB, int* result);
void operate_on_device_final(int* srcA, int* srcB, int* result);
#pragma omp end declare target

/* Create device buffers for a, b, c and transfer data from Host -> Device for a, b */
#pragma omp target data map(to: a[0:size], b[0:size]) map(from: c[0:size])
{
    /* Existing device buffers are used and no data is transferred here */
    #pragma omp target
    {
        /* Buffer c is modified with results while operating on input buffers a, b */
        operate_on_device_initial(a, b, c);
    }

    /* Copy modified target data to the host */
    #pragma omp target update from(c[0:size])
    ;

    /* Do some host-side processing */
    operate_on_host(c);

    /* Synchronize c with the target device data buffer */
    #pragma omp target update to(c[0:size])
    ;

    /* Offload more computation to the target device */
    #pragma omp target
    {
        operate_on_device_final(a, b, c);
    }
} /* Device -> Host data transfer of buffer c is done here again */
</source>

TI Extensions

Memory Management

On the TI 66AK2H SoC, the main memory is physically shared between the host ARM CPUs and the target DSP accelerators. In this shared address space, TI provides the following functions to allocate and free contiguous segments of memory that may be accessed by all ARM and DSP cores:

  • __TI_omp_device_alloc: Allocates a buffer in CMEM device memory with the given size and returns a host (ARM Linux) pointer to it.

<source lang="cpp" strict enclose="div" header="" footer="">
void* __TI_omp_device_alloc(int device_id, __TI_omp_mem_region mregion, int size);
</source>

  • __TI_omp_device_free: Frees the CMEM device memory identified by the given host pointer.

<source lang="cpp" strict enclose="div" header="" footer="">
void __TI_omp_device_free(int device_id, void* buffer);
</source>


Allocating chunks in shared memory avoids the expensive host->device and device->host copies that occur with map clauses. For more information on the device memory model, please refer to OpenCL Device Memory Model. Two memory regions are supported: MSMC SRAM and DDR. Using __TI_omp_device_alloc, it is possible to allocate a buffer in the much faster MSMC scratchpad RAM, which is globally accessible by all DSP cores.

<source lang="cpp" strict enclose="div" header="" footer="">
/* Allocate a 128 MB buffer in device DDR memory */
float* a = (float*) __TI_omp_device_alloc(0, TI_OMP_MEM_DDR, bufsize);

/* Initialize buffer on the host */
for (int i = 0; i < size; ++i)
{
    a[i] = 1.0;
}

/* Map to is a cache write-back operation on the host. No copy is performed.
   Map from is a cache invalidate operation on the host. No copy is performed. */
#pragma omp target map(to: a[0:size], size) map(from: a[0:size])
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        a[i] *= 2.0;
}

/* Free buffer */
__TI_omp_device_free(0, a);
</source>

local annotation for map() clause

TI added the 'local' keyword, which allows a chunk of data to be allocated in a DSP L2 scratchpad RAM (SRAM). Using 'local' with a host array section that is initialized on the host also implies that the data will not be copied over to DSP L2 memory; only scratch space is allocated. The scope of a 'local' mapped array section is the target region it is mapped to.

<source lang="cpp" strict enclose="div">
/* A 100 byte scratch buffer */
char* scratch_buffer = malloc(100);

#pragma omp target map(tofrom: a[0:size]) map(local: scratch_buffer[0:100])
{
    /* a[] is copied to the device.
       scratch_buffer[] is allocated in L2 SRAM and NOT copied to the device. */

    /* Perform operations on buffer a[] in DDR using the L2 SRAM scratch_buffer */
    operate_on(a, scratch_buffer);
} /* a[] is copied back to the host;
     scratch_buffer[] is NOT copied back to the host */
</source>

Data movement operations

The table lists data movement operations that are performed by the Accelerator Model runtime for the various target constructs. This information is specific to TI's implementation of the OpenMP Accelerator Model.

Data Movement Operations (the operations are the same for map clauses on 'target'/'target data' constructs and for 'target update'):

  • Host variable in Linux space (heap/stack/global):
    • To target: memcpy from host memory to target memory
    • From target: memcpy from target memory to host memory
  • Host variable allocated via __TI_omp_device_alloc:
    • To target: ARM cache writeback of the address range specified by the clause
    • From target: ARM cache invalidate of the address range specified by the clause
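A minimal sketch contrasting the two cases above; the function process() and its buffers are hypothetical, with heap_buf assumed to come from malloc() and cmem_buf from __TI_omp_device_alloc():

<source lang="cpp" strict enclose="div">
void process(float *heap_buf, float *cmem_buf, int size)
{
    /* heap_buf was allocated with malloc(): map(to:) performs a memcpy into
       target memory and map(from:) performs a memcpy back to host memory */
    #pragma omp target map(to: heap_buf[0:size], size) map(from: heap_buf[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            heap_buf[i] += 1.0f;
    }

    /* cmem_buf was allocated with __TI_omp_device_alloc(): map(to:) is only an
       ARM cache writeback and map(from:) only an ARM cache invalidate */
    #pragma omp target map(to: cmem_buf[0:size], size) map(from: cmem_buf[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            cmem_buf[i] += 1.0f;
    }
}
</source>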


Known Issues & Limitations

OpenMP Specification Section 2.4 - Array Sections

  • Only the [lower-bound : length] syntax for mapping an array section is supported
  • Mapping array sections with a non-zero lower-bound is not supported:

<source lang="cpp" strict enclose="div">
#pragma omp target map(to: a[0:size])  /* This is supported */
{
    /* Target code */
}

#pragma omp target map(to: a[10:size]) /* This is NOT supported */
{
    /* Target code */
}
</source>

  • Mapping sub-array sections of mapped array sections is not supported

<source lang="cpp" strict enclose="div">
#pragma omp target data map(tofrom: a[0:size])
{
    /* This is supported */
    #pragma omp target
    {
        /* Target Code */
    }

    /* This is supported */
    #pragma omp target map(to: a[0:size])
    {
        /* Target Code */
    }

    /* This is NOT supported */
    #pragma omp target map(to: a[10:20])
    {
        /* Target Code */
    }
}
</source>

  • The 'target update' construct cannot be used on variables that are mapped to the device with 'declare target' constructs.
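The following sketch illustrates this restriction; the variable lookup_table and function refresh() are hypothetical and only serve to show the unsupported pattern:

<source lang="cpp" strict enclose="div">
#pragma omp declare target
int lookup_table[256];    /* variable mapped to the device via 'declare target' */
#pragma omp end declare target

void refresh(void)
{
    /* NOT supported: 'target update' on a 'declare target' variable */
    #pragma omp target update to(lookup_table[0:256])
    ;
}
</source>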

OpenMP Specification Section 2.9 - Device Constructs

  • The device() and if() clauses are not supported on target, target data and target update constructs
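For example (the buffers a and c are hypothetical), a target construct carrying these clauses is not accepted by this implementation:

<source lang="cpp" strict enclose="div">
/* NOT supported: device() and if() clauses on a target construct */
#pragma omp target device(0) if(size > 1024) map(to: a[0:size], size) map(from: c[0:size])
{
    /* Target code */
}
</source>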

TI OpenMP Accelerator Model Product

  • OpenMP 4.0 target constructs can only be used with C code
  • target update constructs must be followed by a semicolon on the next line

<source lang="cpp" strict enclose="div">

  #pragma omp target update from(c[0:size])
  ;

</source>

  • Creating libraries (static and dynamic) with target regions using CLACC is not supported
  • Host functions containing target regions should be written in a separate file. A file can contain multiple functions with target regions, and multiple such files can be input to clacc.
  • The sum of the sizes of all mapped array sections for a single target region must not exceed 1.5 GB. This is a target DSP device restriction and cannot be exceeded. More information is available here.
  • Any GCC pragmas or attributes in host code compiled using CLACC are not retained
  • Multiple concurrent target regions are not supported. If host code is multi-threaded, it must be ensured that only one target region is active across the threads. For example, in an OpenMP parallel region, place the target region code in a critical section (see the sketch after this list).
  • Avoid placing double-precision floating-point constants in host source code compiled using CLACC; there is a loss of precision during the source-to-source translation performed by CLACC.
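A minimal sketch of the critical-section pattern mentioned above; the function offload_from_threads() and its buffers are hypothetical:

<source lang="cpp" strict enclose="div">
void offload_from_threads(float *a, float *c, int size)
{
    #pragma omp parallel
    {
        /* ... per-thread host-side work ... */

        /* Serialize offloads so that only one target region is active at a time */
        #pragma omp critical
        {
            #pragma omp target map(to: a[0:size], size) map(from: c[0:size])
            {
                int i;
                #pragma omp parallel for
                for (i = 0; i < size; i++)
                    c[i] = a[i] * 2.0f;
            }
        }
    }
}
</source>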

Example Applications

OpenMP Accelerator Model applications have the following source directory file structure:

  • Host C/C++ file(s)
    • Files with the main() function are named <example_name>_main.cpp/c
    • Files with host equivalents of functions used within target regions are named <example_name>_host.cpp/c
  • Target C file(s)
    • Files with #pragma omp target and #pragma omp declare target regions are named <example_name>_target.c
  • Makefile
    • Includes a top level make.inc file

At present target regions are only supported in C source files. However, host code can be written in C or C++ and handled accordingly in the Makefile.
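As a sketch of this layout (using a hypothetical example name 'myapp'), the target code is kept in its own C file, while the host code containing main() may be C or C++:

<source lang="cpp" strict enclose="div">
/* myapp_target.c : all target regions live here (C only) */
void vadd(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to: a[0:size], b[0:size], size) map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }
}

/* myapp_main.c (or myapp_main.cpp) : host code containing main(), which
   allocates and initializes the buffers and calls vadd() */
</source>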

The package contains the following examples:

  • printf_debug: Use of printf() within an OpenMP target region for debugging. This example also demonstrates the use of OpenMP runtime library functions.
  • dotprod: Performs a vector dot product using C66x intrinsic functions within a target region. This example is written entirely in C. It also depends on the c_intrinsics_host_port package; instructions are in the Makefile.
  • dsplib_fft: Demonstrates the use of C66x libraries with the OpenMP Accelerator Model. It has a target region which calls a DSPLIB library function.
  • edmamgr: Use of EDMA Manager library functions to do accelerated memory transfers within an OpenMP target region. EDMA Manager allows a user to perform 1D and 2D memory transfers using EDMA3 co-processors.
  • vecadd: Small vector addition using a single OpenMP parallel region offloaded to the DSP using #pragma omp target.
  • vecadd_t: Small vector addition using OpenMP tasks within an OpenMP parallel region offloaded to the DSP using #pragma omp target.
  • vecadd_big: Large vector addition using the __TI_omp_device_alloc extension to pre-allocate memory in the shared ARM-DSP CMEM address space, eliminating memcpy overhead during OpenMP target region offload.
  • null: Measures the time overhead associated with offloading a single target region from ARM to DSP. Demonstrates the use of the __clock64() built-in function within a target region to count cycles elapsed on the target device.
  • target_implicit_map: Demonstrates implicit mapping of array sections when offloading target regions within a '#pragma omp target data' region.
  • target_orphan_call: Demonstrates placing a target region within a function and calling this function from within a '#pragma omp target data' region.
  • target_update: Shows how to use the '#pragma omp target update' construct to synchronize host and device array sections within a '#pragma omp target data' region.
  • vecadd_mpi: Use of offloaded target regions along with MPI calls using the TI OpenMPI library to perform vector addition across multiple nodes.
  • openmpbench_C_v3: EPCC v3 microbenchmarks to evaluate the performance of various OpenMP 3.0 constructs within an offloaded target region.



Building the application

The OpenMP Accelerator Model builds on TI's OpenCL product. In order to ensure that OpenCL is set up correctly, build and run any OpenCL example following the instructions specified in the OpenCL User's Guide. Set the following environment variables to the correct paths:

Environment Variable Value
TI_OCL_CGT_INSTALL Location of the TI CGT DSP compiler tools
TARGET_ROOTDIR Location of the K2H file-system if cross-compiling on x86/x86_64

If cross-compiling, source the ompacc_env.sh environment setup script provided in $TARGET_ROOTDIR/usr/share/ti/examples/openmpacc/.

TI provides a wrapper script, 'clacc', that is used to build a host binary with an embedded DSP binary. The Makefiles included in the example directories illustrate how to use clacc.

Figure: clacc wrapper compiler for OpenMP Accelerator Model compilation


  • Compile an example
    • $ cd $TARGET_ROOTDIR/usr/share/ti/examples/openmpacc/
    • $ cd vecadd
    • $ make
  • The Makefiles used in the examples set CL_OPTS to -v -k by default: -v shows debugging output and -k keeps the temporary files that are created. After compiling an example, the directory may be populated with temporary files: *.out, __TI_CLACC_KERNEL.c, *.asm, *.cl, *.dsp_h, *.bc, *.objc, *.if, *.map, *.opt, *.int.c, *.o, *.obj. Remove the -k option to stop keeping temporary files.
  • To display runtime DEBUG information, add the -d flag to CL_OPTS
  • To display runtime PERFORMANCE information, add the -p flag to CL_OPTS


Running the application

In order to run an example program on a TI 66AK2H platform:

  • Ensure the OpenCL package is installed and an OpenCL example has been built and tested
  • Ensure the OpenMPAcc package is installed
  • Ensure the compiled example executable is present on the platform file system
  • Run the executable


Change Log

0.3.3 (04-25-2014): MCSDK GA Release
  • Fixed the "target region without map clause" problem in the parser; removed the error description from the readme
  • Added examples: target_implicit_map, target_orphan_call, target_update, dotprod, openmpbench_C_v3
  • Performance optimization: CMEM cache writeback (cacheWb) and cache invalidate (cacheInv) operations are done only when needed
  • Supports applications written entirely in C

0.2.2 (02-26-2014): Patch release
  • Changed the order of -L, -I, -l options to match the order in which the user specifies them
  • MPAX problem fixed in OpenCL; TI_OCL_DSP_NOMAP removed

0.2.1 (02-24-2014): Patch release
  • Documented the non-zero array start problem in readme.txt
  • Added the printf_debug example back, as the bug was fixed in OpenCL
  • Fixed the CLACC multiple -l, -L, -I problem
  • Added the edmamgr example

0.2.0 (02-17-2014): First release


