OpenMP Accelerator Model User's Guide
OpenMP™ Accelerator Model v1.1.0
User's Guide
Introduction[edit]
The OpenMP™ 4.0 specification enables the use of OpenMP on heterogeneous systems by adding support for a set of device constructs. The OpenMP community uses the term "OpenMP Accelerator Model" to refer to this set. OpenMP 4.0 defines a host device on which the OpenMP program begins execution, and target devices onto which regions of code can be offloaded.
Texas Instruments' implementation of the OpenMP Accelerator Model targets the 66AK2H SoC. The 66AK2H SoC is composed of an ARM MPCore cluster with 4 Cortex-A15 CPUs (up to 1.4 GHz) and an octa-core C66x DSP cluster (up to 1.228 GHz). The Cortex-A15 quad cores are fully cache coherent, while the DSP cores do not maintain cache coherency. The DSP cores have no virtual memory support: they do not have an MMU and do not share the ARM MMU. Each DSP core has 32 KB of L1D cache, 32 KB of L1P cache, and 1 MB of L2 cache. Each of these caches is configurable and can be partitioned into scratchpad RAM (SRAM) as needed. The ARM cores also have 32 KB of L1D and L1P cache per core, but share a single 4 MB L2 cache. The Multicore Shared Memory Controller (MSMC) also provides 6 MB of SRAM which is shared by all ARM and DSP cores.
In this implementation, the host device is the ARM MPCore cluster running SMP Linux and the target device is the DSP cluster.
User's Guide[edit]
The User's Guide is located here.
Supported OpenMP Device Constructs[edit]
This product supports the following OpenMP 4.0 device constructs and runtime library routines (a short usage sketch of the runtime routines follows the list):
- #pragma omp target
- #pragma omp declare target
- #pragma omp target data
- #pragma omp target update
- omp_set_default_device()
- omp_get_default_device()
- omp_get_num_devices()
- omp_is_initial_device()
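The construct sections below do not illustrate the runtime library routines, so the following sketch shows how they might be used to query the available devices, select a default device, and detect host fallback. It is a minimal illustration only: the assumption that the DSP cluster is device 0, and the diagnostic output, are not part of this guide.
<source lang="cpp" strict enclose="div">
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Number of target devices available for offload */
    int num_devices = omp_get_num_devices();
    printf("Target devices available: %d\n", num_devices);

    /* Select the device used by subsequent target regions
       (device 0 is assumed to be the DSP cluster here) */
    if (num_devices > 0)
        omp_set_default_device(0);
    printf("Default device: %d\n", omp_get_default_device());

    /* omp_is_initial_device() returns nonzero when the calling code
       is executing on the host rather than on a target device */
    int on_host;
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
    printf("Target region %s\n",
           on_host ? "fell back to the host" : "ran on the target device");
    return 0;
}
</source>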
The following sections briefly introduce the supported constructs. Refer to the OpenMP 4.0 specification for details.
#pragma omp target[edit]
The target construct is used to specify the region of code that should be offloaded for execution onto the target device. The construct also creates a device data environment by mapping host buffers to the target for the extent of the associated region. Map clauses on the target construct specify data movement from host to device before execution of the offloaded region, and device to host after execution of the offloaded region. Map types alloc, to, from and tofrom indicate data allocation and transfer direction.
<source lang="cpp" strict enclose="div">
float a[1024];
float b[1024];
float c[1024];
int size;

void vadd_openmp(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to: a[0:size], b[0:size], size) map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }
}
</source>
In the example above, variables a, b, c and size initially reside in host (ARM Linux) memory. Upon encountering a target construct:
- Space is allocated in device memory for variables a[0:size], b[0:size], c[0:size] and size.
- Any variables annotated ‘to’ are copied from host memory -> device memory.
- The target region is executed on the device. Note that #pragma omp parallel for is used to distribute iterations of the for loop across the 8 DSP cores.
- Any variables annotated ‘from’ are copied from device memory -> host memory.
To reduce the overhead of copying variables between host and target devices, refer to the section on reducing offload overhead.
#pragma omp declare target[edit]
The declare target construct specifies that variables and functions are mapped to a device. Each function specified in a declare target region must have definitions for both the host and the target device. In the code snippet below, the function DSPF_sp_fftSPxSP is called from within a target region. Its prototype must be placed in a declare target region.
<source lang="cpp" strict enclose="div" header="" footer="">
#pragma omp declare target
/* There must be a host and accelerator target definition for this function */
void DSPF_sp_fftSPxSP(int N, float *x, float *w, float *y,
                      unsigned char *brev, int n_min, int offset, int n_max);
#pragma omp end declare target

void dsplib_fft(int N, int bufsize, float *x, float *w, float *y,
                int n_min, int offset, int n_max)
{
    #pragma omp target map(to: N, x[0:bufsize], w[0:bufsize], \
                               n_min, offset, n_max)          \
                       map(from: y[0:bufsize])
    {
        DSPF_sp_fftSPxSP(N, x, w, y, 0, n_min, offset, n_max);
    }
}
</source>
#pragma omp target data[edit]
The target data construct creates a device data environment by mapping host buffers to the target for the extent of the associated region. Target regions may be executed within a device data environment. Furthermore, mapped buffers may be re-used by multiple enclosed target regions and are valid for the entire scope of the 'target data' region. Target data regions may also be nested.
<source lang="cpp" strict enclose="div" header="" footer="">
/* Create device buffers for a, b, c and transfer data from Host -> Device for a, b */
#pragma omp target data map(to: a[0:size], b[0:size]) map(from: c[0:size])
{
    /* Existing device buffers are used and no data is transferred here */
    #pragma omp target
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] += a[i] + b[i];
    }
} /* Device -> Host data transfer of buffer c is done here */
</source>
#pragma omp target update[edit]
The target update construct is used to synchronize host or device buffers within a target data region as required. In the following example, buffer c is synchronized to the host in the first target update call and to the device in the second target update call. The update direction is specified using to and from clauses.
<source lang="cpp" strict enclose="div" header="" footer="">
void operate_on_host(int* buffer);

#pragma omp declare target
void operate_on_device_initial(int* srcA, int* srcB, int* result);
void operate_on_device_final(int* srcA, int* srcB, int* result);
#pragma omp end declare target

/* Create device buffers for a, b, c and transfer data from Host -> Device for a, b */
#pragma omp target data map(to: a[0:size], b[0:size]) map(from: c[0:size])
{
    /* Existing device buffers are used and no data is transferred here */
    #pragma omp target
    {
        /* Buffer c is modified with results while operating on input buffers a, b */
        operate_on_device_initial(a, b, c);
    }

    /* Copy modified target data to the host */
    #pragma omp target update from(c[0:size])

    /* Do some host side processing */
    operate_on_host(c);

    /* Synchronize c with the target device data buffer */
    #pragma omp target update to(c[0:size])

    /* Offload more computation to the target device */
    #pragma omp target
    {
        operate_on_device_final(a, b, c);
    }
} /* Device -> Host data transfer of buffer c is done here again */
</source>
TI Extensions[edit]
TI has implemented two extensions to the OpenMP Accelerator Model:
- An API to reduce the overhead of offloading target regions
- A local map type to map variables to target scratchpad memory; this memory has much lower access latency than MSMC SRAM or DDR.
Reducing offload overhead[edit]
Data synchronization between the host and target device can be a significant source of overhead. This overhead has implications for the amount of computation that needs to be performed by a target region to outweigh the data synchronization overhead. On the 66AK2H SoC, the host and target device share internal and external memory. However:
- the target device does not have a memory management unit (MMU); and
- there is no hardware cache coherency between the target and host device.
As a result, the host accesses shared memory using virtual addresses while the target accesses the shared memory using physical addresses. Moreover, host device variables can span multiple non-contiguous pages in Linux virtual memory, whereas the target device operates on contiguous physical memory. When mapping variables from the Linux process space, the variables must be copied into contiguous memory for target operation. This copy is inefficient, especially for large variables. To eliminate this copy, the implementation provides a special-purpose dynamic memory allocation API: __malloc_ddr and __malloc_msmc. The physical memory underlying these heaps is contiguous and is mapped to a contiguous chunk of virtual memory on the host. If any host variables allocated via this API are mapped into target regions, the map clauses translate to cache management operations on the host, significantly reducing the overhead.
The following TI-provided functions may be used to allocate and free contiguous segments of memory that may be accessed by all ARM and DSP cores:
- __malloc_ddr: Allocate a buffer in contiguous DDR memory
- __free_ddr: Free a buffer allocated with __malloc_ddr
- __malloc_msmc: Allocate a buffer in contiguous MSMC memory
- __free_msmc: Free a buffer allocated with __malloc_msmc
These functions have the same interface as standard malloc/free. Note that when using these functions within an OpenMP application, any required memory consistency operations are automatically handled by the supporting runtime systems.
<source lang="cpp" strict enclose="div" header="" footer="">
/* Allocate buffer in device memory */
float* a = (float*) __malloc_ddr(bufsize);

/* Initialize buffer on the host */
for (int i = 0; i < NumElements; ++i)
{
    a[i] = 1.0;
}

/* Map to is a cache write-back operation on the host; no copy is performed.
   Map from is a cache invalidate operation on the host; no copy is performed. */
#pragma omp target map(to: a[0:size], size) map(from: a[0:size])
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        a[i] *= 2.0;
}

/* Free buffer */
__free_ddr(a);
</source>
The table below lists data movement and cache operations that are performed by the Accelerator Model runtime for the various target constructs. This information is specific to TI's implementation of the OpenMP Accelerator Model.
Location of host variable                  | OpenMP 4.0 construct       | To target                                                          | From target
Linux paged memory (heap/stack/global)     | target or target data map  | memcpy from host memory to target memory                           | memcpy from target memory to host memory
Linux paged memory (heap/stack/global)     | target update              | memcpy from host memory to target memory                           | memcpy from target memory to host memory
Allocated via __malloc_ddr / __malloc_msmc | target or target data map  | ARM cache writeback of the address range specified by the clause   | ARM cache invalidate of the address range specified by the clause
Allocated via __malloc_ddr / __malloc_msmc | target update              | ARM cache writeback of the address range specified by the clause   | ARM cache invalidate of the address range specified by the clause
local map type[edit]
An additional local map type has been added, which maps a variable to the L2 scratchpad memory. In terms of data synchronization, such variables are treated like map-type alloc: they have an undefined initial value on entry to the target region, and any updates made to the variable in the target region are not reflected back to the host. Mapping host variables to target scratchpad memory can provide significant performance improvements.
On the 66AK2H, each DSP core's 1 MB L2 memory is configured as 128 KB of cache, 768 KB of scratchpad available to user programs, and 128 KB reserved by the runtime. The 768 KB scratchpad is accessible via the local map type.
<source lang="cpp" strict enclose="div">
/* A 1KB scratch buffer */
char* scratch_buffer = (char*) malloc(1024);

#pragma omp target map(tofrom: a[0:size]) map(local: scratch_buffer[0:1024])
{
    /* a[] is copied to the device.
       scratch_buffer[] is allocated in L2 SRAM and is NOT copied to the device. */

    /* Perform operations on buffer a[] in DDR using the L2 SRAM scratch_buffer */
    operate_on(a, scratch_buffer);
} /* a[] is copied back to the host;
     scratch_buffer[] is NOT copied back to the host */
</source>
Dynamic Memory Management within Target Regions[edit]
Dynamic memory requests via malloc/calloc/realloc within OpenMP target regions are allocated from a heap managed by the Multicore DSP runtime. This heap is located in DDR, is shared by all DSP cores, and is 8 MB in size.
If a target region requires a heap larger than 8 MB, or a heap in MSMC SRAM, the runtime provides additional APIs to initialize and manage such heaps. Descriptions and examples of their use are provided in DSP Heap Management.
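The sketch below illustrates the default behavior described above: a standard malloc/free pair inside a target region is served from the runtime-managed 8 MB DDR heap. The function name, buffer size, and computation are illustrative assumptions, not taken from this guide.
<source lang="cpp" strict enclose="div">
#include <stdlib.h>

float scale_and_reduce(int size)
{
    float result = 0.0f;

    #pragma omp target map(to: size) map(from: result)
    {
        /* Allocated from the 8 MB DDR heap managed by the Multicore DSP runtime */
        float *tmp = (float *) malloc(size * sizeof(float));
        if (tmp)
        {
            int i;
            for (i = 0; i < size; i++)
                tmp[i] = i * 2.0f;
            result = tmp[size - 1];
            free(tmp);  /* Returned to the same runtime-managed heap */
        }
    }
    return result;
}
</source>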
Change Log[edit]
Package Version | Build Date | Short Description                           | Changes
1.1.0           | 09-10-2014 | Production patch                            |
1.0.0           | 07-30-2014 | Production OpenMP Accelerator Model release |
0.3.3           | 04-25-2014 | MCSDK GA Release                            |
0.2.2           | 02-26-2014 | Patch release                               |
0.2.1           | 02-24-2014 | Patch release                               |
0.2.0           | 02-17-2014 | First release                               | First release