
OpenMP Dispatch With OpenCL


An OpenCL program consists of a host program that executes on the host, and kernels that execute on OpenCL devices. A TI extension allows an OpenCL kernel to act as a wrapper that invokes C functions containing OpenMP regions. These three components — the host program, the OpenCL kernel, and the OpenMP region — form the main parts of an OpenCL program capable of dispatching OpenMP regions.

The diagram below provides an overview of how OpenMP regions are dispatched using OpenCL APIs. The host program executes on SMP Linux running on the ARM Cortex-A15, while the OpenCL kernel and the OpenMP region execute on an OpenCL device consisting of the DSPs. The OpenCL kernel invokes the OpenMP main thread on DSP Core 0. When an OpenMP region is encountered while executing the main thread, DSP Core 0 and the other DSP cores work together to execute the region.

[Figure TIOpenCLOpenMPDispatch.png: OpenMP regions dispatched from the ARM host to the DSP device via OpenCL]


Structure of an OpenCL + OpenMP Application

This section uses the vecadd_openmp example in the OpenCL package to describe the structure of an OpenCL + OpenMP application.

Host Code

The host code is responsible for setting up and managing the execution of kernels on OpenCL devices.

The host code for a typical OpenCL + OpenMP application on 66AK2H starts by creating an OpenCL context. A context facilitates the creation of programs and kernels, the creation of command queues, the management of memory between the host and the DSPs, and the submission of kernels for execution on the device.

The following host code from the vecadd_openmp example illustrates the responsibilities of the host program.

<syntaxhighlight lang="cpp">
   ...
   float srcA  [NumElements];
   float srcB  [NumElements];
   float dst   [NumElements];
   float Golden[NumElements];
   ...

   // Create a context with the DSP device
   Context context(CL_DEVICE_TYPE_ACCELERATOR);

   // Get information about the device associated with the context
   std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
   std::string str;
   devices[d].getInfo(CL_DEVICE_NAME, &str);
   cout << "DEVICE: " << str << endl << endl;

   // Create input and output buffers
   Buffer bufA  (context, CL_MEM_READ_ONLY,  bufsize);
   Buffer bufB  (context, CL_MEM_READ_ONLY,  bufsize);
   Buffer bufDst(context, CL_MEM_WRITE_ONLY, bufsize);

   ifstream t("vadd_wrapper.cl");
   std::string kSrc((istreambuf_iterator<char>(t)), istreambuf_iterator<char>());

   // Create a program from the kernel source code
   Program::Sources source(1, make_pair(kSrc.c_str(), kSrc.length()));
   Program program = Program(context, source);

   // Compile the kernel source and link it with the specified object file
   program.build(devices, "vadd_openmp.obj");

   // Specify which kernel from the program to execute
   Kernel kernel(program, "vadd_wrapper");

   // Set the argument list for the kernel command
   kernel.setArg(0, bufA);
   kernel.setArg(1, bufB);
   kernel.setArg(2, bufDst);
   kernel.setArg(3, NumElements);

   Event ev1,ev2,ev3,ev4,ev5,ev6,ev7,ev8;

   // Create an in-order command queue using the context and device
   CommandQueue InO_Q(context, devices[d], CL_QUEUE_PROFILING_ENABLE);

   // Enqueue commands to copy data into the input buffers
   InO_Q.enqueueWriteBuffer(bufA, CL_FALSE, 0, bufsize, srcA, NULL, &ev1);
   InO_Q.enqueueWriteBuffer(bufB, CL_FALSE, 0, bufsize, srcB, NULL, &ev2);

   std::vector<Event> vec_ev5(1);

   // Enqueue the kernel for execution as an OpenCL task
   InO_Q.enqueueTask(kernel, NULL, &vec_ev5[0]);

   // Enqueue a command to copy the results out of the output buffer
   InO_Q.enqueueReadBuffer(bufDst, CL_TRUE, 0, bufsize, dst, &vec_ev5, &ev6);
</syntaxhighlight>

Kernels that invoke functions containing OpenMP regions must be enqueued to an in-order command queue. In-order execution serializes the commands in a command queue and therefore ensures that only one such "OpenMP" kernel is executed by the device at a time.
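The command queue in the host code above is in-order simply because the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property is not set when the queue is created. The following is a minimal sketch of the distinction, reusing the context and devices[d] objects from the host code above:

<syntaxhighlight lang="cpp">
// In-order queue: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not set, so
// commands (buffer writes, the wrapper kernel, buffer reads) execute in
// the order in which they are enqueued. This is the form required for
// kernels that invoke OpenMP regions.
CommandQueue InO_Q(context, devices[d], CL_QUEUE_PROFILING_ENABLE);

// Out-of-order queue: not suitable for dispatching OpenMP wrapper kernels.
// CommandQueue OoO_Q(context, devices[d],
//                    CL_QUEUE_PROFILING_ENABLE |
//                    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
</syntaxhighlight>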

Observe that the host enqueues the commands to write bufA and bufB (from srcA and srcB, which reside on the host) before enqueuing the kernel. Because the queue is in-order, the data transfers complete before execution of the OpenCL kernel begins. Likewise, the command to read from bufDst into dst (residing on the host) executes only after the kernel completes.

For OpenCL + OpenMP applications, kernels are dispatched as OpenCL tasks. Only a single instance of the kernel is executed by one of the DSPs.
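In standard OpenCL, a task is equivalent to an NDRange kernel containing a single work-item. A minimal sketch of the two equivalent dispatch forms, using the kernel and queue created in the host code above:

<syntaxhighlight lang="cpp">
// Dispatch the wrapper kernel as an OpenCL task: a single kernel instance
// runs on one DSP core and invokes the OpenMP main thread.
InO_Q.enqueueTask(kernel, NULL, &vec_ev5[0]);

// Equivalent dispatch: a one-dimensional NDRange with a single work-item.
// InO_Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(1), NDRange(1),
//                            NULL, &vec_ev5[0]);
</syntaxhighlight>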

Kernel

Kernels in OpenCL + OpenMP applications are essentially wrappers that call functions containing OpenMP regions. In the vecadd_openmp example, the vadd_wrapper kernel calls vadd_openmp (which contains an OpenMP region) after initializing c[]. Note that the kernel is initially executed by a single DSP core until the OpenMP region is encountered.

<syntaxhighlight lang="c">
__kernel void vadd_wrapper(__global const float *a,
                           __global const float *b,
                           __global       float *c,
                           int                   size)
{
    vadd_openmp(a, b, c, size);
}
</syntaxhighlight>

C Function with OpenMP regions

OpenMP directives are used to express and guide parallelism. In the vadd_openmp function, the directive before the for loop indicates that chunks of the loop iterations may be distributed to multiple cores and executed concurrently.

<syntaxhighlight lang="c">
void vadd_openmp(float *a, float *b, float *c, int size)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}
</syntaxhighlight>

Note that the C6000 compiler currently supports OpenMP directives only for C code. OpenMP directives in C++ are not yet supported.

Makefile

The Makefile defines the rules for building and cleaning the executable. At a minimum, it specifies the host and device compilers, the compiler flags, and the linker options.

For the vecadd_openmp example, the host code and the OpenMP code are compiled separately, by g++ and by the C6000 compiler (cl6x), respectively. Notice that the --omp compiler flag is specified to enable OpenMP compilation of the device code.

<syntaxhighlight lang="bash">
# Host compiler and compile flags/options. Used to compile host code
CPP          = g++ $(HOST_INCLUDE)
CPP_FLAGS    = -O3
HOST_INCLUDE = -I$(TI_OCL_INSTALL)/include

# Device OpenMP compiler and compile flags/options. Used to compile OpenMP code for the device
CL6X        = cl6x -mv6600 --abi=eabi $(DSP_INCLUDE)
CL6X_FLAGS  = -O1 --omp
DSP_INCLUDE = -I$(TI_OCL_CGT_INSTALL)/include

# Linker options and libraries for the final executable
LIBS = -L$(TI_OCL_INSTALL)/lib -lOpenCL -locl_util

# Rule for building the final executable
EXE = vecadd_openmp

$(EXE): main.o vadd_openmp.obj
	@$(CPP) $(CPP_FLAGS) main.o $(LIBS) -o $@

# Rules for compiling host C/C++ code
%.o: %.cpp
	@$(CPP) -c $(CPP_FLAGS) $<
	@echo Compiling $<

%.o: %.c
	@$(CPP) -c $(CPP_FLAGS) $<
	@echo Compiling $<

# Rule for compiling OpenMP C code for the device
%.obj: %.c
	@$(CL6X) -c $(CL6X_FLAGS) $<
	@echo Compiling $<
</syntaxhighlight>

Examples

The OpenCL package is shipped with examples that use OpenCL to dispatch OpenMP. Please see OpenCL Demos for details on locating and building the examples. Examples that use OpenCL and OpenMP are:

  • vecadd_openmp: Computes the sum of two vectors. The host program manages data transfers and dispatches an OpenCL wrapper kernel to the device; the wrapper kernel calls a C function that has been compiled with OpenMP options, and the OpenMP loop construct distributes the iterations of the computation across the DSP cores.
  • vecadd_openmp_t: Similar to vecadd_openmp, except that it uses OpenMP tasks within the OpenMP parallel region to distribute computation across the DSP cores (see the sketch following this list).
  • openmpbench_C_v3: Derived from the EPCC OpenMP micro-benchmarks, v3.0. The syncbench test was modified to dispatch using OpenCL.
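The vecadd_openmp_t source is not reproduced here; the following is only an illustrative sketch of how OpenMP tasks created inside a parallel region could distribute a chunked vector addition across the DSP cores. The function name vadd_tasks and the chunk size are assumptions for illustration, not taken from the shipped example.

<syntaxhighlight lang="c">
/* Illustrative sketch only: the chunk size and structure are assumptions,
   not the code shipped in vecadd_openmp_t. */
#define CHUNK 1024

void vadd_tasks(float *a, float *b, float *c, int size)
{
    int start;

    #pragma omp parallel
    {
        /* One thread creates the tasks ...                                 */
        #pragma omp single
        {
            for (start = 0; start < size; start += CHUNK)
            {
                int n = (size - start < CHUNK) ? (size - start) : CHUNK;

                /* ... and the team of DSP threads executes them.           */
                #pragma omp task firstprivate(start, n)
                {
                    int i;
                    for (i = start; i < start + n; i++)
                        c[i] = a[i] + b[i];
                }
            }
        }
    } /* implicit barrier: all tasks complete before the region ends */
}
</syntaxhighlight>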

Guidelines for writing OpenCL + OpenMP applications

Remember that using OpenCL kernels to dispatch OpenMP regions is a TI-specific extension to OpenCL. The following are guidelines for writing applications in this mode.

  • The DSP's C6000 compiler currently supports the OpenMP 3.0 specification for C. OpenMP directives in C++ are not yet supported. See the OpenMP website for the complete C language specification of the OpenMP Application Program Interface, Version 3.0.
  • OpenMP threadprivate variables are not supported.
  • Only one level of parallelism is supported by the OpenMP runtime. Therefore, nested parallel regions will be executed by teams comprising only one thread.
  • OpenMP environment variables are not supported and have no effect on the OpenMP runtime behavior.
  • OpenMP timing routines (omp_get_wtime, omp_get_wtick) assume that the DSP cores are running at 1 GHz (see the sketch following this list).
  • The current configuration of the OpenMP runtime reserves 1KB of memory for each core's stack. Since stack sizes are small, avoid placing large arrays/structs on the stack. Also, keep the call stack short to avoid stack overruns.
  • The current configuration of the OpenMP runtime reserves 80KB of memory for the shared heap. Since this heap is quite small, avoid dynamic memory allocation (e.g., calls to malloc) on the device. Instead, create OpenCL buffers on the host and pass them as arguments to kernels.
  • By default, global variables are placed in DDR.
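As an illustration of the timing note above, the following sketch (a timed variant of vadd_openmp introduced for illustration, not part of the shipped examples) wraps the parallel region with omp_get_wtime() calls; the reported wall-clock time relies on the 1 GHz assumption mentioned in the guidelines.

<syntaxhighlight lang="c">
#include <omp.h>
#include <stdio.h>

/* Illustrative sketch only: a timed variant of vadd_openmp.
   omp_get_wtime()/omp_get_wtick() assume the DSP cores run at 1 GHz. */
void vadd_openmp_timed(float *a, float *b, float *c, int size)
{
    int    i;
    double t0, t1;

    t0 = omp_get_wtime();

    #pragma omp parallel for
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];

    t1 = omp_get_wtime();

    printf("vadd: %f seconds (timer resolution %f s)\n", t1 - t0, omp_get_wtick());
}
</syntaxhighlight>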



