
MCSDK VIDEO 2.2.0.38 Development Guide



MCSDK Video

Version 2.2.0.38 and Earlier

Development Guide

Last updated: 04/29/2014


Introduction[edit]

Please note that this page applies to MCSDK Video 2.2.0.38 and earlier versions.

The MCSDK Video DSP application builds on BIOS-MCSDK and executes highly compute-intensive video processing on TI C66x multicore DSPs. The MCSDK Video 2.x release also provides a host application built on the Desktop Linux SDK, which offers a user-friendly interface for running video demos on a Linux PC. As shown in the figure below, the DSP and host processor share common header files for message interpretation and communicate through the host-DSP mailbox.

MCSDKVideo_HostDSP_Overview.png

The intended audience for this document is developers interested in the design of the MCSDK Video host and DSP applications. Specifically, this MCSDK Video Development Guide provides information about:

  • MCSDK Video Host Application Development
  • MCSDK Video DSP Development


MCSDK Video Host Application Development[edit]

As the MCSDK Video host application is built upon the Desktop Linux SDK, it is highly recommended to first go over the Development Guide for the Desktop Linux SDK before continuing. Specifically, this section discusses the following design aspects of the MCSDK Video host application.

  1. Multiple Threads to Support Parallel Operation
  2. Efficient and Fragmentation-Free Memory Management
  3. Control Message Exchange via Pipes and Mailboxes
  4. Scheduler Design Tailoring to Codecs
  5. Operation Details for Threads
  6. Example Memory Usage and Data Flow for H264HP Encoding and JPEG2000 Decoding


Multiple Threads to Support Parallel Operation[edit]

The MCSDK Video host application is multi-threaded so that the DSPs run in parallel while the host processor prepares subsequent data chunks for further video processing and/or consumes processed video data from the DSPs. To use all PCIe lanes concurrently, each device (DSP chip) is served by a Tx thread and an Rx thread. The following table lists the threads used in the MCSDK Video host application. For thread priority, the higher the number, the higher the priority.

Thread | Description | Number of Threads | Priority
Input thread (File Reader) | Reads input data from file into the x86 input buffer | One per application | 1
Output thread (File Writer or Display) | Writes output data in the x86 output buffer to file or display | One per application | 4
DeviceIO Tx thread | Provides input data to the DSP and notifies the DSP | One per device | 2
DeviceIO Rx thread | Queries the DSP and gets output data from the DSP | One per device | 3


Efficient and Fragmentation-Free Memory Management[edit]

Built on Desktop Linux SDK, MCSDK Video host application uses SDK's Buffer Manager and Contiguous Memory Driver components. The Contiguous Memory Driver is used to obtain physically contiguous memory on the host which can be mapped over the PCIe interface. The Buffer Manager is used to achieve fragmentation-free memory management for efficient data exchange between host and DSP through EDMA.



Memory Types and Mechanisms for Data Exchange[edit]

While the MCSDK Video DSP application conducts highly compute-intensive video processing, the MCSDK Video host application sends input data to the DSPs for processing and/or receives processed data from the DSPs for file saving or display. Three types of memory can be involved in such host-DSP data exchange:

  1. Contiguous x86 memory
  2. C66x DDR memory 
  3. C66x PCIe memory (Contiguous x86 buffers are mapped to this region of DSP memory) 

As in Desktop Linux SDK, the data exchange between host and DSP applications can be achieved in three ways.

  1. The host processor's data buffer (i.e., x86 memory) is copied into the DSP's DDR (i.e., C66x DDR memory) using EDMA. Since DDR memory is only shared between cores on a single DSP, this is beneficial when a buffer is only needed on a single DSP or DSP core.
  2. The DSP picks up the data from the host memory buffer (x86 memory mapped to C66x PCIe memory) directly as the DSP consumes the data. This is used when a buffer needs to be shared concurrently with multiple DSPs.
  3. The host processor copies data directly to DSP memory. This method offers much lower bandwidth than the previous methods since the host processor performs the transfer. It is typically used when a small amount of data needs to be transferred, such as configuration buffers.



Multiple Types of Buffer Pools[edit]

The MCSDK Video host application uses the Buffer Manager component from the Desktop Linux SDK to perform fragmentation-free memory allocation. Six types of buffer pools are created and used in the MCSDK Video host application.

Buffer Pool | Number of Pools | Chunk Size | Number of Chunks | Allocator Thread | Freer Thread
x86 Input Pool | One per application | 4MB | 32 | File Reader | DeviceIO Tx or Rx
x86 Output Pool | One per application | 4MB | 80 | DeviceIO Tx or Rx | File Writer
C66x DDR Input Pool | One per device | 16MB | 16 | DeviceIO Tx | DeviceIO Rx
C66x DDR Output Pool | One per device | 16MB | 16 | DeviceIO Tx | DeviceIO Rx
C66x PCIe Input Pool | One per application | 16MB | 2 | DeviceIO Tx | DeviceIO Rx
C66x PCIe Output Pool | One per application | 8MB | 2 | DeviceIO Tx | DeviceIO Rx

[note: need to add details on how the number of chunks and chunk sizes are determined; not all the buffer pools are used --> need to have specific use cases; need to avoid using DeviceIO "Tx or Rx"]


Frame Descriptor to Construct Frames from x86 Buffer Pools[edit]

In order to use EDMA for efficient data transfer, the Contiguous Memory Driver from the Desktop Linux SDK is used to allocate x86 input/output buffers in contiguous physical memory. Due to the Linux restriction of not being able to allocate more than 4 MB of physically contiguous memory, the chunk size of the x86 input/output pools is set to 4MB. The Desktop Linux SDK can overcome this limit by configuring the bootloader to reserve memory outside of the kernel, but this is unreliable in systems with a small amount of memory. Since the size of a contiguous buffer is constant, while the data unit in MCSDK Video is a video frame whose size can be large, a frame descriptor is introduced to allocate and group multiple x86 buffers for a single frame.

A frame descriptor contains a frame ID, a handle to the buffer pool, and an array of buffer descriptors that specify the base and length of the buffers from the buffer pool. Using a frame descriptor, a frame can be allocated across multiple buffers with arbitrary start and end offsets, as shown in the example below. Furthermore, these discontinuous 4MB (or smaller) buffers can be mapped to appear as contiguous buffers in the DSP address space.
FrameBuffer.png
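
To make this concrete, below is a minimal, hypothetical C sketch of such a frame descriptor; the structure and field names are illustrative assumptions and are not taken from the MCSDK Video source.

/* Hypothetical frame descriptor: groups several <=4MB contiguous x86 buffers
 * into one logical video frame. Names and the fixed array size are assumed. */
#include <stdint.h>

#define MAX_BUFS_PER_FRAME 8              /* assumed limit, for illustration only */

typedef struct {
    uint8_t  *base;                       /* start of valid data in one contiguous chunk */
    uint32_t  length;                     /* number of valid bytes in this chunk         */
} bufDesc_t;

typedef struct {
    uint32_t   frameId;                   /* restores frame order in the Reorder Queue   */
    void      *bufPoolHandle;             /* buffer pool the chunks were allocated from  */
    uint32_t   numBufs;                   /* how many chunks make up this frame          */
    bufDesc_t  bufs[MAX_BUFS_PER_FRAME];  /* per-chunk base and length                   */
} frameDesc_t;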

Control Message Exchange via Pipes and Mailboxes[edit]

The four threads of the MCSDK Video host application exchange control messages through pipes and the mailbox component from the Desktop Linux SDK to achieve seamless data exchange between the host processor and the DSPs. There are two mailboxes for each DSP core: one for storing pending host-to-DSP messages and another for storing pending DSP-to-host messages.

  1. When an input frame is read from file to the input buffer, the Input thread (File Reader) sends the frame information to the DeviceIO Tx thread by writing to the input pipe.
  2. The DeviceIO Tx thread reads from the input pipe, transfers the frame data to the DSP, and then communicates the input frame information to the DSP using the mailbox.
  3. The DeviceIO Rx thread queries the DSPs via the mailbox to find a DSP-processed output frame. It transfers the output data to a contiguous x86 buffer and then sends the output frame information to the Output thread by writing to a Reorder Queue. Through the Reorder Queue, the processed frames from the DSPs are reordered according to the frame ID and written to the output pipe in the correct order.
  4. The Output thread (File Writer or Display) reads from the output pipe and then saves the output to disk or displays it.

VideoHost_Arch1.PNG
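
As a minimal illustration of step 1, the sketch below shows an Input thread handing a frame to a DeviceIO Tx thread over a pipe. The message layout, field names, and helper function are hypothetical assumptions (reusing the frame descriptor sketch above), not the actual MCSDK Video host source.

/* Hypothetical pipe message from the Input (File Reader) thread to a
 * DeviceIO Tx thread; one message is sent per write(), which is atomic
 * on Linux pipes for messages smaller than PIPE_BUF. */
#include <stdint.h>
#include <unistd.h>

typedef struct {
    void     *frameDesc;   /* frame descriptor built by the File Reader (see sketch above) */
    int32_t   deviceId;    /* DSP chip selected by the scheduler                            */
    int32_t   nodeId;      /* DSP core selected by the scheduler                            */
    uint32_t  frameId;     /* used later by the Reorder Queue                               */
} pipeMsg_t;

static int send_frame(int pipe_wr_fd, const pipeMsg_t *msg)
{
    ssize_t n = write(pipe_wr_fd, msg, sizeof(*msg));
    return (n == (ssize_t)sizeof(*msg)) ? 0 : -1;
}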


Scheduler Design Tailoring to Codecs[edit]

In the MCSDK Video application, there is a scheduler that keeps track of certain DSP cores (depending on the scheduler topology, as described below) and dispatches tasks to those cores according to the DSPs' task load.

Tailored to specific codecs, the following three scheduler topologies can be supported, along with anything in between.

  • The scheduler manages all DSP cores and dispatches tasks to all of them. An example codec is the JPEG2000 decoder.


Scheduler_Topology1.JPG

  • The scheduler manages one master core in each DSP chip. High-level tasks are dispatched by the scheduler, while the master core in each DSP internally dispatches tasks to the slave cores. An example codec is the AVCIU encoder.


Scheduler_Topology2.JPG

  • The scheduler manages and delegates tasks to a single master DSP core. The master core internally dispatches tasks to the slave cores. An example codec is the H.264HP encoder.


Scheduler_Topology3.JPG

The scheduler maintains the task loading of its managed DSP cores using the number of queued tasks on those cores. Initially, all the managed cores have zero tasks queued. When a task is assigned to a core, its task count is incremented by one. When processed output from a core is picked up, its task count is decremented by one.

According to the task loading maintained as above, the scheduler in the MCSDK Video host application picks the least busy core for task dispatch, starting from the first core.
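
A minimal sketch of this least-busy selection, assuming a simple per-core counter of queued tasks, is shown below; it is illustrative only and does not reproduce the actual scheduler source.

/* Hypothetical task-load bookkeeping for the scheduler described above. */
#include <stdint.h>

#define MAX_MANAGED_CORES 32                 /* assumed upper bound            */

typedef struct {
    uint32_t numCores;                       /* cores managed by the scheduler */
    uint32_t taskCount[MAX_MANAGED_CORES];   /* queued tasks per managed core  */
} scheduler_t;

/* Pick the first core with the smallest number of queued tasks. */
static uint32_t sched_pick_core(const scheduler_t *s)
{
    uint32_t best = 0;
    for (uint32_t i = 1; i < s->numCores; i++) {
        if (s->taskCount[i] < s->taskCount[best])
            best = i;
    }
    return best;
}

static void sched_load(scheduler_t *s, uint32_t core)   { s->taskCount[core]++; }  /* on task dispatch  */
static void sched_unload(scheduler_t *s, uint32_t core) { s->taskCount[core]--; }  /* on output pickup  */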


Operation Details for Threads[edit]

With the major components described above, this section presents the operation details for the four threads of the video host application.


Input Thread (File Reader Thread)[edit]

Calling Sequence of Input Thread
  • Allocate x86 input buffer from input buffer pool and place into Frame Descriptor
  • Process iteration, file, and frame indices and open the correct input file
  • Read from input file into input buffer
  • If the frame size is met or the file has ended, call the scheduler to get the destination node (i.e., DSP core) and send the frame descriptor over the input pipe; otherwise, continue to allocate new input buffers


Configuration of Input Thread
  • Pipe descriptor
  • Buffer pool handle
  • Number of iterations, files, and frames
  • Size of frame to read before sending over pipe
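
The configuration items above could be collected into a structure along the following lines; the structure and its field names are hypothetical, not the actual host application types.

/* Hypothetical configuration for the Input (File Reader) thread. */
#include <stdint.h>

typedef struct {
    int       inputPipeFd;      /* write end of the pipe to the DeviceIO Tx thread   */
    void     *inputPoolHandle;  /* x86 input buffer pool to allocate from            */
    uint32_t  numIterations;    /* how many times to loop over the input files       */
    uint32_t  numFiles;         /* number of input files                             */
    uint32_t  numFrames;        /* frames to read per file                           */
    uint32_t  frameSize;        /* bytes to accumulate before sending over the pipe  */
} inputThreadCfg_t;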



DeviceIO TX Thread[edit]

Calling Sequence of DeviceIO TX Thread
  • Read message from the input pipe (Blocking); Pipe message contains frame descriptor, device ID, node ID, and frame ID.
  • Look up node in instance
  • Allocate Input DeviceIO descriptor (=> allocate device memory ), copy in frame descriptor
  • Allocate Output DeviceIO descriptor (=> allocate device memory )
  • If output device buffer is in PCIe memory, allocate Output frame descriptor
  • Provide input frame to DSP (map or DMA), and map output frame if needed.
  • If input device buffer is not in PCIe memory, free input frame descriptor
  • Compose mailbox message
  • Poll for DMA completion. Yield between polls
  • Send mailbox message to DSP node
  • If no free buffers, yield, then try again. (Similar to DMA polling)

[note: need to add details for device memory. need to have separate description for mapping and DMA]



DeviceIO RX Thread[edit]

Calling Sequence of DeviceIO RX Thread
  • Query mailboxes to find a node to unload
  • Read mailbox message
  • Find Input DeviceIO descriptor, based on freeBufIdD
  • If input was mapped to PCIe memory, free input frame descriptor buffers
  • Free Input DeviceIO descriptor and corresponding device buffer
  • Find Output DeviceIO descriptor based on outBufPtr
  • If output device buffer is not in PCIe space, allocate output frame descriptor.
  • Get output frame buffers from device. (Start DMA if needed, nothing if mapped)
  • Poll for DMA completion. Yield if not.
  • Call scheduler to unload node.
  • Place output frame in reorder queue.
  • Free Output DeviceIO descriptor, and corresponding device buffer.

[note: need to add details for device memory. need to have separate description for mapping and DMA]



Output Thread (File Writer or Display Thread)[edit]

Calling Sequence of Output Thread
  • Get frame descriptor from Reorder Queue
  • Process iteration, file, frame, and tile counts to get correct output file
  • Write frame to file and free frame descriptor buffers, or display frames


Configuration of Output Thread
  • Reorder queue handle
  • Number of iterations, files, and frames


Developing Video Demos on Quad C6678 PCIe card[edit]

This section discusses the memory usage when developing video demos on Quad C6678 PCIe card. It will also go over the data flow for H264HP encoding and JPEG2000 decoding demos.



Memory Usage on Quad C6678 PCIe card[edit]

The Quad C6678 PCIe card contains four C6678 DSP chips. As shown below, each core has its local L1 and L2 memory. Each chip has 4MB of MSMC and 1GB of dedicated DDR memory which is not available to the other chips. Each chip also has a 256 MB memory range which can be mapped to the memory of other devices connected through PCIe. In this demo, all C6678 devices map 128MB (0x60000000 – 0x67FFFFFF) of this memory space to the same global host x86 memory. This provides a global shared memory region which is available to both the host and all C6678 DSPs.
Memory_Overview.PNG

Furthermore, the Global Shared Memory is divided into multiple regions to allow data exchange between the x86 host and DSPs and communication between DSPs.

  1. Shared Memory (0x60000000 – 0x61FFFFFF): this area is used for communication between DSPs. The IVIDMC multichip implementation uses this area for shared memory and software barriers.
  2. Codec Output Scratch: (0x62000000 – 0x62FFFFFF): this area is divided between all cores and is used for the output of the encoding algorithm. Once all cores have finished processing, the master core will accumulate the outputs generated here to be placed in a host supplied buffer.
  3. Host I/O Buffers ( 0x63000000 – 0x66FFFFFF): this area is owned by the host which will divide it into input and output buffers. The host writes data to the input buffer, supplies an input and output buffer pointer to the master core, and will read the output once the processing is complete.
  4. DMA Buffers (0x67000000 – 0x67FFFFFF): this memory region is used by the host when it triggers DMA transfers into the DDR memory on the DSPs.


Memory_Global_Shared1.PNG
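
For reference, the region boundaries listed above can be expressed as the following address map; the macro names are illustrative assumptions and are not taken from the MCSDK Video source.

/* Global shared memory map on each C6678 (128MB PCIe window mapped to host x86 memory). */
#define GLOBAL_SHARED_BASE       0x60000000u   /* start of the 128MB shared window           */
#define SHARED_MEM_BASE          0x60000000u   /* IVIDMC shared memory and software barriers */
#define SHARED_MEM_SIZE          0x02000000u   /* 32MB  (0x60000000 - 0x61FFFFFF)            */
#define CODEC_OUT_SCRATCH_BASE   0x62000000u   /* per-core encoder output scratch            */
#define CODEC_OUT_SCRATCH_SIZE   0x01000000u   /* 16MB  (0x62000000 - 0x62FFFFFF)            */
#define HOST_IO_BUF_BASE         0x63000000u   /* host-owned input/output buffers            */
#define HOST_IO_BUF_SIZE         0x04000000u   /* 64MB  (0x63000000 - 0x66FFFFFF)            */
#define DMA_BUF_BASE             0x67000000u   /* staging region for host-triggered DMA      */
#define DMA_BUF_SIZE             0x01000000u   /* 16MB  (0x67000000 - 0x67FFFFFF)            */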



Data Flow of H264HP Encoding[edit]

The data flow of H264HP encoding is described below. DSP operations are differentiated using italic text.

  1. Host [Input Thread] allocates an input buffer from the Contiguous x86 Input Buffer Pool.
  2. Host [Input Thread] writes the input frame to the input buffer and sends the buffer information over the pipe to the DeviceIO Tx Thread.
  3. Initially, all slave cores are waiting on a barrier for master core to enter the barrier while master core is polling on mailbox waiting to receive a host message requesting Encode.
  4. Host [DeviceIO Tx Thread] allocates output buffers from the output c66 DDR Buffer Pool.
  5. Host [DeviceIO Tx Thread] sends a message to the master core with input and output buffer pointers and an input ID.
  6. Master core writes input buffer pointer to Shared Memory region and enters the software barrier, thus releasing the slave cores from the barrier.
  7. Slave cores acquire input pointer from Shared Memory region.
  8. All cores exit out of Barrier and process the input and write their output to the Codec Output Scratch region.
  9. Master core accumulates each core’s output and writes it to the output buffer pointer that was supplied by the host.
  10. Master core sends a message to the host with output buffer pointer, output ID, and 0 or more input IDs corresponding to input buffers which can be freed.
  11. Host [DeviceIO Rx Thread] queries master core's mailbox to get the message for the processed frame.
  12. Host [DeviceIO Rx Thread] frees any input buffers (both x86 and c66 PCIe buffers) as indicated by the message from the master core.
  13. Host [DeviceIO Rx Thread] allocates output x86 buffers. The number of buffers allocated is determined by the size of the output c66 DDR buffers.
  14. Host [DeviceIO Rx Thread] triggers the DMA from the output c66 DDR buffer to the output x86 buffer.
  15. Host [DeviceIO Rx Thread] polls for the DMA transfer to be complete.
  16. Host [DeviceIO Rx Thread] frees the output c66 DDR buffer.
  17. Host [DeviceIO Rx Thread] places output frame into the Reorder Queue.
  18. Host [Output Thread] reads output from output x86 buffer pointer, frees output buffer.



Data Flow of JPEG2000 Decoding[edit]

The data flow of JPEG2000 decoding is described below. DSP operations are differentiated using italic text. [need to add details]

  1. Host [Input Thread] allocates an input buffer from the Contiguous x86 Input Buffer Pool, and writes input frame to the input buffer 
  2. Host [Input Thread] gets the destination core and DSP from the Scheduler, and sends the buffer and destination core information over the pipe to the DeviceIO Tx Thread corresponding to the designated DSP.
  3. Initially, all cores are polling on mailbox waiting to receive a host message requesting Decode.
  4. Host [DeviceIO Tx Thread] receives the message over pipe and allocates an input and output buffer from the c66 DDR Buffer Pools and triggers a DMA transfer from the input x86 buffer to the input c66 DDR buffer.
  5. Host [DeviceIO Tx Thread] polls for the DMA transfer to be complete.
  6. Host [DeviceIO Tx Thread] frees the input x86 buffer.
  7. Host [DeviceIO Tx Thread] sends a message to the designated core with input and output buffer pointers and an input ID.
  8. Core receives the mailbox message and decodes the input c66 DDR buffer and writes the output to the output c66 DDR buffer.
  9. Core sends a message to the host with output buffer pointer, output ID, and 0 or more input IDs corresponding to input c66 DDR buffers which can be freed.
  10. Host [DeviceIO Rx Thread] queries each core's mailbox to get the message for the processed frame.
  11. Host [DeviceIO Rx Thread] frees the input c66 DDR buffer.
  12. Host [DeviceIO Rx Thread] allocates output x86 buffers. The number of buffers allocated is determined by the size of the output c66 DDR buffers.
  13. Host [DeviceIO Rx Thread] triggers the DMA from the output c66 DDR buffer to the output x86 buffer. 
  14. Host [DeviceIO Rx Thread] polls for the DMA transfer to be complete.
  15. Host [DeviceIO Rx Thread] frees the output c66 DDR buffer.
  16. Host [DeviceIO Rx Thread] places output frame into the Reorder Queue.
  17. Host [Output Thread] reads the output from the output x86 buffer pointer and frees the output buffer.


MCSDK Video DSP Development[edit]

The MCSDK Video DSP build has been developed to facilitate the development and testing of video, audio, and image codecs. Standardized XDM-compliant wrappers are provided to exercise codec APIs, with various pre-integrated video codecs as examples. Specifically, this section provides information about:

  • Framework Folders and Make Instructions
  • Integrating New Codec in MCSDK Video DSP Build
  • Cache Usage
  • Multi Core Video Interface APIs

Framework Folders and Make Instructions[edit]

Framework Folders[edit]

The following table lists the folders required for building the Codec Framework. It also highlights the folders that users are expected to change when plugging a new XDM-compliant codec into the framework.

Top Level Folder | Brief Description | Users Expected to Change?
\dsp | Build-related perl files | NO
\dsp | Bootloader shared information with the application | NO
\dsp | Build framework files (ggcfg) | YES
\dsp | Hardware Abstraction Layer | NO
\dsp | Memory management and utilities | NO
\dsp | Main Makefile (mkrel) | YES, for setupenvMsys.sh
\dsp | Network Driver Interface | NO
\dsp | Codec integration framework (siu) | YES
\dsp | Build files (ggcfg\build\sv04) | YES
\components | Video components | NO
\inc | Include files to talk to various components | NO


Make Instructions[edit]

Before making the Codec Test Application, please refer to Getting Started Guide for details on installing the MCSDK Video and its required tools.

When it is desired to change one (or more) of the codec algorithms or add a new codec algorithm, the following steps are needed in order to change and rebuild the Codec Test Application.

  • Make a copy of the dsp directory, modify the desired files in their home directories, and save them.
  • Configure environment: the build directory is dsp\mkrel. Run the batch file "\dsp\mkrel\setupenvMsys.bat bypass" to configure the environment. The batch file calls "\dsp\mkrel\setupenvMsys.sh" which will check if all the required components and tools are available at the specified locations.
  • Run dsp/mkrel/makefile to build DSP code: make sv04.
  • The build procedure will produce directory: \dsp\mkrel\sv04 with the following files: sv04.out, sv04.map, readme_sv04.txt



Useful Tip

1. If a source debugger requires all source files to be combined into a single directory, "FLAT=YES" may be added in the make command line, which will create the directory mkrel\sv04\flat containing all source and header files used for the build.
2. After making the first build, if there is no source file change in \components directory, "RTSCBYPASS=YES" may be added in the make command line, which will bypass compiling the components. If there is no source file change in dsp\ggcfg\build\hdg\sv04\bios directory, "BIOSCFGPKGBYPASS=YES" may be added in the make command line, which will bypass compiling the BIOS configuration package.


Integrating New Codec in MCSDK Video DSP Build[edit]

In MCSDK Video 2.1, all the C6678 codecs available at C6678 Codecs have been integrated. In MCSDK Video 2.2, the HEVC encoder and HEVC decoder are further integrated. Please follow the steps below to add a new codec.

Codec Algorithm[edit]

A new codec can be plugged into the existing sv04 infrastructure if it is an XDM 0.9 or XDM 2.0 compliant encoder or an XDM 1.3 compliant decoder. The product release of MCSDK Video further supports XDM 1.0 compliant video encoders and decoders, as well as XDM 1.0 compliant image encoders and decoders. If the codec is not compliant with any of the above XDM versions, an XDM wrapper needs to be created using the three existing XDM wrappers as references.

The codec algorithm source code can be compiled into a library either via make or via CCS. Once the codec library with its public API files is ready, add its environment to "\dsp\mkrel\setupenvMsys.sh" following the example below.

# H264BP encoder
VIDEO_H264_ENC_VERSION="C66x_h264venc_01_24_00_01_ELF"
VIDEO_H264_ENC_RUNPATH="$VIDEOBASE/$VIDEO_H264_ENC_VERSION"
make_shortname "VIDEO_H264_ENC_RUNPATH"
VIDEO_H264_ENC_SRCPATH="$VIDEO_H264_ENC_RUNPATH"
check_exist "VIDEO_H264_ENC_SRCPATH" "/packages/ti/sdo/codecs/h264venc/ih264venc.h"
COPY_TOOLS_LIST="$COPY_TOOLS_LIST VIDEO_H264_ENC"
...
export VIDEO_H264_ENC_DOS="`echo $VIDEO_H264_ENC_RUNPATH | $make_dos_sed_cmd`/packages"

Then, add the new codec path to "FXDCINC" in "\dsp\mkrel\c64x\makedefs.mk", following any of the video codecs in the example below:

FXDCINC = $(GGROOT)/mkrel/rtsc;$(TOOLSXDCDOS)/packages;$(BIOS_MCSDK_PDK6678_DOS);$(NWAL_C66X_DOS);$(VIDEO_MPEG4_ENC_DOS);$(VIDEO_MPEG4_DEC_DOS);$(VIDEO_MPEG2_DEC_DOS);$(VIDEO_H264_ENC_DOS);$(VIDEO_H264_DEC_DOS);$(VIDEO_H264HP_DEC_DOS);$(VIDEO_AVCIU_ENC_DOS);$(VIDEO_J2K_ENC_DOS);$(VIDEO_J2K_DEC_DOS);$(DSP_INSTALL_DIR_DOS)

Codec Client Glue Code[edit]

In order to glue a new codec to the framework, the following files must be written and placed as indicated:

  • Decoder: vct<codec>DecClient.c and vct<codec>DecClient.h files in siu\vct\codec\decoder\codec folder
  • Encoder: vct<codec>EncClient.c and vct<codec>EncClient.h files in siu\vct\codec\encoder\codec folder

The JPEG2K decoder and H.264BP encoder client code can be referred to as examples. <codec>encAPI and <codec>decAPI for the new codec must be defined according to the corresponding data structures defined in siuVctCodecAPI.h.

Update siuVctSupportedCodecs.c[edit]

Edit the file siuVctSupportedCodecs.c. Include the Client Glue code header file vct<codec>EncClient.h or vct<codec>DecClient.h and update the data structure encoderAPI_t or decoderAPI_t with the new API definition.

Edit makefile[edit]

siu\c64x\make\makefile must be edited to include the new header files and to compile the client glue code. Search for "avciu" in this makefile and make similar changes.

Update Linker Command File[edit]

ggcfg\build\hdg\sv04\ggvf0.becmd must be updated to include the new codec library.

Create Config Files[edit]

Create multiClip.cfg and codecParams.cfg for the codec (under siu/vct/testVecs/<codec>/config), and update the codecName in the codecParams.cfg with the string name that is used in the Glue Code. To test the newly added codec, modify siu/vct/testVecs/testVecs.cfg to update the path.


Cache Usage[edit]

The following describes the cache configuration in the sv04 build and the APIs available to perform cache operations such as writeback and invalidate.

Cacheability of DDR[edit]

Sections of DDR can be made cacheable or non-cacheable, and prefetchable or non-prefetchable, using the MAR registers; the granularity is 16MB. Of the entire 512 MB of DDR on Shannon (0x80000000 – 0x9FFFFFFF), one 16MB DDR section (0x80000000 – 0x80FFFFFF) is configured as non-cacheable and non-prefetchable, while all the other DDR sections are cacheable and prefetchable. The only non-cacheable and non-prefetchable DDR section is used for multi-core synchronization buffers for multi-core codecs. The MAR definition can be found in the "TMS320C66x DSP CorePac User Guide" (Literature Number: SPRUGW0A). The DDR cache setting can be changed by editing the file ggcfg\build\hdg\sv04\ggmemdef.bec in the structure vigdkMemoryContext_t.

Cacheability of L1P, L1D and L2[edit]

Cache configuration for L1P, L1D and L2 is done in the file ggcfg\build\hdg\sv04\gghwcfg.h. The recommended configuration is HAL_L1P_CACHE_32K, HAL_L1D_CACHE_32K and HAL_L2_CACHE_64K.

Cache Control APIs[edit]

The following functions (defined in dsp\siu\osal\bios6\siuOsalBios6.c) are available to perform various cache operations. They are wrappers calling BIOS 6 Cache APIs.

  • void siu_osal_wbinv_cache_all(void);
  • void siu_osal_wbinv_cache(void *base, tulong length, bool wait);
  • void siu_osal_inv_cache(void *base, tulong length, bool wait);
  • void siu_osal_wb_cache(void *base, tulong length, bool wait);
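
As an illustration, the snippet below shows a typical use of these wrappers around a host DMA exchange: invalidate before the core reads data the host just DMAed in, and write back before the host DMAs the core's output out. It assumes the siu OSAL header is included and that tulong is the framework's unsigned long typedef; the buffer names are hypothetical.

/* Illustrative cache maintenance around a host <-> DSP DMA exchange. */
void example_cache_usage(void *inBuf, tulong inLen, void *outBuf, tulong outLen)
{
    /* Host DMAed fresh input into DDR: discard any stale cached copy
       before this core reads the buffer (wait = 1: block until done). */
    siu_osal_inv_cache(inBuf, inLen, 1);

    /* ... encode/decode, writing results to outBuf ... */

    /* Flush the produced output from cache so the host DMA sees the final data. */
    siu_osal_wb_cache(outBuf, outLen, 1);
}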


Multi Core Video Interface APIs[edit]

It is possible that a single DSP core (or sometimes even all the DSP cores in a single chip) may not have enough horsepower to achieve real-time encode/decode. For example, HD-resolution H.264 encoding takes more than one core, and broadcast-quality HEVC encoding at HD resolutions takes more than one chip. In such situations, we partition the encode/decode to run concurrently on multiple cores/chips. There are multiple ways to partition an algorithm, and the choice depends on the algorithm design. Typically, it involves data partitioning (each core runs the same code, but on different data), function partitioning (pipeline processing, where core A runs function A, passes the results to core B, which runs function B, and so on), or a combination of both.

In any of the partitioning techniques, a few communication primitives are necessary for multicore/multichip communication: Barriers (for fork-join operations), Locks (for global semaphores), Map and Sync (for accessing/updating shared physical memories), and Mailboxes (for point-to-point communication). Depending on the algorithm and partitioning technique, some or all of the above primitives are used. The implementation of these primitives uses shared memory, PCIe, IPC registers, etc., all of which are platform dependent. Our intent is that the codec be agnostic to the implementation of these primitives and use them in an abstract way. The MCSDK Video application/framework hooks up and supplies the implementation for these primitives via what we call the IVIDMC interface. In the video codec creation phase, the actual implementation of the IVIDMC APIs is passed through codec static parameters as a set of function pointers. The video codec then uses only these APIs internally to do inter-core/inter-chip sync-up and/or communication.

MCSDK Video provides two versions of the multi-core video interface (ividmc) APIs: ividmc and ividmc3. ividmc supports multi-core within a single chip, while ividmc3 supports multi-core and multi-chip; in other words, ividmc3 is a superset of ividmc. We are maintaining ividmc for legacy reasons (H264 uses ividmc); HEVC uses ividmc3. Their corresponding API files ividmc.h and ividmc3.h are located in the dsp\siu\ividmc folder. The following describes some details of the ividmc3 APIs.

Overall structure of ividmc3 APIs is as follows:

typedef struct IVIDMC3_s {
XDAS_Void *(*keyCreate) (XDAS_UInt8 *name, XDAS_Int32 user_id, IVIDMC3_KEY_SPACE_e key_space,
             XDAS_Int32 num_users, XDAS_Int32 *user_ids, IVIDMC3_KeyCfg_t *cfg);
XDAS_Int32 (*barrWait)(XDAS_Int32 user_id, XDAS_Void *barrHandle);
XDAS_Int32 *(*shmMap) (XDAS_Int32 user_id, XDAS_Void *shmemHandle);
XDAS_Int32 (*shmSync)(XDAS_Int32 user_id, XDAS_Void *shmemHandle, XDAS_Int32 *shmem_base,
          XDAS_Int32 shmem_size, IVIDMC3_SYNC_ATTRIBS shmem_sync_attribs);
XDAS_Int32 (*shmSyncWait)(XDAS_Int32 user_id, XDAS_Void *shmemHandle, XDAS_Int32 *shmem_transid);
XDAS_Int32 (*lockAcquire)  (XDAS_Int32 user_id, XDAS_Void *lockHandle);
XDAS_Int32 (*lockRelease)  (XDAS_Int32 user_id, XDAS_Void *lockHandle);
XDAS_Int32 (*lockCheck)    (XDAS_Int32 user_id, XDAS_Void *lockHandle);
XDAS_Int32 (*mailBoxOpen)(XDAS_Void *mailBoxHandle);
XDAS_Int32 (*mailBoxWrite) (XDAS_Void *mailBoxHandle, XDAS_UInt8 *buf, XDAS_UInt32 size, XDAS_UInt32 trans_id);
XDAS_Int32 (*mailBoxRead) (XDAS_Void *mailBoxHandle, XDAS_UInt8 *buf, XDAS_UInt32 *size, XDAS_UInt32 *trans_id);
XDAS_Int32 (*mailBoxQuery) (XDAS_Void *mailBoxHandle);
XDAS_Int32 num_users;
IVIDMC3_TASK_e task_ID;
XDAS_Int32 user_id;
} IVIDMC3_t;

A brief description of each item is given below:

  • XDAS_Int32 user_id:
    core ID, including core ID inside the chip and also the chip ID
  • IVIDMC3_TASK_e task_ID:
    core task identification, can be global master, chip local master, or slave
  • num_users:
    number of cores in the team, i.e., the number of cores running together for processing the same stream
  • XDAS_Void *(*keyCreate) (XDAS_UInt8 *name, XDAS_Int32 user_id, IVIDMC3_KEY_SPACE_e key_space, XDAS_Int32 num_users, XDAS_Int32 *user_ids, IVIDMC3_KeyCfg_t *cfg):
    key assignment for Barrier, Shmem, Locks and Mailboxes; Barrier, Shmem and Locks resources are also initialized.
  • XDAS_Int32 (*barrWait)(XDAS_Int32 user_id, XDAS_Void *barrHandle):
    function call which will block until all other users have also called it
  • XDAS_Int32 *(*shmMap) (XDAS_Int32 user_id, XDAS_Void *shmemHandle)
    dynamically allocates shared memory region identified with handle "shmemHandle"; if region with same "shmemHandle" is already allocated, its base address is provided
  • XDAS_Int32 (*shmSync)(XDAS_Int32 user_id, XDAS_Void *shmemHandle, XDAS_Int32 *shmem_base, XDAS_Int32 shmem_size, IVIDMC3_SYNC_ATTRIBS shmem_sync_attribs):
    synchronizes shared memory region
  • XDAS_Int32 (*shmSyncWait)(XDAS_Int32 user_id, XDAS_Void *shmemHandle, XDAS_Int32 *shmem_transid):
    waits for synchronization operation to finish
  • XDAS_Int32 (*lockAcquire) (XDAS_Int32 user_id, XDAS_Void *lockHandle):
    critical region lock acquire in a multi-core environment; if the spinlock has been successfully acquired, this function returns 1; if the same spinlock has already been acquired by another participant, this function returns 0
  • XDAS_Int32 (*lockRelease) (XDAS_Int32 user_id, XDAS_Void *lockHandle):
    critical region lock release in multi-core environment
  • XDAS_Int32 (*lockCheck) (XDAS_Int32 user_id, XDAS_Void *lockHandle):
    critical region lock check to see if any of the users currently have acquired the lock
  • XDAS_Int32 (*mailBoxOpen)(XDAS_Void *mailBoxHandle):
    opens a mailBox; this is a blocking call, it waits until the mailBox creator is ready
  • XDAS_Int32 (*mailBoxWrite) (XDAS_Void *mailBoxHandle, XDAS_UInt8 *buf, XDAS_UInt32 size, XDAS_UInt32 trans_id):
    writes into a mailBox to deliver a message to the remote side; it is a non-blocking call
  • XDAS_Int32 (*mailBoxRead) (XDAS_Void *mailBoxHandle, XDAS_UInt8 *buf, XDAS_UInt32 *size, XDAS_UInt32 *trans_id):
    reads from a mailBox; it is a non-blocking call
  • XDAS_Int32 (*mailBoxQuery) (XDAS_Void *mailBoxHandle):
    polls mailBoxes for any available messages; it is a non-blocking call
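
A minimal sketch of how a codec might use these function pointers is given below; it assumes ividmc3.h is included and that the barrier and shared memory handles were created earlier via keyCreate. It is illustrative only, not actual codec source.

/* Hypothetical per-core helper for a multichip codec using the IVIDMC3 APIs. */
static void processSliceMultichip(IVIDMC3_t *mc, XDAS_Void *barrHandle,
                                  XDAS_Void *shmHandle)
{
    /* Block until every user in the team has reached the barrier. */
    mc->barrWait(mc->user_id, barrHandle);

    /* Map (or look up) the shared memory region and read the input buffer
       pointer published there by the master core. */
    XDAS_Int32 *shm = mc->shmMap(mc->user_id, shmHandle);
    XDAS_Int32  inputPtr = shm[0];
    (void)inputPtr;

    /* ... process this core's slice of the frame ... */

    /* Join again so the master knows all slaves have finished their slices. */
    mc->barrWait(mc->user_id, barrHandle);
}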


Memory Layout for Multichip Partitioned Codec[edit]

This section discusses how the data-sharing needs are addressed when a single codec instance is partitioned across multiple chips. It is possible to reserve a physically contiguous chunk of x86 memory and map it to be available in the PCIe space of all the participating chips. This certainly simplifies the design. However, there are three disadvantages: a) any core-to-core communication within a single chip has to go via x86 memory, which is expensive; b) it places a huge demand on the x86 memory footprint (which may not be available on all hosts); and c) if only two neighboring chips need to communicate with each other, it is better to map one chip's DDR to the PCIe space of the other chip, so that at least one chip can access local DDR.

Chip-N's Memory Layout in a Multichip partitioned Codec

The HEVC encoder partitioning scheme requires that the input data be available to all the participating chips. While encoding, only adjacent chips need to communicate with each other. Assuming a 4-chip use case, chips 0 <--> 1, 1 <--> 2, 2 <--> 3, and 3 <--> 0 need to communicate at virtual tile boundaries. The memory map shown in the above picture is designed to meet these requirements. Incoming mailboxes are placed in local DDR, whereas the outgoing mailboxes are placed in PCIe space, which is mapped into the DDR of the corresponding chips. Similarly, the data exchange regions are set up such that the region to chip N+1 is in local DDR and the region to chip N-1 is in PCIe-mapped space. All the mailboxes used to communicate between cores within chip N, as well as the mailboxes used to communicate with the host, are placed in DDR.

This memory placement accomplishes a) a reduced memory footprint on x86 (only truly global data structures are placed in x86 memory) and b) a DSP-to-DSP shared space that is local to one of the DSPs, which provides faster access for that DSP.

Multichip Scheduler and State Machine[edit]

The preceding sections examined the multicore/multichip communication primitives (the IVIDMC interface) and the memory layout. Those two pieces are just the building blocks needed by the algorithm. This section discusses the Scheduler and State Machine that ride on top of these building blocks. The State Machine is very application specific, with insight into all the parallel sections of the algorithm and the data dependencies.

Multichip-Scheduler


As shown in the above picture, one of the cores assumes the responsibility of being the master (the rest of the cores are slave cores). The master queries the State Machine to identify which tasks have their data dependencies cleared for dispatch, and the scheduler dispatches those tasks via Master --> Slave mailboxes. The scheduler also polls the Slave --> Master mailboxes, picks up the responses for dispatched tasks, and updates the State Machine.


Useful Resources and Links[edit]

Product Download and Updates[edit]

For product download and updates, please visit the links listed in the table below.

Product Download Link
MCSDK Video (2.1 GA) Download http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_video/latest/index_FDS.html
MCSDK Video (2.2 Alpha) Download http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_video/02_02_00_38/index_FDS.html
BIOS MCSDK Download http://software-dl.ti.com/sdoemb/sdoemb_public_sw/bios_mcsdk/02_01_02_05/index_FDS.html
Desktop Linux SDK Download http://software-dl.ti.com/sdoemb/sdoemb_public_sw/desktop_linux_sdk/01_00_02_00/index_FDS.html
C6678 Codec Download http://software-dl.ti.com/dsps/dsps_public_sw/codecs/C6678/index.html


MCSDK Video Instructions[edit]

Please visit the links below to install MCSDK Video, run the video demos, and get the details on how the MCSDK Video demos are developed.

Wiki Links
Getting Started Guide for PCIe demos MCSDK Video Getting Started for PCIe based Demos
Desktop Linux SDK Getting Started
Getting Started Guide for TFTP demos MCSDK Video Getting Started for TFTP based Demos
Development Guide MCSDK Video Development Guide


Technical Support[edit]

For technical discussions and issues, please visit the links listed in the table below.

Forum/Wiki Link
C66x Multicore forum http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639.aspx
Multimedia Software Codecs forum http://e2e.ti.com/support/embedded/multimedia_software_codecs/default.aspx
TI-RTOS forum http://e2e.ti.com/support/embedded/f/355.aspx
Code Composer Studio forum http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/3131.aspx
TI C/C++ Compiler forum http://e2e.ti.com/support/development_tools/compiler/f/343/t/34317.aspx
Embedded Processors wiki http://processors.wiki.ti.com

Note: When asking for help in the forum you should tag your posts in the Subject with "MCSDK VIDEO", the part number (e.g. "C6678") and additionally the component (e.g. "NWAL").



