
MCSDK VIDEO 2.1 PCIe Demo Development Guide



MCSDK Video

Version 2.x

Development Guide for PCIe based Video Demos

Last updated: 11/11/2013


Introduction[edit]

In MCSDK Video, the DSP application, built on BIOS-MCSDK, executes highly compute-intensive video processing on TI C66x multi-core DSPs. The MCSDK Video 2.x release also provides a host application, built on the Desktop Linux SDK, that offers a user-friendly interface for running video demos on a Linux PC. As shown in the figure below, the DSP and the host processor share common header files for message interpretation and communicate through the host-DSP mailbox.

MCSDKVideo_HostDSP_Overview.png

The intended audience for this document is developers interested in the design of the MCSDK Video host application and how it interacts with the MCSDK DSP application. As the MCSDK Video host application is built upon the Desktop Linux SDK, it is highly recommended to first go over the Development Guide for the Desktop Linux SDK before continuing. Specifically, this document discusses the following design aspects of the MCSDK Video host application.

  1. Multiple Threads to Support Parallel Operation
  2. Efficient and Fragmentation-Free Memory Management
  3. Control Message Exchange via Pipes and Mailboxes
  4. Scheduler Design Tailoring to Codecs
  5. Operation Details for Threads
  6. Example Memory Usage and Data Flow for H264HP Encoding and JPEG2000 Decoding


Multiple Threads to Support Parallel Operation[edit]

The MCSDK Video host application is multi-threaded so that the DSPs run in parallel while the host processor prepares subsequent data chunks for further video processing and/or consumes processed video data from the DSPs. In order to use all PCIe lanes concurrently, each device (DSP chip) is served by a Tx thread and an Rx thread. The following table lists the threads used in the MCSDK Video host application; a thread-creation sketch follows the table. For thread priority, the higher the number, the higher the priority.

Thread                                  | Description                                                      | Number of Threads   | Priority
Input thread (File Reader)              | Reads input data from file into an x86 input buffer              | One per application | 1
Output thread (File Writer or Display)  | Writes output data from the x86 output buffer to file or display | One per application | 4
DeviceIO Tx thread                      | Provides input data to the DSP and notifies the DSP              | One per device      | 2
DeviceIO Rx thread                      | Queries the DSP and gets output data from the DSP                | One per device      | 3
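
As a hedged sketch (not the application's actual code), this thread layout could be created with pthreads as shown below. The thread entry names and NUM_DEVICES are illustrative assumptions, and SCHED_RR real-time priorities require suitable privileges on Linux.

```c
#include <pthread.h>
#include <sched.h>

#define NUM_DEVICES 4   /* e.g., four C6678 chips on the Quad card */

/* Illustrative entry points for the four thread types in the table. */
extern void *input_thread(void *arg);       /* File Reader            */
extern void *output_thread(void *arg);      /* File Writer or Display */
extern void *deviceio_tx_thread(void *arg);
extern void *deviceio_rx_thread(void *arg);

static pthread_t spawn(void *(*fn)(void *), void *arg, int prio)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &sp);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_create(&tid, &attr, fn, arg);
    return tid;
}

void start_threads(void)
{
    int dev;
    spawn(input_thread,  NULL, 1);            /* lowest priority  */
    spawn(output_thread, NULL, 4);            /* highest priority */
    for (dev = 0; dev < NUM_DEVICES; dev++) { /* one Tx/Rx pair per device */
        spawn(deviceio_tx_thread, (void *)(long)dev, 2);
        spawn(deviceio_rx_thread, (void *)(long)dev, 3);
    }
}
```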


Efficient and Fragmentation-Free Memory Management[edit]

Built on the Desktop Linux SDK, the MCSDK Video host application uses the SDK's Buffer Manager and Contiguous Memory Driver components. The Contiguous Memory Driver is used to obtain physically contiguous memory on the host which can be mapped over the PCIe interface; a hedged allocation sketch follows. The Buffer Manager is used to achieve fragmentation-free memory management for efficient data exchange between host and DSP through EDMA.
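
The sketch below shows one plausible way contiguous, PCIe-mappable host memory could be obtained. The /dev/cmem node, the ioctl command value, and the request structure are illustrative stand-ins; the Contiguous Memory Driver's actual interface is documented in the Desktop Linux SDK, not here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define CMEM_DEV        "/dev/cmem"     /* assumed device node   */
#define CMEM_IOC_ALLOC  0xC0DE0001      /* assumed ioctl command */

typedef struct {
    uint64_t phys_addr;   /* physical address, mappable over PCIe */
    uint32_t size;        /* requested size, at most 4 MB         */
} cmem_alloc_t;

int cmem_open(void)
{
    return open(CMEM_DEV, O_RDWR);
}

/* Allocate a physically contiguous block and map it into this
 * process's address space; returns the virtual address and fills in
 * the physical address for PCIe mapping/EDMA. */
void *alloc_contig(int fd, uint32_t size, uint64_t *phys)
{
    cmem_alloc_t req = { .size = size };
    if (ioctl(fd, CMEM_IOC_ALLOC, &req) < 0)
        return NULL;
    *phys = req.phys_addr;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                fd, (off_t)req.phys_addr);
}
```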



Memory Types and Mechanisms for Data Exchange[edit]

While MCSDK Video DSP application conducts highly compute intensive video processing, MCSDK Video host application sends input data to DSPs for processing and/or receives processed data from DSPs for file saving or display. Three types of memory can be involved for such host-DSP data exchange:

  1. Contiguous x86 memory
  2. C66x DDR memory 
  3. C66x PCIe memory (Contiguous x86 buffers are mapped to this region of DSP memory) 

As in the Desktop Linux SDK, data exchange between the host and DSP applications can be achieved in three ways; a small helper sketching the selection logic follows the list.

  1. The host processor's data buffer (i.e., x86 memory) is copied into the DSP's DDR (i.e., C66x DDR memory) using EDMA. Since DDR memory is shared only among the cores of a single DSP, this is beneficial when a buffer is needed on only a single DSP or DSP core.
  2. The DSP picks up the data from the host memory buffer (x86 memory mapped to C66x PCIe memory) directly as it consumes the data. This is used when a buffer needs to be shared concurrently by multiple DSPs.
  3. The host processor copies data directly to DSP memory. This method offers much lower bandwidth than the previous methods since the host processor performs the transfer itself. It is typically used when a small amount of data needs to be transferred, such as configuration buffers.
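
The choice among these three methods can be captured in an illustrative helper. The enum names and the small-payload threshold below are assumptions for the sketch, not SDK definitions.

```c
typedef enum {
    XFER_EDMA_TO_DDR,   /* method 1: EDMA copy into one DSP's DDR        */
    XFER_MAPPED_PCIE,   /* method 2: DSP reads mapped x86 memory          */
    XFER_HOST_WRITE     /* method 3: host CPU writes DSP memory directly  */
} xfer_method_t;

static xfer_method_t pick_method(int shared_across_dsps, unsigned size)
{
    if (shared_across_dsps)
        return XFER_MAPPED_PCIE;   /* visible to all DSPs at once    */
    if (size <= 4096)              /* small config-style payloads    */
        return XFER_HOST_WRITE;
    return XFER_EDMA_TO_DDR;       /* bulk data for a single DSP     */
}
```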



Multiple Types of Buffer Pools[edit]

The MCSDK Video host application uses the Buffer Manager component from the Desktop Linux SDK to perform fragmentation-free memory allocation. Six types of buffer pools are created and used in the MCSDK Video host application; a pool-creation sketch follows the table and note below.

Buffer Pool           | Number of Pools     | Chunk Size | Number of Chunks | Allocator Thread  | Freer Thread
x86 Input Pool        | One per application | 4 MB       | 32               | File Reader       | DeviceIO Tx or Rx
x86 Output Pool       | One per application | 4 MB       | 80               | DeviceIO Tx or Rx | File Writer
C66x DDR Input Pool   | One per device      | 16 MB      | 16               | DeviceIO Tx       | DeviceIO Rx
C66x DDR Output Pool  | One per device      | 16 MB      | 16               | DeviceIO Tx       | DeviceIO Rx
C66x PCIe Input Pool  | One per application | 16 MB      | 2                | DeviceIO Tx       | DeviceIO Rx
C66x PCIe Output Pool | One per application | 8 MB       | 2                | DeviceIO Tx       | DeviceIO Rx

[note: need to add details on how the number of chunks and chunk sizes are determined; not all the buffer pools are used --> need to have specific use cases; need to avoid using DeviceIO "Tx or Rx"]
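
As a hedged illustration of the table above, pool creation might look like the following, where buf_pool_create() is an assumed stand-in for the Desktop Linux SDK Buffer Manager API.

```c
#define MB (1024u * 1024u)
#define NUM_DEVICES 4   /* e.g., four C6678 chips on the Quad card */

typedef struct buf_pool buf_pool_t;
/* Assumed Buffer Manager call: create a pool of num_chunks chunks. */
extern buf_pool_t *buf_pool_create(unsigned chunk_size, unsigned num_chunks);

buf_pool_t *x86_in, *x86_out;
buf_pool_t *ddr_in[NUM_DEVICES], *ddr_out[NUM_DEVICES];
buf_pool_t *pcie_in, *pcie_out;

void create_pools(void)
{
    int dev;
    x86_in  = buf_pool_create(4 * MB, 32);       /* x86 Input Pool   */
    x86_out = buf_pool_create(4 * MB, 80);       /* x86 Output Pool  */
    for (dev = 0; dev < NUM_DEVICES; dev++) {
        ddr_in[dev]  = buf_pool_create(16 * MB, 16); /* C66x DDR In  */
        ddr_out[dev] = buf_pool_create(16 * MB, 16); /* C66x DDR Out */
    }
    pcie_in  = buf_pool_create(16 * MB, 2);      /* C66x PCIe Input  */
    pcie_out = buf_pool_create(8 * MB, 2);       /* C66x PCIe Output */
}
```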


Frame Descriptor to Construct Frames from x86 Buffer Pools[edit]

In order to use EDMA for efficient data transfer, the Contiguous Memory Driver from the Desktop Linux SDK is used to allocate x86 input/output buffers in contiguous physical memory. Due to the Linux restriction that no more than 4 MB of physically contiguous memory can be allocated, the chunk size of the x86 input/output pools is set to 4 MB. The Desktop Linux SDK can overcome this limit by configuring the bootloader to reserve memory outside of the kernel, but this is unreliable on systems with a small amount of memory. Since the size of a contiguous buffer is fixed while the data unit in MCSDK Video is a video frame, which can be large, a frame descriptor is introduced to allocate and group multiple x86 buffers for a single frame.

A frame descriptor contains a frame ID, a handle to the buffer pool, and an array of buffer descriptors which specify the base and length of the buffers from the buffer pool. Using a frame descriptor, a frame can be allocated across multiple buffers with arbitrary start and end offsets, as shown in the example below. Further, these discontiguous 4 MB (or smaller) buffers can be mapped so that they appear as one contiguous buffer in the DSP address space. A sketch of a possible descriptor layout follows the figure.
FrameBuffer.png
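
A plausible C shape for such a frame descriptor is sketched below; the field names and the MAX_BUFS_PER_FRAME limit are illustrative assumptions, not the SDK's actual definitions.

```c
#include <stdint.h>

#define MAX_BUFS_PER_FRAME 8   /* e.g., 8 x 4 MB covers a 32 MB frame */

typedef struct {
    uint8_t  *base;    /* start of frame data within one pool chunk */
    uint32_t  length;  /* bytes of frame data in this chunk         */
} buf_desc_t;

typedef struct {
    uint32_t    frame_id;                  /* used for output reordering */
    void       *pool;                      /* owning buffer pool handle  */
    uint32_t    num_bufs;                  /* chunks making up the frame */
    buf_desc_t  bufs[MAX_BUFS_PER_FRAME];  /* discontiguous pieces       */
} frame_desc_t;

/* Total frame size is the sum of the pieces; the PCIe mapping
 * presents them to the DSP as one contiguous range. */
static uint32_t frame_size(const frame_desc_t *f)
{
    uint32_t i, total = 0;
    for (i = 0; i < f->num_bufs; i++)
        total += f->bufs[i].length;
    return total;
}
```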

Control Message Exchange via Pipes and Mailboxes[edit]

The four threads of the MCSDK Video host application exchange control messages through pipes and the mailbox component from the Desktop Linux SDK to achieve seamless data exchange between the host processor and the DSPs. There are two mailboxes for each DSP core: one for storing pending host-to-DSP messages and another for storing pending DSP-to-host messages. A sketch of the pipe handoff follows the list below.

  1. When an input frame is read from file into the input buffer, the Input thread (File Reader) sends the frame information to the DeviceIO Tx thread by writing to the input pipe.
  2. The DeviceIO Tx thread reads from the input pipe, transfers the frame data to the DSP, and then communicates the input frame information to the DSP using the mailbox.
  3. The DeviceIO Rx thread queries the DSPs via mailbox to find a DSP-processed output frame. It transfers the output data to a contiguous x86 buffer and then sends the output frame information to the Output thread by writing to a Reorder Queue. Through the Reorder Queue, the processed frames from the DSPs are reordered according to frame ID and written to the output pipe in the correct order.
  4. The Output thread (File Writer or Display) reads from the output pipe and then saves the output to disk or displays it.
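
The pipe handoff in step 1 might look like the sketch below, assuming a fixed-size message struct (the layout is an assumption, not the application's actual definition). Because POSIX guarantees that pipe writes of at most PIPE_BUF bytes are atomic, a small fixed-size message needs no extra locking.

```c
#include <stdint.h>
#include <unistd.h>

typedef struct {
    void    *frame;      /* frame descriptor for the input frame */
    uint32_t device_id;  /* destination DSP chip                 */
    uint32_t node_id;    /* destination DSP core                 */
    uint32_t frame_id;   /* used later for output reordering     */
} pipe_msg_t;

/* Input thread side: one atomic write per frame. */
static int send_frame(int pipe_wr_fd, const pipe_msg_t *msg)
{
    return write(pipe_wr_fd, msg, sizeof(*msg)) == sizeof(*msg) ? 0 : -1;
}

/* DeviceIO Tx side: blocking read until a frame arrives. */
static int recv_frame(int pipe_rd_fd, pipe_msg_t *msg)
{
    return read(pipe_rd_fd, msg, sizeof(*msg)) == sizeof(*msg) ? 0 : -1;
}
```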

VideoHost_Arch1.PNG


Scheduler Design Tailoring to Codecs[edit]

In the MCSDK Video application, a scheduler keeps track of certain DSP cores (depending on the scheduler topology, as described below) and dispatches tasks to those cores according to their task load.

Tailored to specific codecs, the following three scheduler topologies can be supported, along with anything in between.

  • The scheduler manages all DSP cores and dispatches tasks to all of them. An example codec is the JPEG2000 decoder.


Scheduler_Topology1.JPG

  • The scheduler manages one master core in each DSP chip. High-level tasks are dispatched by the scheduler, while the master core in each DSP internally dispatches tasks to the slave cores. An example codec is the AVCIU encoder.


Scheduler_Topology2.JPG

  • The scheduler manages and delegates tasks to a single master DSP core. The master core internally dispatches tasks to the slave cores. An example codec is the H.264HP encoder.


Scheduler_Topology3.JPG

The scheduler maintains the task load of its managed DSP cores as the number of tasks queued on each core. Initially, all managed cores have zero tasks queued. When a task is assigned to a core, its task count is incremented by one; when processed output is picked up from a core, its task count is decremented by one.

According to the task load maintained as above, the scheduler in the MCSDK Video host application picks the least busy core for task dispatch, scanning from the first core. A minimal sketch follows.
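
A minimal sketch of this load tracking and least-busy selection (names are illustrative):

```c
#define MAX_NODES 32           /* managed cores; depends on topology */

static int task_count[MAX_NODES];  /* queued tasks per managed core  */
static int num_nodes;              /* cores managed by the scheduler */

void sched_task_assigned(int node) { task_count[node]++; }
void sched_output_picked(int node) { task_count[node]--; }

/* Scan from the first core and return the least busy one;
 * ties go to the lower-numbered core. */
int sched_pick_node(void)
{
    int node, best = 0;
    for (node = 1; node < num_nodes; node++)
        if (task_count[node] < task_count[best])
            best = node;
    return best;
}
```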


Operation Details for Threads[edit]

With major components described as above for the MCSDK Video host application, this section presents the operation details for the four threads of the video host application.


Input Thread (File Reader Thread)[edit]

Calling Sequence of Input Thread
  • Allocate x86 input buffer from input buffer pool and place into Frame Descriptor
  • Process iteration, file, and frame indices and open the correct input file
  • Read from input file into input buffer
  • If the frame size is met or the file has ended, call the scheduler to get the destination node (i.e., DSP core) and send the frame descriptor over the input pipe; otherwise continue to allocate new input buffers


Configuration of Input Thread
  • Pipe descriptor
  • Buffer pool handle
  • Number of iterations, files, and frames
  • Size of frame to read before sending over pipe
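
Combining the calling sequence and configuration above, a condensed, hedged sketch of the Input thread follows. It reuses the frame_desc_t and pipe_msg_t shapes sketched earlier; frame_alloc() and read_file_into_frame() are assumed helpers, and the iteration/file loops are collapsed into a single frame loop for brevity.

```c
#include <stdio.h>
#include <stdint.h>

typedef struct {
    const char *path;        /* input file (illustrative)            */
    int         pipe_wr_fd;  /* input pipe to a DeviceIO Tx thread   */
    void       *pool;        /* x86 input buffer pool handle         */
    int         num_frames;  /* frames to read                       */
    unsigned    frame_size;  /* bytes to read before sending         */
} input_cfg_t;

extern frame_desc_t *frame_alloc(void *pool, unsigned size);
extern int read_file_into_frame(FILE *fp, frame_desc_t *f, unsigned size);
extern int sched_pick_node(void);
extern int send_frame(int fd, const pipe_msg_t *msg);

void *input_thread(void *arg)
{
    input_cfg_t *cfg = arg;
    FILE *fp = fopen(cfg->path, "rb");
    uint32_t id;

    for (id = 0; fp && id < (uint32_t)cfg->num_frames; id++) {
        frame_desc_t *f = frame_alloc(cfg->pool, cfg->frame_size);
        if (!f)
            break;                          /* pool exhausted       */
        if (read_file_into_frame(fp, f, cfg->frame_size) <= 0)
            break;                          /* end of file          */
        pipe_msg_t msg = { .frame = f, .node_id = sched_pick_node(),
                           .frame_id = id };
        send_frame(cfg->pipe_wr_fd, &msg);  /* hand off to Tx thread */
    }
    if (fp)
        fclose(fp);
    return NULL;
}
```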



DeviceIO TX Thread[edit]

Calling Sequence of DeviceIO TX Thread
  • Read a message from the input pipe (blocking); the pipe message contains the frame descriptor, device ID, node ID, and frame ID
  • Look up the node in the instance
  • Allocate an input DeviceIO descriptor (allocates device memory) and copy in the frame descriptor
  • Allocate an output DeviceIO descriptor (allocates device memory)
  • If the output device buffer is in PCIe memory, allocate an output frame descriptor
  • Provide the input frame to the DSP (map or DMA), and map the output frame if needed
  • If the input device buffer is not in PCIe memory, free the input frame descriptor
  • Compose the mailbox message
  • Poll for DMA completion, yielding between polls (see the sketch below)
  • Send the mailbox message to the DSP node
  • If no free buffers are available, yield and try again (similar to DMA polling)
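
The poll-and-yield pattern used for DMA completion (and for buffer availability) can be sketched as follows; dma_is_done() is an assumed stand-in for the SDK's DMA completion query.

```c
#include <sched.h>

extern int dma_is_done(int dma_handle);  /* assumed completion query */

static void wait_dma(int dma_handle)
{
    /* Yield the CPU between polls so other host threads (e.g., the
     * Rx thread for the same device) can make progress. */
    while (!dma_is_done(dma_handle))
        sched_yield();
}
```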

[note: need to add details for device memory. need to have separate description for mapping and DMA]



DeviceIO RX Thread[edit]

Calling Sequence of DeviceIO RX Thread
  • Query mailboxes to find a node to unload
  • Read mailbox message
  • Find Input DeviceIO descriptor, based on freeBufIdD
  • If input was mapped to PCIe memory, free input frame descriptor buffers
  • Free Input DeviceIO descriptor and corresponding device buffer
  • Find Output DeviceIO descriptor based on outBufPtr
  • If output device buffer is not in PCIe space, allocate output frame descriptor.
  • Get output frame buffers from device. (Start DMA if needed, nothing if mapped)
  • Poll for DMA completion. Yield if not.
  • Call scheduler to unload node.
  • Place output frame in reorder queue.
  • Free Output DeviceIO descriptor, and corresponding device buffer.
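
The Reorder Queue used near the end of this sequence can be sketched as below: frames may complete out of order across cores, so each is held until all earlier frame IDs have been handed downstream. The implementation is an assumption, reusing the frame_desc_t sketched earlier (locking omitted for brevity).

```c
#include <stdint.h>
#include <stddef.h>

#define RQ_DEPTH 64

typedef struct {
    frame_desc_t *slot[RQ_DEPTH];  /* indexed by frame_id % RQ_DEPTH */
    uint32_t      next_id;         /* next frame ID owed downstream  */
} reorder_q_t;

/* Rx thread side: park a completed frame at its slot. */
void rq_put(reorder_q_t *q, frame_desc_t *f)
{
    q->slot[f->frame_id % RQ_DEPTH] = f;
}

/* Output thread side: returns frames strictly in frame-ID order,
 * or NULL if the next expected frame has not arrived yet. */
frame_desc_t *rq_get(reorder_q_t *q)
{
    frame_desc_t *f = q->slot[q->next_id % RQ_DEPTH];
    if (f) {
        q->slot[q->next_id % RQ_DEPTH] = NULL;
        q->next_id++;
    }
    return f;
}
```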

[note: need to add details for device memory. need to have separate description for mapping and DMA]



Output Thread (File Writer or Display Thread)[edit]

Calling Sequence of Output Thread
  • Get frame descriptor from Reorder Queue
  • Process iteration, file, frame, and tile counts to get correct output file
  • Write frame to file and free frame descriptor buffers, or display frames


Configuration of Output Thread
  • Reorder queue handle
  • Number of iterations, files, and frames


Developing Video Demos on Quad C6678 PCIe card[edit]

This section discusses memory usage when developing video demos on the Quad C6678 PCIe card, and goes over the data flow for the H264HP encoding and JPEG2000 decoding demos.



Memory Usage on Quad C6678 PCIe card[edit]

The Quad C6678 PCIe card contains four C6678 DSP chips. As shown below, each core has its local L1 and L2 memory. Each chip has 4MB of MSMC and 1GB of dedicated DDR memory which is not available to the other chips. Each chip also has a 256 MB memory range which can be mapped to the memory of other devices connected through PCIe. In this demo, all C6678 devices map 128MB (0x60000000 – 0x67FFFFFF) of this memory space to the same global host x86 memory. This provides a global shared memory region which is available to both the host and all C6678 DSPs.
Memory_Overview.PNG

Furthermore, the Global Shared Memory is divided into multiple regions to allow data exchange between the x86 host and DSPs and communication between DSPs.

  1. Shared Memory (0x60000000 – 0x61FFFFFF): this area is used for communication between DSPs. The IVIDMC multichip implementation uses this area for shared memory and software barriers.
  2. Codec Output Scratch (0x62000000 – 0x62FFFFFF): this area is divided between all cores and is used for the output of the encoding algorithm. Once all cores have finished processing, the master core will accumulate the outputs generated here to be placed in a host supplied buffer.
  3. Host I/O Buffers ( 0x63000000 – 0x66FFFFFF): this area is owned by the host which will divide it into input and output buffers. The host writes data to the input buffer, supplies an input and output buffer pointer to the master core, and will read the output once the processing is complete.
  4. DMA Buffers (0x67000000 – 0x67FFFFFF): this memory region is used by the host when it triggers DMA transfers into the DDR memory on the DSPs.


Memory_Global_Shared1.PNG
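
For reference, the layout above can be captured as C66x-side address constants; the sizes follow directly from the ranges listed (the constant names are illustrative).

```c
/* Global shared memory: 128 MB PCIe window seen by all C6678 DSPs. */
#define GSM_BASE         0x60000000u  /* start of 128 MB window       */
#define GSM_SHARED_BASE  0x60000000u  /* IVIDMC shared/barriers, 32 MB */
#define GSM_SCRATCH_BASE 0x62000000u  /* codec output scratch, 16 MB   */
#define GSM_HOST_IO_BASE 0x63000000u  /* host I/O buffers, 64 MB       */
#define GSM_DMA_BASE     0x67000000u  /* host DMA staging, 16 MB       */
#define GSM_END          0x68000000u  /* end of window (exclusive)     */
```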



Data Flow of H264HP Encoding[edit]

The data flow of H264HP encoding is described below. DSP operations are differentiated using italic text.

  1. Host [Input Thread] allocates an input buffer from the Contiguous x86 Input Buffer Pool.
  2. Host [Input Thread] writes the input frame to the input buffer and sends the buffer information over the pipe to the DeviceIO Tx Thread.
  3. Initially, all slave cores are waiting on a barrier for master core to enter the barrier while master core is polling on mailbox waiting to receive a host message requesting Encode.
  4. Host [DeviceIO Tx Thread] allocates output buffers from the output c66 DDR Buffer Pool.
  5. Host [DeviceIO Tx Thread] sends a message to the master core with input and output buffer pointers and an input ID.
  6. Master core writes input buffer pointer to Shared Memory region and enters the software barrier, thus releasing the slave cores from the barrier.
  7. Slave cores acquire input pointer from Shared Memory region.
  8. All cores exit out of Barrier and process the input and write their output to the Codec Output Scratch region.
  9. Master core accumulates each core’s output and writes it to the output buffer pointer that was supplied by the host.
  10. Master core sends a message to the host with output buffer pointer, output ID, and 0 or more input IDs corresponding to input buffers which can be freed.
  11. Host [DeviceIO Rx Thread] queries master core's mailbox to get the message for the processed frame.
  12. Host [DeviceIO Rx Thread] frees any input buffers (both x86 and c66 PCIe buffers) as indicated by the message from the master core.
  13. Host [DeviceIO Rx Thread] allocates output x86 buffers. The number of buffers allocated is determined by the size of the output c66 DDR buffers.
  14. Host [DeviceIO Rx Thread] triggers the DMA from the output c66 DDR buffer to the output x86 buffer.
  15. Host [DeviceIO Rx Thread] polls for the DMA transfer to be complete.
  16. Host [DeviceIO Rx Thread] frees the output c66 DDR buffer.
  17. Host [DeviceIO Rx Thread] places output frame into the Reorder Queue.
  18. Host [Output Thread] reads output from output x86 buffer pointer, frees output buffer.



Data Flow of JPEG2000 Decoding[edit]

The data flow of JPEG2000 decoding is described below. DSP operations are differentiated using italic text. [need to add details]

  1. Host [Input Thread] allocates an input buffer from the Contiguous x86 Input Buffer Pool and writes the input frame to the input buffer.
  2. Host [Input Thread] gets the destination core and DSP from the Scheduler, and sends the buffer and destination core information over the pipe to the DeviceIO Tx Thread corresponding to the designated DSP.
  3. Initially, all cores are polling on mailbox waiting to receive a host message requesting Decode.
  4. Host [DeviceIO Tx Thread] receives the message over pipe and allocates an input and output buffer from the c66 DDR Buffer Pools and triggers a DMA transfer from the input x86 buffer to the input c66 DDR buffer.
  5. Host [DeviceIO Tx Thread] polls for the DMA transfer to be complete.
  6. Host [DeviceIO Tx Thread] frees the input x86 buffer.
  7. Host [DeviceIO Tx Thread] sends a message to the designated core with input and output buffer pointers and an input ID.
  8. Core receives the mailbox message and decodes the input c66 DDR buffer and writes the output to the output c66 DDR buffer.
  9. Core sends a message to the host with output buffer pointer, output ID, and 0 or more input IDs corresponding to input c66 DDR buffers which can be freed.
  10. Host [DeviceIO Rx Thread] queries each core's mailbox to get the message for the processed frame.
  11. Host [DeviceIO Rx Thread] frees the input c66 DDR buffer.
  12. Host [DeviceIO Rx Thread] allocates output x86 buffers. The number of buffers allocated is determined by the size of the output c66 DDR buffers.
  13. Host [DeviceIO Rx Thread] triggers the DMA from the output c66 DDR buffer to the output x86 buffer. 
  14. Host [DeviceIO Rx Thread] polls for the DMA transfer to be complete.
  15. Host [DeviceIO Rx Thread] frees the output c66 DDR buffer.
  16. Host [DeviceIO Rx Thread] places output frame into the Reorder Queue.
  17. Host [Output Thread] reads output from the output x86 buffer pointer and frees the output buffer.


Useful Resources and Links[edit]

Product Download and Updates[edit]

For product download and updates, please visit the links listed in the table below.

Product                  | Download Link
MCSDK Video (2.1 GA)     | http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_video/latest/index_FDS.html
MCSDK Video (2.2 Alpha)  | http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_video/02_02_00_23/index_FDS.html and http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_video/02_02_00_28/index_FDS.html
BIOS MCSDK               | http://software-dl.ti.com/sdoemb/sdoemb_public_sw/bios_mcsdk/02_01_02_05/index_FDS.html
Desktop Linux SDK        | http://software-dl.ti.com/sdoemb/sdoemb_public_sw/desktop_linux_sdk/01_00_00_07/index_FDS.html
C6678 Codecs             | http://software-dl.ti.com/dsps/dsps_public_sw/codecs/C6678/index.html


MCSDK Video Instructions[edit]

Please visit the links below to install MCSDK Video, run the video demos, and get the details on how the MCSDK Video demos are developed.

Guide                 | Wiki Links
Getting Started Guide | MCSDK Video Getting Started for Linux; MCSDK Video Getting Started for Windows; Desktop Linux SDK Getting Started
Demo Guide            | Run PCIe based Demos on Advantech DSPC-8681E & DSPC-8682E; Run TFTP based Demos on TMDXEVM6678LXE
Development Guide     | MCSDK Video Host (via PCIe) Development Guide; MCSDK Video DSP Development Guide


Technical Support[edit]

For technical discussions and issues, please visit the links listed in the table below.

Forum/Wiki                       | Link
C66x Multicore forum             | http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639.aspx
Multimedia Software Codecs forum | http://e2e.ti.com/support/embedded/multimedia_software_codecs/default.aspx
TI-RTOS forum                    | http://e2e.ti.com/support/embedded/f/355.aspx
Code Composer Studio forum       | http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/3131.aspx
TI C/C++ Compiler forum          | http://e2e.ti.com/support/development_tools/compiler/f/343/t/34317.aspx
Embedded Processors wiki         | http://processors.wiki.ti.com

Note: When asking for help in the forum, you should tag your posts in the Subject with "MCSDK VIDEO", the part number (e.g. "C6678"), and additionally the component (e.g. "NWAL").



