
MCSDK VIDEO 2.1 PCIeDemo Development Guide

From Texas Instruments Wiki


MCSDK Video

Version 2.1.0 (Alpha Release)

Development Guide

Last updated: 09/21/2012


Introduction

In MCSDK Video, the DSP application, built on BIOS-MCSDK, executes highly compute-intensive video processing on TI C66x multi-core DSPs. The MCSDK Video 2.1 release also provides a host application, built on the Desktop Linux SDK, that offers a user-friendly interface for running video demos on a Linux PC. As shown in the figure below, the DSP and the host processor share common header files for message interpretation and communicate through the host-DSP mailbox.

MCSDKVideo_HostDSP_Overview.png

The intended audience for this document is developers interested in the design of the MCSDK Video host application and how it interacts with the MCSDK DSP application. As the MCSDK Video host application is built upon the Desktop Linux SDK, it is highly recommended to first go over the Development Guide for the Desktop Linux SDK before continuing. Specifically, this document discusses the following design aspects of the MCSDK Video host application.

  1. Multiple Threads to Support Parallel Operation
  2. Scheduler Design Tailoring to Codecs
  3. Data and Control Communication Between Host and DSP Applications
  4. Example Memory Usage and Data Flow for Multi-Chip Multi-core H264BP Encoding


Multiple Threads to Support Parallel Operation

The MCSDK Video host application is multi-threaded so that the DSPs can run in parallel while the host processor prepares subsequent data chunks for further video processing and/or consumes processed video data from the DSPs. In order to use all PCIe lanes concurrently, each device (DSP chip) is supplied with a Tx thread and an Rx thread. The following table lists the threads used in the MCSDK Video host application. For thread priority, the higher the number, the higher the priority.

Thread | Description | Number of Threads | Priority
Input thread (File Reader) | Read input data from file into x86 input buffer | One per application | 1
Output thread (File Writer or Display) | Write output data in x86 output buffer to file or display | One per application | 4
DeviceIO Tx thread | Provide input data to DSP and notify DSP | One per device | 2
DeviceIO Rx thread | Query DSP and get output data from DSP | One per device | 3
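The per-application vs. per-device thread layout above can be sketched as a small model. This is illustrative only; the thread names and the `threads_for` helper are hypothetical, not taken from the MCSDK source.

```python
# Hypothetical model of the host application's thread layout.
# Higher priority number = higher priority, as in the table above.
THREAD_TABLE = [
    # (name, scope, priority)
    ("input",       "per-application", 1),
    ("deviceio_tx", "per-device",      2),
    ("deviceio_rx", "per-device",      3),
    ("output",      "per-application", 4),
]

def threads_for(num_devices):
    """Expand the table into the concrete thread list for a host driving
    `num_devices` DSP chips; per-device threads are replicated per chip."""
    threads = []
    for name, scope, prio in THREAD_TABLE:
        count = num_devices if scope == "per-device" else 1
        for i in range(count):
            threads.append((f"{name}{i}" if count > 1 else name, prio))
    return threads
```

For a four-chip card this yields ten threads: one reader, one writer, and a Tx/Rx pair per device.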


Efficient and Fragmentation-Free Memory Management

Built on the Desktop Linux SDK, the MCSDK Video host application uses the SDK's Buffer Manager and Contiguous Memory Driver components to achieve fragmentation-free memory management and efficient data exchange between host and DSP through EDMA.



Memory Types and Mechanisms for Data Exchange

While the MCSDK Video DSP application conducts highly compute-intensive video processing, the MCSDK Video host application sends input data to the DSPs for processing and/or receives processed data from the DSPs for file saving or display. Three types of memory can be involved in this host-DSP data exchange:

  1. x86 memory
  2. C66x PCIe memory (can be mapped from x86 memory)
  3. C66x DDR memory

As in the Desktop Linux SDK, data exchange between the host and DSP applications can be achieved in two ways.

  1. Host processor's data buffer (i.e., x86 memory) is copied into DSP's DDR (i.e., C66x DDR memory) using EDMA
  2. DSP picks up the data from host memory buffer (x86 memory mapped to C66x PCIe memory) directly as the DSP consumes the data



Multiple Types of Buffer Pools

The MCSDK Video host application uses the Buffer Manager component from the Desktop Linux SDK for fragmentation-free memory allocation. Six types of buffer pools are created and used in the MCSDK Video host application.

Buffer Pool | Number of Pools | Chunk Size | Number of Chunks | Allocator Thread | Freer Thread
x86 Input Pool | One per application | 4MB | 32 | File Reader | DeviceIO Tx or Rx
x86 Output Pool | One per application | 4MB | 64 | DeviceIO Tx or Rx | File Writer
C66x DDR Input Pool | One per device | 8MB | 16 | DeviceIO Tx | DeviceIO Rx
C66x DDR Output Pool | One per device | 8MB | 32 | DeviceIO Tx | DeviceIO Rx
C66x PCIe Input Pool | One per application | 16MB | 2 | DeviceIO Tx | DeviceIO Rx
C66x PCIe Output Pool | One per application | 16MB | 2 | DeviceIO Tx | DeviceIO Rx

[note: need to add details on how the number of chunks and chunk sizes are determined; not all the buffer pools are used --> need to have specific use cases; need to avoid using DeviceIO "Tx or Rx"]
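The fragmentation-free property comes from every pool handing out fixed-size chunks, so repeated allocate/free cycles cannot splinter memory. A minimal stand-in for such a pool is sketched below; the `BufferPool` class and its methods are illustrative assumptions, not the Buffer Manager's actual API.

```python
from collections import deque

class BufferPool:
    """Minimal fixed-chunk buffer pool in the spirit of the Desktop Linux SDK
    Buffer Manager: all chunks in a pool have the same size, so alloc/free in
    any order never fragments the pool. Illustrative sketch only."""
    def __init__(self, chunk_size, num_chunks):
        self.chunk_size = chunk_size
        self.free_chunks = deque(range(num_chunks))  # ids of free chunks

    def alloc(self):
        """Return a free chunk id, or None if the pool is exhausted."""
        return self.free_chunks.popleft() if self.free_chunks else None

    def free(self, chunk_id):
        """Return a chunk to the pool."""
        self.free_chunks.append(chunk_id)

# Example: the x86 Input Pool from the table above (4MB x 32 chunks)
pool = BufferPool(4 * 1024 * 1024, 32)
```

A caller that finds the pool exhausted simply retries later (see the yield-and-retry step in the DeviceIO Tx thread below).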


Frame Descriptor to Construct Frames from x86 Buffer Pools

In order to use EDMA for efficient data transfer, the Contiguous Memory Driver from the Desktop Linux SDK is used to allocate x86 input/output buffers in contiguous physical memory. Because Linux cannot allocate more than 4MB of physically contiguous memory, the chunk size of the x86 input/output pools is set to 4MB. On the other hand, the data unit in MCSDK Video is the video frame, whose size can be larger than 4MB. To address this, the frame descriptor is introduced to allocate and group multiple x86 buffers for a single frame.

A frame descriptor contains a frame ID, a handle to the buffer pool, and an array of buffer descriptors which specify the base and length of the buffers from the buffer pool. Using a frame descriptor, a frame can be allocated across multiple buffers with arbitrary start and end offsets, as shown in the example below. Further, these discontiguous 4MB (or smaller) buffers can be mapped so that they appear as contiguous buffers in the DSP address space.
FrameBuffer.png
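The grouping of multiple 4MB chunks into one frame can be sketched as below. All structure and function names here are hypothetical stand-ins, not the actual MCSDK data structures; the toy pool simply lays chunks out back to back.

```python
from dataclasses import dataclass, field

CHUNK = 4 * 1024 * 1024   # the 4MB contiguous-allocation limit from the text

@dataclass
class BufDesc:
    base: int     # start offset of this buffer within the pool
    length: int   # number of frame bytes held in this buffer

@dataclass
class FrameDesc:
    frame_id: int
    pool_handle: object
    bufs: list = field(default_factory=list)   # array of buffer descriptors

class ToyPool:
    """Stand-in pool whose chunks are handed out back to back."""
    def __init__(self):
        self.next_base = 0
    def alloc(self):
        base = self.next_base
        self.next_base += CHUNK
        return base

def alloc_frame(pool_handle, frame_id, frame_size):
    """Allocate a frame across as many <=4MB chunks as needed."""
    fd = FrameDesc(frame_id, pool_handle)
    remaining = frame_size
    while remaining > 0:
        n = min(CHUNK, remaining)
        fd.bufs.append(BufDesc(base=pool_handle.alloc(), length=n))
        remaining -= n
    return fd
```

For example, a 10MB frame lands in three buffers of 4MB, 4MB, and 2MB.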

Control Message Exchange via Pipes and Mailboxes

The four threads of the MCSDK Video host application exchange control messages through pipes and the mailbox component from the Desktop Linux SDK to achieve seamless data exchange between the host processor and the DSPs. There are two mailboxes for each DSP core: one for storing pending messages from host to DSP, and another for storing pending messages from DSP to host. Each mailbox has four nodes.

  1. When an input frame is read from file into the input buffer, the Input thread (File Reader) sends the frame information to the DeviceIO Tx thread by writing to the input pipe.
  2. The DeviceIO Tx thread reads from the input pipe, and then communicates the input frame information to the DSPs using the mailbox.
  3. The DeviceIO Rx thread queries the DSPs via the mailbox to find a DSP-processed output frame. It then sends the output frame information to the Output thread by writing to a Reorder Queue and then to the output pipe. Through the Reorder Queue, the processed frames sent from the DSPs are reordered according to the frame ID so that they are displayed in the correct order.
  4. The Output thread (File Writer or Display) reads from the output pipe, and then saves the output to disk or displays it.

VideoHost_Arch1.PNG
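The Reorder Queue in step 3 can be modeled as a min-heap keyed on frame ID: frames may arrive out of order from different DSPs, but are released only in sequence. This is an illustrative sketch, not the SDK implementation.

```python
import heapq

class ReorderQueue:
    """Release DSP output frames in frame-ID order regardless of the order
    in which the DSPs finish them. Sketch under the assumption that frame
    IDs start at 0 and are consecutive."""
    def __init__(self):
        self.heap = []      # (frame_id, frame) pairs, min-heap by frame_id
        self.next_id = 0    # next frame ID the consumer expects

    def put(self, frame_id, frame):
        heapq.heappush(self.heap, (frame_id, frame))

    def pop_ready(self):
        """Return all in-sequence frames currently available, in order."""
        out = []
        while self.heap and self.heap[0][0] == self.next_id:
            out.append(heapq.heappop(self.heap)[1])
            self.next_id += 1
        return out
```

If frame 1 finishes before frame 0, it is simply held until frame 0 arrives.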


Scheduler Design Tailoring to Codecs

In the MCSDK Video host application, a scheduler keeps track of certain DSP cores (depending on the scheduler topology, as described below) and dispatches tasks to those cores according to their task load.

Tailored to specific codecs, the following three scheduler topologies can be supported, along with anything in between.

  • The scheduler manages all DSP cores and dispatches tasks to all the cores. An example codec is the JPEG2000 decoder.


Scheduler_Topology1.JPG

  • The scheduler manages one master core in each DSP chip. High-level tasks are dispatched by the scheduler, while the master core in each DSP internally dispatches tasks to the slave cores.


Scheduler_Topology2.JPG

  • The scheduler manages and delegates tasks to a single master DSP core. The master core internally dispatches tasks to the slave cores. An example codec is the H.264BP encoder.


Scheduler_Topology3.JPG

The scheduler tracks the task load of its managed DSP cores by counting the queued tasks on each core. Initially, all managed cores have zero tasks queued. When a task is assigned to a core, its task count is incremented by one. When processed output from a core is picked up, its task count is decremented by one.

Based on the task load maintained as above, the scheduler in the MCSDK Video host application picks the least busy core for task dispatch, scanning from the first core.
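The load tracking and least-busy selection can be sketched in a few lines; the class and method names are hypothetical, not the MCSDK scheduler API.

```python
class Scheduler:
    """Track queued tasks per managed core and dispatch to the least busy
    one, scanning from the first core (ties go to the lowest index)."""
    def __init__(self, num_cores):
        self.load = [0] * num_cores   # initially zero tasks queued per core

    def dispatch(self):
        """Pick the least busy core and count the new task against it."""
        core = min(range(len(self.load)), key=lambda c: self.load[c])
        self.load[core] += 1
        return core

    def complete(self, core):
        """Processed output picked up from `core`: decrement its count."""
        self.load[core] -= 1
```

With three managed cores, four dispatches land on cores 0, 1, 2, 0; after core 1's output is picked up, the next dispatch goes back to core 1.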


Operation Details for Threads

With the major components of the MCSDK Video host application described above, this section presents the operation details for the four threads of the video host application.


Input Thread (File Reader Thread)

Calling Sequence of Input Thread
  • Allocate an x86 input buffer from the input buffer pool and place it into a frame descriptor
  • Process the iteration, file, and frame indices and open the correct input file
  • Read from the input file into the input buffer
  • If the frame size is met, or the file has ended, call the scheduler to get the destination node (i.e., DSP core) and send the frame descriptor over the input pipe; otherwise continue allocating new input buffers


Configuration of Input Thread
  • Pipe descriptor
  • Buffer pool handle
  • Number of iterations, files, and frames
  • Size of frame to read before sending over pipe
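The read loop above, which fills 4MB buffers until the configured frame size is met or the file ends, can be sketched as follows. The `read_frame` helper is hypothetical; the real thread fills pool chunks via a frame descriptor rather than returning byte strings.

```python
import io

def read_frame(stream, frame_size, chunk_size=4 * 1024 * 1024):
    """Read one frame from `stream` into a list of <=chunk_size buffers,
    mirroring the Input thread's allocate/read loop. Stops early if the
    file ends before the frame size is met."""
    bufs = []
    remaining = frame_size
    while remaining > 0:
        data = stream.read(min(chunk_size, remaining))
        if not data:
            break   # file ended
        bufs.append(data)
        remaining -= len(data)
    return bufs

# demo: a 10MB frame read from an in-memory "file" in 4MB chunks
demo = read_frame(io.BytesIO(b"\0" * (10 * 1024 * 1024)), 10 * 1024 * 1024)
```

A 10MB frame comes back as three buffers (4MB + 4MB + 2MB), ready to be grouped under one frame descriptor.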



DeviceIO Tx Thread

Calling Sequence of DeviceIO Tx Thread
  • Read a message from the input pipe (blocking); the pipe message contains the frame descriptor, device ID, node ID, and frame ID
  • Look up the node in the instance
  • Allocate an Input DeviceIO descriptor (which allocates device memory), and copy in the frame descriptor
  • Allocate an Output DeviceIO descriptor (which allocates device memory)
  • If the output device buffer is in PCIe memory, allocate an output frame descriptor
  • Provide the input frame to the DSP (map or DMA), and map the output frame if needed
  • If the input device buffer is not in PCIe memory, free the input frame descriptor
  • Compose the mailbox message
  • Poll for DMA completion, yielding between polls
  • Send the mailbox message to the DSP node
  • If no free buffers are available, yield, then try again (similar to DMA polling)

[note: need to add details for device memory. need to have separate description for mapping and DMA]



DeviceIO Rx Thread

Calling Sequence of DeviceIO Rx Thread
  • Query the mailboxes to find a node to unload
  • Read the mailbox message
  • Find the Input DeviceIO descriptor, based on freeBufIdD
  • If the input was mapped to PCIe memory, free the input frame descriptor buffers
  • Free the Input DeviceIO descriptor and the corresponding device buffer
  • Find the Output DeviceIO descriptor based on outBufPtr
  • If the output device buffer is not in PCIe space, allocate an output frame descriptor
  • Get the output frame buffers from the device (start DMA if needed; nothing if mapped)
  • Poll for DMA completion, yielding between polls
  • Call the scheduler to unload the node
  • Place the output frame in the Reorder Queue
  • Free the Output DeviceIO descriptor and the corresponding device buffer

[note: need to add details for device memory. need to have separate description for mapping and DMA]



Output Thread (File Writer or Display Thread)

Calling Sequence of Output Thread
  • Get the frame descriptor from the Reorder Queue
  • Process the iteration, file, frame, and tile counts to get the correct output file
  • Write the frame to file and free the frame descriptor buffers, or display the frames


Configuration of Output Thread
  • Reorder queue handle
  • Number of iterations, files, and frames


Developing Video Demos on the Quad C6678 PCIe Card

This section discusses the memory usage when developing video demos on the Quad C6678 PCIe card. It also goes over the data flow for the H264BP encoding and JPEG2000 decoding demos.



Memory Usage on the Quad C6678 PCIe Card

The Quad C6678 PCIe card contains four C6678 DSP chips. As shown below, each core has its local L1 and L2 memory. Each chip has 4MB of MSMC memory and 1GB of dedicated DDR memory which is not available to the other chips. Each chip also has a 256MB memory range which can be mapped to the memory of other devices connected through PCIe. In this demo, all C6678 devices map 128MB (0x60000000 – 0x67FFFFFF) of this memory space to the same global host x86 memory. This provides a global shared memory region which is available to both the host and all C6678 DSPs.
Memory_Overview.PNG

Furthermore, the Global Shared Memory is divided into multiple regions to allow data exchange between the x86 host and DSPs and communication between DSPs.

  1. Shared Memory (0x60000000 – 0x61FFFFFF): this area is used for communication between DSPs. The IVIDMC multichip implementation uses this area for shared memory and software barriers.
  2. Codec Output Scratch (0x62000000 – 0x62FFFFFF): this area is divided between all cores and is used for the output of the encoding algorithm. Once all cores have finished processing, the master core accumulates the outputs generated here to be placed in a host-supplied buffer.
  3. Host I/O Buffers (0x63000000 – 0x66FFFFFF): this area is owned by the host, which divides it into input and output buffers. The host writes data to the input buffer, supplies an input and output buffer pointer to the master core, and reads the output once the processing is complete.
  4. DMA Buffers (0x67000000 – 0x67FFFFFF): this memory region is used by the host when it triggers DMA transfers into the DDR memory on the DSPs.


Memory_Global_Shared1.PNG
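The four regions of the Global Shared Memory can be captured as a small lookup table, using the address ranges listed above; the region names and the `region_of` helper are illustrative.

```python
# Global Shared Memory layout on the Quad C6678 card (ranges from the text)
REGIONS = [
    ("shared_memory",       0x60000000, 0x61FFFFFF),  # inter-DSP shm/barriers
    ("codec_output_scratch", 0x62000000, 0x62FFFFFF), # per-core codec output
    ("host_io_buffers",     0x63000000, 0x66FFFFFF),  # host-owned I/O buffers
    ("dma_buffers",         0x67000000, 0x67FFFFFF),  # host-triggered DMA
]

def region_of(addr):
    """Return the name of the region containing `addr`, or None if the
    address falls outside the 128MB mapped window."""
    for name, lo, hi in REGIONS:
        if lo <= addr <= hi:
            return name
    return None
```

Such a table makes it easy to sanity-check buffer pointers exchanged between host and DSP before use.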



Data Flow of H264BP Encoding

The data flow of H264BP encoding is described below. DSP operations are differentiated using italic text. [need modification and more details]

  1. Host [Input Thread] allocates an input and output buffer from the Host Shared I/O Buffer region.
  2. Host [Input Thread] writes input frame to input buffer.
  3. Initially, all slave cores are waiting on a barrier for master core to enter the barrier while master core is polling on mailbox waiting to receive a host message requesting Encode.
  4. Host [DeviceIO Tx Thread] sends a message to the master core with the input and output buffer pointers and an input ID.
  5. Master core writes input buffer pointer to Shared Memory region and enters the software barrier, thus releasing the slave cores from the barrier.
  6. Slave cores acquire input pointer from Shared Memory region.
  7. All cores exit out of Barrier and process the input and write their output to the Codec Output Scratch region.
  8. Master core accumulates each core’s output and writes it to the output buffer pointer that was supplied by the host.
  9. Master core sends a message to the host with output buffer pointer, output ID, and 0 or more input IDs corresponding to input buffers which can be freed.
  10. Host [DeviceIO Rx thread] queries the master core's mailbox to get the message for the processed frame.
  11. Host [Output Thread] reads output from output buffer pointer, frees output buffer, and frees any input buffers as indicated by the message from the master core.



Data Flow of JPEG2000 Decoding

The data flow of JPEG2000 decoding is described below. DSP operations are differentiated using italic text. [need to add details]

Related Documents


Technical Support and Product Updates

For technical discussions and issues, please visit

Note: When asking for help in the forum you should tag your posts in the Subject with “MCSDK VIDEO”, the part number (e.g. “C6678”) and additionally the component (e.g. “NWAL”).


For product updates,

  • Visit MCSDK VIDEO (Multicore Video Infrastructure Demo Built on MCSDK): TBD

For Video codec products,


