
Common Issue Resulting in Slow External Memory Performance

From Texas Instruments Wiki

Introduction

Slow memory performance is a common issue for both ARM and DSP users. It can affect external DRAM (SDRAM, DDR2, DDR3, etc.) as well as accesses to SRAM or FPGAs through the GPMC/EMIF. It frequently comes up during processor evaluation or board bring-up. Typically the user is not running an operating system: they configure the memory interface and then run some very simple code that performs a series of reads and writes. This seemingly simple test setup is actually fraught with issues that many engineers are not aware of.

CPU Pipelining

Keep in mind that modern CPUs are heavily pipelined. In short, pipelining breaks up the task of executing instructions into many small sub-tasks that can be executed in parallel. These include fetching instructions, unpacking to determine instruction boundaries, decoding instructions, fetching the corresponding data, executing the actual instruction, and writing any results. The pipeline stages all move in lockstep, so if any given stage stalls, the entire CPU stalls.

Cache Architecture

In order to feed the pipeline efficiently, we need to try to avoid any bottlenecks. The main bottlenecks with respect to the CPU are waiting for instructions and waiting for data. An access to external memory is extremely slow and cannot come anywhere close to keeping the pipeline fed. For this reason, cache architectures are used in order to more effectively keep the pipeline filled with instructions and to reduce the access times for data.

There are two key items that need to be configured with regard to cache:

  1. Enabling the cache itself. (There are frequently controls for L1P, L1D, and L2.)
  2. Setting memory attributes for a given region of external memory (i.e. make it cacheable).

This is where we start to run into issues. Programmers doing a simple bare-metal test of the external memory interface are frequently just powering up the device, configuring PLLs, configuring the memory interface, and then running a test. Neither of the two cache configuration steps above gets done.
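The kind of test described above often looks something like the following minimal sketch. The function name and the data pattern are illustrative assumptions, not taken from any TI example. The point is that every access in these loops is an individual CPU load or store, so with the cache disabled the loop timing mostly reflects CPU stalls rather than what the memory interface can do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical naive test: write an address-derived pattern to every word,
 * then read it back. Returns the number of mismatching words (0 = pass).
 * On a target, 'base' would point at external memory; here it can be any
 * buffer, which makes the sketch runnable on a host as well. */
size_t naive_mem_test(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);
    size_t errors = 0;

    /* Write pass: one CPU store per word. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;

    /* Read pass: one CPU load per word. Uncached, each load stalls the
     * pipeline until the data comes back. */
    for (size_t i = 0; i < words; i++)
        if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
            errors++;

    return errors;
}
```

A loop like this is a reasonable functional check of the interface, but timing it says little about achievable bandwidth unless the cache and memory attributes have been configured first.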

Solutions

If you're using a CPU with external memory and the cache is not properly enabled, you will experience huge delays between accesses. The CPU will be stalled for prolonged periods of time waiting for data to come back. It won't even attempt to get the next data because the pipeline will be stalled.
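A rough back-of-the-envelope model shows why this hurts so much. The sketch below assumes the simplest possible behavior: the CPU issues one access, stalls for the whole round trip, and only then issues the next one. The function name and any latency figures plugged into it are hypothetical; real numbers depend on the device, clocks, and memory timing.

```c
#include <assert.h>

/* Simplified stall model (an assumption, not a measured characteristic):
 * one access is outstanding at a time and each one costs the full
 * round-trip latency. bytes / ns equals GB/s, so scale by 1000 for MB/s
 * (1 MB = 1e6 bytes). */
double stalled_throughput_mbps(double bytes_per_access, double latency_ns)
{
    assert(latency_ns > 0.0);
    return bytes_per_access / latency_ns * 1000.0;
}
```

For example, with a purely hypothetical 200 ns round trip, 4-byte accesses work out to about 20 MB/s. If the same round trip instead returned 32 bytes (ignoring the extra burst cycles), the model gives about 160 MB/s. Moving more data per stall is what the fixes below have in common.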

So how do you fix the issue? The main ways are:

  1. Use DMA! If your real goal is to benchmark the memory interface, your best option is to use the DMA. The DMA is designed for data movement and makes much better use of burst accesses, so it gives a far more accurate representation of the memory bandwidth.
  2. Configure the cache (correctly!). Configuring the memory attributes is generally the trickiest part of this process.
    • ARM: The memory attributes exist as part of the MMU page tables. So in order to have cacheable memory, it is necessary to set up MMU page tables and enable the MMU. When the MMU is disabled, the ARM defaults to "strongly ordered" accesses which are non-cacheable and non-bufferable, i.e. VERY SLOW.
    • DSP: The memory attributes exist as a set of "Memory Attribute Registers (MAR)". There is a MAR corresponding to any given region of memory (16MB regions) and a cacheability bit in each MAR.
  3. Use a larger data fetch instruction. For example, if you access a "long long" data type (64 bit), the GPMC will break that single access into multiple smaller bus accesses, so the CPU stalls once for the whole 64 bits rather than once per smaller access. ARM also has a "load multiple" instruction in assembly language which can be useful.
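The two memory-attribute mechanisms from item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not a complete setup sequence. The ARM part assumes the ARMv7-A short-descriptor format with 1 MB section entries; the DSP part assumes a C64x+/C66x-style layout with MAR0 at 0x01848000 and the cacheability bit in bit 0. All macro and function names are made up for this example, so check the reference manual for your device before relying on any of the values.

```c
#include <assert.h>
#include <stdint.h>

/* --- ARM: first-level section descriptor (assumed ARMv7-A short format) --- */
#define SECT_TYPE    (0x2u)                  /* bits [1:0] = 0b10: 1 MB section */
#define SECT_B       (1u << 2)               /* B: bufferable                   */
#define SECT_C       (1u << 3)               /* C: cacheable                    */
#define SECT_AP_RW   (0x3u << 10)            /* AP[1:0] = 0b11: full access     */
#define SECT_TEX(x)  ((uint32_t)(x) << 12)   /* TEX[2:0]                        */

/* Normal memory, write-back, write-allocate: TEX=0b001, C=1, B=1.
 * Domain field is left at 0. */
uint32_t arm_section_cached(uint32_t phys_addr)
{
    assert((phys_addr & 0x000FFFFFu) == 0);  /* sections are 1 MB aligned */
    return phys_addr | SECT_TEX(1) | SECT_AP_RW | SECT_C | SECT_B | SECT_TYPE;
}

/* --- DSP: Memory Attribute Registers (assumed C64x+/C66x-style layout) --- */
#define MAR_BASE  0x01848000u   /* assumed address of MAR0                  */
#define MAR_PC    0x1u          /* assumed cacheability bit, OR into MARn   */

/* Each MAR covers a 16 MB region, so the index is the top 8 address bits. */
uint32_t dsp_mar_index(uint32_t addr)
{
    return addr >> 24;
}

/* MARs are consecutive 32-bit registers, one per 16 MB region. */
uint32_t dsp_mar_reg_addr(uint32_t addr)
{
    return MAR_BASE + 4u * dsp_mar_index(addr);
}
```

Under these assumptions, arm_section_cached(0x80000000) yields 0x80001C0E, and memory at 0x80000000 falls under MAR128 at 0x01848200. On a real ARM target the descriptors still have to be written into a translation table, and the table base, domain access control, MMU enable, and cache enables all need to be programmed; on the DSP the L1/L2 caches must also be enabled. Those steps are device-specific and omitted here.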

Related to SRAM or FPGA

If you're testing performance to an SRAM or FPGA through the GPMC/EMIF, you might have a scenario where you need to use the CPU but you don't want the accesses to be cached. For example, if you've implemented a bunch of memory-mapped registers in an FPGA then you would clearly not want those to be cacheable. In this scenario there's no getting around the fact that reads are going to be very slow. For writes, however, you can make a substantial improvement. On ARM devices you can configure the memory range as "device" memory, which means that it is non-cacheable but it IS bufferable. In other words, writes become "fire and forget": they go into a write buffer and the CPU continues forward. You would only stall if you did so many consecutive writes that you filled the entire write buffer, in which case you would stall until a slot became free.
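As a minimal sketch of the "device" attribute described above, the following builds a 1 MB section descriptor with C=0 and B=1. It assumes the ARMv7-A short-descriptor format, where TEX=0b000, C=0, B=1 selects shareable Device memory. The macro and function names are hypothetical, and marking the region execute-never is an extra choice made here because register space should not be executed from.

```c
#include <assert.h>
#include <stdint.h>

#define SECT_TYPE    (0x2u)        /* bits [1:0] = 0b10: 1 MB section      */
#define SECT_B       (1u << 2)     /* B: bufferable (writes can be posted) */
#define SECT_XN      (1u << 4)     /* XN: execute-never                    */
#define SECT_AP_RW   (0x3u << 10)  /* AP[1:0] = 0b11: full access          */

/* Device memory: TEX=0b000, C=0, B=1. Non-cacheable but bufferable, so
 * reads still pay the full latency while writes go into the write buffer. */
uint32_t arm_section_device(uint32_t phys_addr)
{
    assert((phys_addr & 0x000FFFFFu) == 0);  /* sections are 1 MB aligned */
    return phys_addr | SECT_AP_RW | SECT_XN | SECT_B | SECT_TYPE;
}
```

For a region at a hypothetical base of 0x01000000 this yields 0x01000C16. As with cacheable memory, the attribute only takes effect once the descriptor is in the translation table and the MMU is enabled.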
