
C6457 Improvements Over C6455


Important Note:

The information on this wiki is no longer being actively developed. This wiki is in maintenance mode and the device architecture is supported on C64x Multi-core E2E forum


Cache Improvements in TMS320C6457 when compared to TMS320C6455


The TMS320C6457 device contains a 32KB Level 1 Program Memory (L1P), a 32KB Level 1 Data Memory (L1D) and a 2048KB Level 2 Memory (L2) just like the C6455.

The biggest improvement in the C6457 is the L2 cache size. Up to 1MB of its L2 memory can be configured as cache, whereas the C6455 allows a maximum of 256KB.

This is a 4x increase in the maximum L2 cache size. With the larger cache, read misses become less frequent for applications whose active data set does not fit in 256KB, which can improve performance significantly.

The figure below compares the L2 memory of the C6457 and the C6455.


Figure1.JPG




L2 Cache Structure

The L2 cache is organized as a 4-way set-associative cache with a read- and write-allocate protocol (L1 has a read-allocate-only protocol). A set-associative cache has multiple ways, so several lines that map to the same set can be held at once, which reduces the probability of conflict misses. L2 has four of these ways. A line is allocated in L2 cache on either a read miss or a write miss. A line is allocated in L1 cache only on a read miss. On an L1D write miss, the data is simply written to L2 through a write buffer, bypassing L1D.

A larger active data set in cache has been observed to increase the performance of some control-intensive applications. In these applications, e.g. MAC layer processing and linked list processing, the active data set tends to be large, so a large cache is beneficial. The C6455 has at most 256KB of L2 cache, whereas the C6457 can have up to 1MB. In some packet processing and MAC layer applications, an improvement of up to 30% was seen on the C6457 when compared to the C6455.

The cache organization is the same in the C6455 and the C6457, but because of the larger cache size in the C6457, more lines are cached per way. In their maximum cache size configurations, the number of lines per way is 512 for the C6455 and 2048 for the C6457. The figure below shows the L2 organization for the C6457.

[Figure: L2 cache organization for C6457]
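The lines-per-way figures quoted above can be checked with a short calculation. This is a minimal sketch, assuming the 128-byte L2 line size mentioned in the cacheable write section of this article. The function and constant names are illustrative only and do not come from any TI library.

```python
# Illustrative names only; not part of any TI software package.
LINE_SIZE_BYTES = 128  # L2 line size, as stated in the cacheable write section
WAYS = 4               # L2 is 4-way set-associative
KB = 1024


def lines_per_way(cache_size_bytes, ways=WAYS, line_size=LINE_SIZE_BYTES):
    """Return the number of lines held in each way (equal to the number of sets)."""
    return cache_size_bytes // (ways * line_size)


c6455_lines = lines_per_way(256 * KB)    # maximum L2 cache on C6455
c6457_lines = lines_per_way(1024 * KB)   # maximum L2 cache on C6457
print(c6455_lines, c6457_lines)          # 512 2048
```

The 4x increase in cache size translates directly into a 4x increase in the number of sets, since the associativity and line size are unchanged.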

For more details on the cache architecture, please read the C64x+ DSP User’s Guide. (http://focus.ti.com/lit/ug/spru862b/spru862b.pdf)

L2 Cache Read/Write Miss:

Whenever a CPU read or write misses in L2 cache, the CPU stalls while the cache controller retrieves the data from external memory. The external miss penalty depends on the type and width of the external memory that holds the data, as well as on other aspects of system loading. Allocating a new line in L2 cache can also trigger a victim write-back, which is explained below.

When there is a read/write miss in L2 cache and a line has to be evicted to accommodate the newly accessed line, the evicted data needs to be written back to its lower level of memory if it contains CPU updated data (dirty data). This is needed to maintain memory consistency. When this updated data is evicted from L2 cache, the cache controller first moves the data to a victim buffer which serves as the buffer between the L2 and external memory.

Once the data is moved to the victim buffer, the evicted line can be re-allocated with the read miss data. In the C6455, the victim buffer is emptied to external memory strictly after the re-allocation from external memory has completed. In the C6457, the re-allocation from external memory takes place while the victim buffer is being emptied into external memory. Since every external memory access costs a considerable number of stall cycles, this pipelining may reduce stalls, depending on the application.

For example, when there are back-to-back dirty write-backs from L2 cache to external memory, the C6457 may not incur any additional stalls because the write-backs are pipelined (i.e. the new line allocation overlaps with the emptying of the victim buffer), whereas the C6455 may incur additional stalls in this scenario.
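The effect of overlapping the line fill with the victim write-back can be illustrated with a deliberately simplified model. This is a sketch under stated assumptions, not a cycle-accurate description of either device: the latency values are hypothetical, full overlap is assumed in the pipelined case, and the function names are made up for this example.

```python
# Idealized model: every miss evicts a dirty line, and external memory is the bottleneck.
# All cycle counts below are hypothetical placeholders, not measured device figures.

def serialized_cycles(n_misses, fetch_cycles, writeback_cycles):
    """Fill and victim write-back happen one after the other for each miss."""
    return n_misses * (fetch_cycles + writeback_cycles)


def overlapped_cycles(n_misses, fetch_cycles, writeback_cycles):
    """Fill and victim write-back overlap fully; the longer one sets the pace."""
    return n_misses * max(fetch_cycles, writeback_cycles)


FETCH = 100      # hypothetical cycles to fetch one line from external memory
WRITEBACK = 80   # hypothetical cycles to drain one victim line

print(serialized_cycles(4, FETCH, WRITEBACK))  # 720
print(overlapped_cycles(4, FETCH, WRITEBACK))  # 400
```

In this model the benefit only appears when misses evict dirty lines back to back. With clean victims there is no write-back to overlap, which is consistent with the statement above that the gain depends on the application.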

The Dirty Write Back Process is explained in the figure below.



L2 Dirty Write Back.JPG


Bridge-less Access to DDR2:

Another improvement in the C6457 is a 128-bit EMIF bus from the Switched Central Resource (SCR) to the external memory. On the C6455, there is a bridge between the SCR and the DDR2 memory controller, and only a 64-bit bus width is available. The figure below shows the EMIF access to DDR2 on the C6457 device. Because of the 128-bit bus width on the C6457, an access from L2 to external memory has lower latency than on the C6455 when fetching data into the L2 cache. In addition, the C6455 supports DDR2-533 whereas the C6457 supports DDR2-667, which results in faster access to external memory.




DDR2 SCR Bridge.JPG
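As a rough illustration of the bus width difference, the number of bus transfers (beats) needed to move one L2 cache line can be computed for each device. This sketch assumes the 128-byte L2 line size stated in the cacheable write section and considers only the bus width between the SCR and the external memory interface. It ignores bridge latency, DDR2 timing and arbitration, and the helper name is hypothetical.

```python
def beats_per_line(line_bytes, bus_width_bits):
    """Number of bus transfers needed to move one cache line over the given bus."""
    return line_bytes // (bus_width_bits // 8)


LINE_BYTES = 128  # L2 cache line size

print(beats_per_line(LINE_BYTES, 64))    # C6455, 64-bit bus: 16
print(beats_per_line(LINE_BYTES, 128))   # C6457, 128-bit bus: 8

# Ratio of the DDR2 data rates named above (DDR2-667 vs DDR2-533).
ddr2_rate_ratio = 667 / 533
print(f"{ddr2_rate_ratio:.2f}")          # 1.25
```

Halving the beats per line and raising the DDR2 data rate by about 25% both shorten a line fill, but the actual latency gain depends on factors this calculation leaves out.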


Cacheable Write Performance of C6457 when compared to C6455:

The L2 cache is a write-allocate cache. On a write miss, it first reads the 128-byte line containing the accessed data into the cache and then modifies the data in the L2 cache. The modified line is written back to external memory when a cache conflict evicts it or when a manual write-back is issued. When the memory stride (the address increment between successive writes) is 1024 bytes or larger, the cycle count for write operations increases dramatically because conflicts happen frequently at large strides. In that case, every write operation may result in both a cache line write-back (for the conflict) and a cache line read (for the write-allocate).
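The interaction between stride, write-allocate and conflict evictions can be sketched with a toy cache model. This is not the benchmark behind the figure and does not reproduce the 1024-byte threshold, which depends on the benchmark's buffer sizes. It assumes LRU replacement and counts only line reads and dirty write-backs for a single pass of strided writes. The class and function names are hypothetical.

```python
from collections import OrderedDict


class ToyL2:
    """Toy 4-way set-associative, write-allocate cache (LRU replacement assumed)."""

    def __init__(self, size_bytes, ways=4, line_bytes=128):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.line_reads = 0   # fills caused by write-allocate
        self.writebacks = 0   # dirty lines evicted to external memory

    def write(self, addr):
        line = addr // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)      # hit: mark as most recently used
            return
        if len(cache_set) == self.ways:
            cache_set.popitem(last=False)    # evict LRU line; it is dirty (writes only)
            self.writebacks += 1
        self.line_reads += 1                 # write-allocate: fetch the line first
        cache_set[line] = True


def run(stride, n_writes, size_bytes):
    """Issue n_writes writes at the given stride; return (line_reads, writebacks)."""
    cache = ToyL2(size_bytes)
    for i in range(n_writes):
        cache.write(i * stride)
    return cache.line_reads, cache.writebacks


KB = 1024
print(run(64, 4096, 256 * KB))     # (2048, 0)
print(run(128, 4096, 256 * KB))    # (4096, 2048)
print(run(1024, 4096, 256 * KB))   # (4096, 3840)
print(run(128, 4096, 1024 * KB))   # (4096, 0)
```

In this model, a stride at or above the line size makes every write a miss, and larger strides concentrate those misses on fewer sets, so almost every write also evicts a dirty line. The last call shows the same access pattern fitting without evictions in a 1MB cache.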

The dirty cache line write back performance of C6457 is improved when compared to C6455. Figure 5 compares the cacheable write performance of both these devices.

The figure below shows that for large memory strides, the cacheable write performance of the C6457 is much better than that of the C6455 because of the improved dirty cache line write-back in the C6457.



C6457 Benchmark.JPG
