MCSDK HPC 3.x FFTW Library
The TI FFTW Library is a library with the FFTW API, optimized for TI K2H ARM+DSP platforms. The FFTW API and documentation can be found at http://www.fftw.org/.
Building TI FFTW Library
This section explains how the TI FFTW library is built. The build is part of the MCSDK HPC build process, so users do not need to repeat it.
- Building FFTW
The TI FFTW library build depends on a compiled and installed FFTW 3.3.4, in single and double precision, for ARM/Linux.
- Building the Accelerated FFTW Library for K2H
To accelerate certain FFTW functions with the C66x DSP on the K2H platform, those functions are renamed in the library archive. For example, fftw_plan_dft_1d is renamed to __real_fftw_plan_dft_1d, and a TI implementation of fftw_plan_dft_1d is added to the library. This function determines, based on the FFT size, whether a plan can be accelerated by the DSP. If the FFT size is supported by the C66x DSP and the DSP offers a performance advantage, the FFT is accelerated by the DSP. Otherwise, the native ARM FFTW implementation, __real_fftw_plan_dft_1d, is used.
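The dispatch described above can be sketched in a few lines of C. This is a minimal illustration, not the TI implementation: `fft_target`, `pick_target`, and the two flags are hypothetical names, and this page does not say how the library decides that the DSP offers a performance advantage.

```c
/* Hypothetical illustration of the plan dispatch; not TI source code. */
typedef enum { RUN_ON_ARM = 0, RUN_ON_DSP = 1 } fft_target;

/*
 * size_supported: nonzero if the C66x DSP supports this FFT size.
 * dsp_faster:     nonzero if the DSP is expected to outperform the ARM.
 * Both inputs are assumptions standing in for the library's internal checks.
 */
fft_target pick_target(int size_supported, int dsp_faster)
{
    if (size_supported && dsp_faster) {
        return RUN_ON_DSP;   /* path taken by the TI fftw_plan_dft_1d */
    }
    return RUN_ON_ARM;       /* fall back to __real_fftw_plan_dft_1d */
}
```

Because the TI function keeps the original name, the application code calling fftw_plan_dft_1d stays unchanged whichever path is taken.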
Object Libraries
After installation, the TI FFTW object libraries are located in /usr/lib. There are two libraries:
- libfftw_acc.a
- libfftwf_acc.a
Header Files
After installation, the TI FFTW header file fftw3.h is located in /usr/include.
Using TI FFTW Library
The TI FFTW library uses exactly the same API as the FFTW library, so no source code change is required for a user application that uses the FFTW API. Users only need to link with libfftw_acc.a (double precision) and/or libfftwf_acc.a (single precision), which are provided with the MCSDK HPC release.
Run Time Configuration
TI FFTW can be configured to run on either ARM or DSP (offloading).
The run time configuration is done through the following environment variables:
- TI_FFTW_OFFLOAD: to configure FFTW (double precision) offloading.
  - 0: no offloading to DSP, i.e. always running on ARM
  - 1: forced offloading to DSP, i.e. always running on DSP
  - 2: optimum offloading to DSP based on FFT sizes in order to achieve best performance in terms of execution time
  - Default offloading configuration, when environment variable TI_FFTW_OFFLOAD is not set, is 2.
- TI_FFTWF_OFFLOAD: to configure FFTWF (single precision) offloading.
  - 0: no offloading to DSP, i.e. always running on ARM
  - 1: forced offloading to DSP, i.e. always running on DSP
  - 2: optimum offloading to DSP based on FFT sizes in order to achieve best performance in terms of execution time
  - Default offloading configuration, when environment variable TI_FFTWF_OFFLOAD is not set, is 2.
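The way such a variable maps to one of the three modes can be illustrated with a small C helper. This is a sketch only: `read_offload_mode` is a hypothetical name, and treating an empty or out-of-range value like an unset variable is an assumption, since this page does not describe how the library handles invalid input.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 0, 1, or 2 for the named variable. */
int read_offload_mode(const char *var_name)
{
    const char *value = getenv(var_name);
    char *end = NULL;
    long mode;

    if (value == NULL || *value == '\0') {
        return 2;   /* documented default when the variable is not set */
    }
    mode = strtol(value, &end, 10);
    if (*end != '\0' || mode < 0 || mode > 2) {
        return 2;   /* assumption: invalid input is treated like unset */
    }
    return (int)mode;
}
```

From a shell, a run with double-precision offloading disabled would look like `TI_FFTW_OFFLOAD=0 ./my_app`, where `my_app` is a placeholder for the user application.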
Benchmarks and Size Limitations for DSP Accelerated Functions
In the FFTW 3.0.0 release, the following FFTW functions/types are accelerated by the DSP. For each accelerated function, this section lists the FFT sizes that the C66x DSP can accelerate and compares the performance of native FFTW executed on ARM (single-core measurement) with that of the DSP-accelerated version.
For all c2r/r2c functions, the DSP supports only out-of-place operation. In-place operation is NOT supported.
The benchmark information found here can also be generated by running ./genstats in each FFTW example directory included in the MCSDK HPC release.
The benchmarks were obtained on an HP ProLiant m800 card. Both the ARM and the DSP run at a 1 GHz clock rate. The C66x DSP L2 is configured as all SRAM, and L1 is configured as all cache.
- 1-D Double-precision Complex-to-Complex benchmark
  - FFT size N needs to be a multiple of 64 and positive.
  - FFT maximum size limit is 1048576.
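The two conditions above can be written as a one-line predicate. This is a sketch for checking sizes in application code; `dsp_ok_1d_c2c_double` is a hypothetical name and is not part of the TI FFTW API.

```c
/* Returns 1 if n meets the 1-D double-precision c2c limits listed above. */
int dsp_ok_1d_c2c_double(long n)
{
    return n > 0 && n % 64 == 0 && n <= 1048576L;
}
```

For example, 4096 passes the check, while 1000 (not a multiple of 64) and 2097152 (above the maximum) do not.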
- 2-D Double-precision Complex-to-Complex benchmark
  - FFT sizes for both dimensions N1 and N2 are multiples of 32 and positive.
  - FFT maximum size limits for both dimensions are 1024.
- 3-D Double-precision Complex-to-Complex benchmark
  - FFT sizes for all dimensions N1, N2 and N3 are multiples of 32 and positive.
  - FFT maximum size limits for all dimensions are 192.
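The 2-D and 3-D complex-to-complex limits share one shape: every dimension must be a positive multiple of some value and must not exceed a maximum. A minimal sketch of that check follows; `dsp_dims_ok` and its parameters are hypothetical names, not library functions.

```c
#include <stddef.h>

/*
 * Returns 1 if every entry of dims[0..rank-1] is a positive multiple of
 * `multiple` and does not exceed `max_size`; returns 0 otherwise.
 */
int dsp_dims_ok(const int *dims, size_t rank, int multiple, int max_size)
{
    size_t i;

    for (i = 0; i < rank; ++i) {
        if (dims[i] <= 0 || dims[i] % multiple != 0 || dims[i] > max_size) {
            return 0;
        }
    }
    return 1;
}
```

For the double-precision cases above, the call would use `multiple = 32` with `max_size = 1024` for 2-D and `max_size = 192` for 3-D.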
- 1-D Double-precision Real-to-Complex/Complex-to-Real benchmark
  - FFT size N needs to be a multiple of 128 and positive.
  - FFT size maximum limit is 1048576.
- 2-D Double-precision Real-to-Complex/Complex-to-Real benchmark
  - The FFT size of the first dimension is a multiple of 32 and positive; the FFT size of the second dimension is a multiple of 64 and positive.
  - FFT maximum size limits for both dimensions are 1024.
- 3-D Double-precision Real-to-Complex/Complex-to-Real benchmark
  - FFT sizes for first two dimensions N1 and N2 are multiples of 32 and positive.
  - FFT size for third dimension N3 is a multiple of 64 and positive.
  - FFT maximum size limits for all dimensions are 192.
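For the real-to-complex and complex-to-real cases, the required multiple differs per dimension, and the transform must also be out-of-place. The sketch below combines both conditions. `dsp_r2c_ok` is a hypothetical helper, and comparing the two pointers only detects exact aliasing of the input and output buffers, which is an assumption about what "in-place" means here.

```c
#include <stddef.h>

/*
 * dims[i] must be a positive multiple of multiples[i] and at most max_size.
 * `in` and `out` are the input and output buffers passed to the planner.
 */
int dsp_r2c_ok(const int *dims, const int *multiples, size_t rank,
               int max_size, const void *in, const void *out)
{
    size_t i;

    if (in == out) {
        return 0;   /* in-place r2c/c2r is not supported by the DSP */
    }
    for (i = 0; i < rank; ++i) {
        if (dims[i] <= 0 || dims[i] % multiples[i] != 0 ||
            dims[i] > max_size) {
            return 0;
        }
    }
    return 1;
}
```

For the 3-D double-precision case above, `multiples` would be `{32, 32, 64}` and `max_size` would be 192.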
- 1-D Single-precision Complex-to-Complex benchmark
  - FFT size N needs to be a multiple of 64 and positive.
  - FFT maximum size limit is 1048576.
- 2-D Single-precision Complex-to-Complex benchmark
  - FFT sizes for both dimensions N1 and N2 are multiples of 64 and positive.
  - FFT maximum size limits for both dimensions are 2048.
- 3-D Single-precision Complex-to-Complex benchmark
  - FFT sizes for all dimensions N1, N2 and N3 are multiples of 64 and positive.
  - FFT maximum size limits for all dimensions are 256.
- 1-D Single-precision Real-to-Complex/Complex-to-Real benchmark
  - FFT size N needs to be a multiple of 128 and positive.
  - FFT size maximum limit is 1048576.
- 2-D Single-precision Real-to-Complex/Complex-to-Real benchmark
  - The FFT size of the first dimension is a multiple of 64 and positive; the FFT size of the second dimension is a multiple of 128 and positive.
  - FFT maximum size limits for both dimensions are 2048.
- 3-D Single-precision Real-to-Complex/Complex-to-Real benchmark
  - FFT sizes for all dimensions N1, N2 and N3 are multiples of 64 and positive.
  - FFT maximum size limits for all dimensions are 256.
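The twelve non-batched limits listed so far can be collected into one lookup table, which may be convenient when an application wants to predict whether a plan is eligible for offloading. The sketch below transcribes the values from this page; the type `dsp_fft_limit` and the functions `find_limit` and `limit_allows` are hypothetical and not part of the TI FFTW API.

```c
#include <stddef.h>

typedef struct {
    int is_single;     /* 0 = double precision, 1 = single precision */
    int is_real;       /* 0 = c2c, 1 = r2c/c2r */
    int rank;          /* number of dimensions: 1, 2, or 3 */
    int multiple[3];   /* required multiple for each dimension */
    long max_size;     /* maximum size of each dimension */
} dsp_fft_limit;

/* Values transcribed from the lists on this page. */
static const dsp_fft_limit DSP_LIMITS[] = {
    {0, 0, 1, {64, 0, 0},    1048576L},
    {0, 0, 2, {32, 32, 0},   1024L},
    {0, 0, 3, {32, 32, 32},  192L},
    {0, 1, 1, {128, 0, 0},   1048576L},
    {0, 1, 2, {32, 64, 0},   1024L},
    {0, 1, 3, {32, 32, 64},  192L},
    {1, 0, 1, {64, 0, 0},    1048576L},
    {1, 0, 2, {64, 64, 0},   2048L},
    {1, 0, 3, {64, 64, 64},  256L},
    {1, 1, 1, {128, 0, 0},   1048576L},
    {1, 1, 2, {64, 128, 0},  2048L},
    {1, 1, 3, {64, 64, 64},  256L},
};

/* Returns the matching table entry, or NULL if there is none. */
const dsp_fft_limit *find_limit(int is_single, int is_real, int rank)
{
    size_t i;

    for (i = 0; i < sizeof DSP_LIMITS / sizeof DSP_LIMITS[0]; ++i) {
        const dsp_fft_limit *l = &DSP_LIMITS[i];
        if (l->is_single == is_single && l->is_real == is_real &&
            l->rank == rank) {
            return l;
        }
    }
    return NULL;
}

/* Returns 1 if dims[0..rank-1] satisfy the given limit entry. */
int limit_allows(const dsp_fft_limit *limit, const long *dims)
{
    int d;

    if (limit == NULL) {
        return 0;
    }
    for (d = 0; d < limit->rank; ++d) {
        if (dims[d] <= 0 || dims[d] % limit->multiple[d] != 0 ||
            dims[d] > limit->max_size) {
            return 0;
        }
    }
    return 1;
}
```

The table covers size limits only. The out-of-place requirement for r2c/c2r transforms and the performance-based decision in offload mode 2 are separate conditions.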
- 1-D Double-precision Batched FFT Complex-to-Complex benchmark
  - Only supports 1-D batched FFT.
  - Batch size is at least 8.
  - FFT size is a multiple of 4 and positive.
  - FFT maximum size limit is 8192.
  - Input data resides in consecutive memory space (istride=1, idist=N).
  - Output data resides in consecutive memory space (ostride=1, odist=N).
- 1-D Double-precision Batched FFT Real-to-Complex/Complex-to-Real benchmark
  - Only supports 1-D batched FFT.
  - Batch size is at least 8.
  - FFT size is a multiple of 8 and positive.
  - FFT maximum size limit is 8192.
  - Input data resides in consecutive memory space (istride=1, idist=N).
  - Output data resides in consecutive memory space (ostride=1, odist=N).
- 1-D Single-precision Batched FFT Complex-to-Complex benchmark
  - Only supports 1-D batched FFT.
  - Batch size is at least 8.
  - FFT size is a multiple of 8 and positive.
  - FFT maximum size limit is 16384.
  - Input data resides in consecutive memory space (istride=1, idist=N).
  - Output data resides in consecutive memory space (ostride=1, odist=N).
- 1-D Single-precision Batched FFT Real-to-Complex/Complex-to-Real benchmark
  - Only supports 1-D batched FFT.
  - Batch size is at least 8.
  - FFT size is a multiple of 16 and positive.
  - FFT maximum size limit is 16384.
  - Input data resides in consecutive memory space (istride=1, idist=N).
  - Output data resides in consecutive memory space (ostride=1, odist=N).
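The batched conditions above can be checked together before a plan is created. In the sketch below, the parameter names mirror those of the FFTW advanced interface (`howmany`, `istride`, `idist`, `ostride`, `odist`), while `dsp_batched_ok` itself is a hypothetical helper. It follows the conditions on this page literally, including odist=N for every variant.

```c
/*
 * Returns 1 if a 1-D batched transform of size n meets the listed
 * conditions. `multiple` and `max_size` depend on precision and type,
 * for example 4 and 8192 for double-precision complex-to-complex.
 */
int dsp_batched_ok(int n, int howmany,
                   int istride, int idist,
                   int ostride, int odist,
                   int multiple, int max_size)
{
    return howmany >= 8 &&
           n > 0 && n % multiple == 0 && n <= max_size &&
           istride == 1 && idist == n &&
           ostride == 1 && odist == n;
}
```

A batch of 8 contiguous double-precision complex FFTs of size 1024 passes this check, while the same batch with `istride = 2` or with only 4 transforms does not.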