EBC Exercise 18 Using the DSP for Audio Processing

In the previous exercise you saw how to bring audio into the Beagle and send it out again. You also did some processing on the audio. All this was done on the ARM processor. The DM3730 on the BeagleBoard has both an ARM processor and a C64x fixed-point DSP. This exercise shows you how to use the DSP via C6Run. C6Run is a set of tools which will take in C files and generate either an ARM executable, or an ARM library which will leverage the DSP to execute the C code.

There are two uses of C6Run, exposed through two different front-end scripts. They are called C6RunLib and C6RunApp. We focus on C6RunLib here.

Examine the Files
Make sure you have the most up to date versions of the files.

beagle$ cd ~/exercises
beagle$ git pull


 * Change directories to AudioThru/lab06d_audio_c6run.

beagle$ cd lab06d_audio_c6run
beagle$ ls

These are the same files as used in the previous audio thru lab. The files audio_process.c and audio_process.h are new. audio_thread.c has been changed slightly. Makefile is completely different.

Edit audio_thread.c and search for audio_process. There are two occurrences. I'll talk about the first one later; go to the second.

beagle$ gedit audio_thread.c

audio_process((short *)outputBuffer, (short *)inputBuffer, blksize/2);

I've replaced the call to memcpy with the call to audio_process. Pointers to the input and output buffers are passed along with the number of samples in the buffers. Note blksize is the number of 8-bit chars in the buffers. We're working with 16-bit samples, so the pointers are cast to short * and the count passed is blksize/2.

Look at audio_process.c. Presently all it does is a memcpy when called.

Make
Before the first make the paths need to be set up. Do this:

beagle$ git clone git://github.com/MarkAYoder/c6run_build.git
beagle$ source c6run_build/environment.sh
beagle$ c6run_build/loadmodules.sh

We are now dealing with two C compilers, one for the ARM and the other for the DSP. Sourcing environment.sh sets the PATHs for each of the compilers. Take a look at environment.sh to see the details. The most interesting part is PLATFORM_CFLAGS, which we'll discuss later. The second script (loadmodules.sh) loads some kernel modules that are needed to support the DSP. Take a look at it; we'll explain some of it later.

Be sure to source the environment.sh file every time you start a new terminal. Run loadmodules.sh every time you reboot your Beagle.

A simple example
beagle$ cd c6run_build/examples/c6runapp/hello_world

Run make, but do it like this so you can see what it creates:

beagle$ make clean
beagle$ ls
beagle$ gedit hello_world.c &
beagle$ time make
beagle$ ls -sh
beagle$ ./hello_world_arm
beagle$ ./hello_world_dsp

hello_world_arm runs on the ARM and hello_world_dsp runs on the DSP. Which is faster? Why?

Get the HelloWorld.c from the gitLearn exercise and compile and run it. Do you get different results comparing the ARM and the DSP? Why?

The audio example
beagle$ cd ~/exercises/audioThru/lab06d_audio_c6run
beagle$ make clean
beagle$ ls
beagle$ time make
beagle$ ls -sh

My make takes about 20 seconds to compile everything. What new things do you see?

There are 3 groups of files here:
 * 1) The source code (*.c and *.h),
 * 2) object files for running on the ARM (gpp) only (audioThru_arm, gpp, gpp_lib) and
 * 3) object files for running on the DSP (audioThru_dsp, dsp, dsp_lib).

What's significant is that the same source code produced both sets of objects (dsp, gpp). How the source is compiled determines where it runs. If you run ./audioThru_arm the code runs only on the ARM. Try it. It should run just like before.

Running the DSP
Now try running ./audioThru_dsp. This code runs on the ARM, mostly. The function(s) in audio_process.c run on the DSP. Here's what's happening.
 * 1) When you run ./audioThru_dsp, main.c and audio_thread.c run on the ARM as before.
 * 2) C6Run has inserted a stub for audio_process so that when it is called the first time on the ARM, it checks to see if the DSP has been initialized.  If not, the ARM loads the code for audio_process on the DSP and then passes the parameters to the DSP and tells it to run the code.
 * 3) Notice the line: Starting DSP...1.5XXX s.  This is printed after the first call to audio_process.  Starting the DSP takes about 1.5s or so. This is why the dummy call to audio_process appears before the loop.  If we waited until inside the while loop to start the DSP, the ALSA output would have a buffer underrun while waiting for the DSP to start. Fortunately it only has to start once.
 * 4) The ARM waits for the code to complete on the DSP (i.e. it blocks so other processes on the ARM can run.)
 * 5) Once the DSP completes, the ARM continues running.
 * 6) Subsequent calls to audio_process need only pass the parameters and tell the DSP to go.

Details on Using C6Run
So we have a working example of how to use the DSP, but a lot of details have been skipped. Here are some of them.

Inside Makefile
So how did the Beagle know what to run on the ARM and what to run on the DSP? The answer is in the Makefile. Take a look at it. The first section sets up the PATHs and FLAGS for the ARM compiler. The next section does the same for the DSP compiler. One interesting flag is --C6Run:replace_malloc. We'll discuss it under Sharing Memory. The part of the Makefile that follows is the important one.

# List the files to run on the ARM here
EXEC_SRCS := main.c audio_input_output.c audio_thread.c
EXEC_ARM_OBJS := $(EXEC_SRCS:%.c=gpp/%.o)
EXEC_DSP_OBJS := $(EXEC_SRCS:%.c=dsp/%.o)

# List the files to run on the DSP here
LIB_SRCS := audio_process.c
LIB_ARM_OBJS := $(LIB_SRCS:%.c=gpp_lib/%.o)
LIB_DSP_OBJS := $(LIB_SRCS:%.c=dsp_lib/%.o)

Here is where you tell which files run on the ARM and which on the DSP. EXEC_SRCS is a list of the .c files that run on the ARM. LIB_SRCS is the list for the DSP. It's that easy.

Further down you see the rules for building the ARM only code. Look them over until you understand what they are doing. The next section is for the DSP. Notice the ARM_CC is used for the files listed in EXEC_DSP_OBJS and the C6RUN_CC is used for those in LIB_DSP_OBJS.

The last thing to note is at the very end. If the variable DUMP is defined, the values of many of the Makefile variables are displayed.

Sharing Memory
The ARM and the DSP share memory, so when we called audio_process all we had to do was pass pointers to the buffers we wanted to process. There was no need to copy from one processor to another, so there is very little overhead. However, there are some details that were handled for you that you need to know.

The ARM uses a memory management unit (MMU) that maps virtual addresses to physical addresses. The DSP doesn't have an MMU. That means the pointers on the ARM (outputBuffer, inputBuffer) point to a virtual address and the pointers on the DSP (outputBuffer, inputBuffer) point to physical addresses, which probably aren't the same. C6Run automatically provided the code needed to map from the virtual address space to the physical.

But there is a bigger problem. outputBuffer and inputBuffer were allocated at run time using the standard C routine malloc. malloc allocates contiguous memory of the desired size; however it is contiguous in the virtual space, but probably not contiguous in the physical space. This causes problems for the DSP.

The solution is in the loader flag --C6Run:replace_malloc noted earlier. This tells the loader to replace malloc with cmem. cmem is an API and library for managing one or more blocks of physically contiguous memory. It also provides address translation services (e.g. virtual to physical translation). If you are uncomfortable with replacing all mallocs with cmem, you can remove the --C6Run:replace_malloc flag and call C6RUN_MEM_malloc(N * sizeof(short)) in your code when you have memory to share with the DSP.

So where does cmem allocate the memory? There is a section of the Beagle's RAM that Linux doesn't control. Try this:

beagle$ cat /proc/cmdline
console=tty0 console=ttyS2,115200n8 consoleblank=0 mpurate=auto buddy=none camera=lbcm3m1 vram=24M omapfb.mode=dvi:hd720 mem=99M@0x80000000 mem=384M@0x88000000 omapfb.vram=0:12M,1:8M,2:4M omapdss.def_disp=dvi root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait

What you see are the arguments that were passed to the Linux kernel when it first started. The mem= arguments are telling it what memory it can use. 99M bytes start at 0x8000 0000 and another 384M start at 0x8800 0000. Where does the first block of addresses end?

beagle$ bc
obase=16
99*1024*1024
6300000
ibase=16
6300000+80000000
86300000

So the block ends at 0x86300000. cmem controls the memory between that and 0x8800 0000. How does cmem know that? Look in ~/c6run_build/loadmodules.sh. You find:

DSP_REGION_START_ADDR="0x86300000"
DSP_REGION_END_ADDR="0x88000000"
CMEMK_OPTS="phys_start=$DSP_REGION_START_ADDR phys_end=$DSP_REGION_END_ADDR allowOverlap=1"
# Insert CMEM as all heap (only a portion will actually be used as such)
modprobe cmemk ${CMEMK_OPTS}

Those START and END numbers look familiar. This is the code that installs the cmem kernel module. When it is installed, it's told what memory region it can use.

Here are some notes about how big the region must be. Check it out before you make changes.

How does C6Run know what memory is used? Look in ~/c6run_build/environment.sh. At the bottom you'll see:

PLATFORM_CFLAGS='-DDSP_REGION_BASE_ADDR=0x86300000 -DDSP_REGION_CMEM_SIZE=0x01000000 -DDSP_REGION_CODE_SIZE=0x00D00000 -DLPM_REQUIRED -DDSP_HAS_MMU'

Here #define values are being defined that tell where to put the code. If you have to move where the DSP memory is (and you shouldn't have to), be sure all these locations are changed together.

Explore the Stub Files
For those who are interested, you can see the code used to talk to the DSP. Edit Makefile and add --C6Run:debug to the FLAGS:

C6RUN_CFLAGS = -c -O3 -D_DEBUG_ --C6Run:debug
C6RUN_ARFLAGS = rcs --C6Run:replace_malloc --C6Run:debug

Then recompile everything:

beagle$ make dsp_clean
beagle$ make dsp_exec
beagle$ ls -sh

Many interesting files appear. -debug.dsp_image.map shows the memory map for the DSP. To edit it you need to do:

beagle$ gedit ./-debug.dsp_image.map

The ./ is needed, otherwise gedit will think -debug is an option.

To see how the DSP is handled on both the ARM and DSP side, look in the dsp_lib directory.

beagle$ cd dsp_lib
beagle$ gedit *.c

These two stub files show what's happening on the ARM side (audio_process.gpp_stub.c) and the DSP side (audio_process.dsp_stub.c).

When you are done, be sure to remove --C6Run:debug from the Makefile. You'll also have to rm the extra files since make clean doesn't clean them up.

Explore the Object Files
I want to add details about how to use objdump and dis6x to pull the DSP object data from the ARM file and disassemble it.

Explore More Examples
You can find more examples to explore in ~/c6run_build/examples. c6runlib has an example of taking the fft of some data. The fft runs slower on the DSP than the ARM since it's a floating-point fft routine and the ARM has floating-point hardware and the DSP doesn't.

Later you'll see how to run TI's fixed-point routine on the DSP, where it runs some 8 times faster than on the ARM.

c6runapp shows the other form of C6Run. Here the entire C program is run on the DSP. All the I/O from the DSP to the ARM is handled automatically.

Assignment - Experiment with the code
Now that you have something working, play around a bit. git is installed so you can preserve the present contents of the files with:

beagle$ git add Makefile audio_input_output.c audio_process.c audio_thread.c
beagle$ git commit -m "My changes"

If needed you can use git to retrieve the original version of the files.

Things to try:


 * There are places in the code where timing can be displayed. Remove the comments and display the times.  How often is the main loop executed?  How long does the DSP take? What's the overhead for the DSP?
 * Try making the DSP do more than pass through. For example, zero out the left channel. Does sound stop coming out?
 * Try changing the sampling rate and buffer sizes. What setting causes the buffers to overflow or underflow?
 * Switch the input to the microphones on the web cam and listen to your voice.
 * Implement a frequency inverter by multiplying every other element by -1.

Extras (not required)
 * Implement your own processing on the DSP. Do a simple FIR lowpass filter, etc.
 * Implement this Voice Scrambler on the DSP.