BeagleBoard/GSoC/2010 Projects/XBMC

Project: XBMC
Student: Tobias Arrskog

Mentors: Mike Zucchi, Mans Rullgard, Søren Steen Christensen

Repository-git: git://xbmc.git.sourceforge.net/gitroot/xbmc/xbmc and branch gsoc-2010-beagleboard

Repository-svn: http://xbmc.svn.sourceforge.net/svnroot/xbmc/branches/gsoc-2010-beagleboard

Blog: http://xbmc.org/author/topfs2/

Latest blog entries:

Weekly report 12

Weekly report 11

Weekly report 10

What to do when you have the dirty regions

Announcement

Lightning talk:

http://www.youtube.com/watch?v=gvJ32T-W3Gw

http://vimeo.com/12917275

Abstract
Before this summer it was proposed that by limiting what needs to be rendered and when it needs to be rendered, it might just be possible to make XBMC Media Center usable on the Beagle Board. Having XBMC work on a device like the Beagle Board can open up a big world for XBMC, as it makes it a viable application for embedded systems found in TVs, phones, DVD players and set-top boxes. While XBMC offloads a ton of work to the GPU, on an embedded platform such as the Beagle Board it's not enough to leverage only the GPU; one also needs to leverage the other processing cores to get the job done. It was therefore also suggested to use the OMAP overlay to make video decoding a lot smoother.

Dependencies
Koen has been nice enough to add a dependency list to Narcissus, but for those who don't want to make a new image, here is the package list:

opkg install task-native-sdk boost-dev libgles-omap3-dev libsamplerate0-dev liblzo2-dev bzip2-dev libwavpack-dev mpeg2dec-dev libfribidi-dev libpcre-dev libcdio-dev libmodplug-dev flac-dev libsdl-mixer-1.2-dev libsdl-image-1.2-dev alsa-dev enca-dev libxt-dev libxtst-dev libxmu-dev libxinerama-dev curl-dev libmicrohttpd-dev gperf cmake zip git python-devel openssl-dev cvs pkgconfig-dev libxrender-dev libxrandr-dev glibc-gconv-ibm850 glibc-charmap-ibm850 angstrom-version

Interesting patches for beagleboard
To enable the experimental OMAP overlay video renderer, use the configure option --enable-omap-overlay.

To enable dirty region based rendering, add the following as advancedsettings.xml (~/.xbmc/userdata/advancedsettings.xml):

1

Note that algorithm 0 simply redraws everything on every frame, 1 uses a unified region and 2 uses cost reduction.

Build Instructions (native)
Since XBMC is a big application, linking it takes up the entire RAM of the Beagle Board, which means that we need swap (here is how this can be done: http://www.redhat.com/docs/manuals/linux/RHL-8.0-Manual/custom-guide/s1-swap-adding.html).

export CFLAGS="-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -O2"

export LDFLAGS="-Wl,-O1 -Wl,--hash-style=gnu"

./bootstrap.angstrom

./configure --enable-gles --enable-omap-overlay --prefix=/usr --sysconfdir=/etc --cache=config.cache --disable-optical-drive

make

make install

Sit back and enjoy, the build process will take a few hours.

Build Instructions (cross compile)
First begin by setting up OpenEmbedded by following this tutorial http://www.angstrom-distribution.org/building-angstrom

While it's OK to build xbmc directly now, it will take a lot of disk space, so build the dependencies first.

MACHINE=beagleboard bitbake libxmu fribidi mpeg2dec ffmpeg samba fontconfig curl libmodplug libmicrohttpd wavpack libmms cmake-native

MACHINE=beagleboard bitbake libsdl-image libsdl-mixer mysql5 sqlite3 libmms faad2 libcdio libpcre boost lzo2 enca avahi libsamplerate0 libxrandr bzip2 virtual/libsdl

Get hold of the SDK for SGX, put it into downloads and run

MACHINE=beagleboard bitbake virtual/egl

Koen has already provided a bitbake recipe for trunk, but if you wish to follow progress on my branch, these are the steps you need to take.

If you wish to follow using git, just open xbmc_svn.bb and switch the branch to gsoc-2010-beagleboard. The gsoc branch has all the patches required for cross compilation, so no additional patches are needed. Set SRC_URI to:

SRC_URI = "git://xbmc.git.sourceforge.net/gitroot/xbmc/xbmc;protocol=git;branch=gsoc-2010-beagleboard"

Then all that's needed to follow my branch is to change SRCREV to the hash of the revision you wish to build from. This is kind of annoying to change all the time, so I use a local git clone which I update instead. Here I have cloned the git repository from xbmc to /home/topfs/xbmc/ and use the following recipe: http://pastebin.com/cF33fPhg

Now it's possible to build xbmc using:

MACHINE=beagleboard bitbake xbmc

Optimizing skins for Beagle Board
There are a number of ways to optimize skins, some of which are general and some of which specifically target embedded platforms and GLES. Both will be covered in the following section.

General optimizations
On the desktop it's very rare that the GPU is the limiting factor when rendering XBMC, but lower resource usage can mean less heat generation and slower fan speeds, as well as lower power requirements, which could mean longer battery life or a cheaper electricity bill. That said, some slower built-in graphics cards could still use the performance boost at the higher resolutions. Generally speaking, more controls being rendered means more performance is needed, so it makes sense to limit redundant controls where possible.

A fluid skin is not just something which runs at a fast framerate once loaded, but also a skin which loads each window quickly. It's worth considering that the hard drive in an HTPC might not be as fast as the one in the workstation, and by far the thing which takes longest to load is images. While it is possible to load images in the background to get a fast initial load of a window, it's not very nice if it takes tens of seconds before the window is fully finalized. Scaling and positioning textures in accelerated rendering systems is usually almost free, which means that using the border tag in skinning to limit the size of a texture could significantly lower the load.

Embedded and Beagle Board specific
Up to and including the C4 revision of the Beagle Board, the GPU is the harsh bottleneck, and to optimize rendering it's worth understanding what takes the most resources.

On embedded platforms bandwidth is usually very limited. This is true both for uploading data to RAM/GPU and for the bandwidth used to manipulate the backbuffer. To create a fluid skin on an embedded platform it's thus vital to limit the textures needed to present the skin, which reduces both loading time and rendering time. It's very wise to use GIMP or Photoshop to create the mockups, since layers are very useful in the development stage, but when the skin is meant to be used on an embedded platform, unnecessary layers can take quite a lot of resources to render. This holds true especially when multiple layers are used to form one static image; it's a lot faster to merge down the layers in GIMP or Photoshop and only use the merged image in XBMC. Every layer that can be removed will help a lot, so it is worth considering skipping dynamic backgrounds for embedded usage and merging overlays and backgrounds into a single layer for the skin.

On the Beagle Board, showing a few buttons will yield about 60fps in 720p, but adding one non-blended background drops the speed to 20fps, and this is with only one merged background! Since the SGX graphics core is designed to be used mostly in low resolution situations, the textures used are generally small. If a texture is too big, the SGX core might not be able to use the texture cache properly, which means that using textures above 512x512 is significantly slower, regardless of the rendering area. So skinners need to think long and hard before using larger textures, and if a large texture is required, splitting it into pieces with multiple controls can be a lot faster. Power of two textures are also usually faster. For example, a 1280x720 texture renders at 20fps, while four 512x256 textures rendered over the same area reach 30fps; the resulting picture is the same to the user but with a 50% performance increase!
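The two rules of thumb above (stay at or below 512x512, prefer power of two sizes) can be checked mechanically when preparing skin textures. The following is a minimal sketch, not part of XBMC; the function names and the `MAX_SIDE` constant are hypothetical and only encode the numbers quoted in the text.

```python
# Hypothetical helper for skinners: flag textures that break the rules of
# thumb described above. MAX_SIDE is taken from the 512x512 figure in the
# text and is an assumption, not a documented SGX limit.
MAX_SIDE = 512


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (positive integers with a single bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def texture_warnings(width, height, max_side=MAX_SIDE):
    """Return a list of human readable warnings for one texture size."""
    warnings = []
    if width > max_side or height > max_side:
        warnings.append("larger than %dx%d" % (max_side, max_side))
    if not (is_power_of_two(width) and is_power_of_two(height)):
        warnings.append("not power of two")
    return warnings


print(texture_warnings(1280, 720))  # ['larger than 512x512', 'not power of two']
print(texture_warnings(512, 256))   # []
```

Running a check like this over the media folder of a skin gives a quick list of candidates for splitting or resizing.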

Using textures with alpha is a very common practice for creating nice looking skins, but it's worth knowing that rendering with blending enabled will use more than twice the amount of bandwidth, since it needs to read the backbuffer, blend and then write the result back. Without blending it only needs to write. Note that the needed bandwidth is on a per pixel basis, so avoiding alpha on big images, for example backgrounds, is key. XBMC isn't extremely smart at understanding whether an image has alpha or not, so for non-alpha images use JPG, as it's guaranteed to work.
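To put rough numbers on the cost of blending, the backbuffer traffic for one full screen quad can be estimated as below. This is a back-of-the-envelope sketch that assumes a 32bit (4 bytes per pixel) backbuffer and counts only backbuffer accesses; texture fetches come on top of this, which is why the real cost of blending is more than twice that of an opaque draw.

```python
# Rough estimate of backbuffer traffic for drawing one quad.
# Assumption: 4 bytes per pixel (32bit backbuffer); texture reads are ignored.
WIDTH, HEIGHT = 1280, 720
BYTES_PER_PIXEL = 4


def backbuffer_bytes(width, height, blended, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes read from and written to the backbuffer for one quad."""
    # Opaque: one write per pixel. Blended: one read plus one write per pixel.
    accesses = 2 if blended else 1
    return width * height * bytes_per_pixel * accesses


opaque = backbuffer_bytes(WIDTH, HEIGHT, blended=False)
blended = backbuffer_bytes(WIDTH, HEIGHT, blended=True)
print(opaque)   # 3686400
print(blended)  # 7372800
```

For a 720p background this is roughly 3.7 MB per frame without blending and 7.4 MB with it, before any texture data has been read.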

Dirty region rendering
Most of my time during this Google Summer of Code was spent on making the rendering system in XBMC able to render only what has changed. This is a very common technique when there is little available bandwidth and the interface rarely changes over big areas. In theory, dirty region based rendering should allow quite a big performance increase if the interface is close to static. In XBMC this is generally true, but given that it's meant for TV usage and viewed from a distance, it's quite normal that rather large parts change when something happens. Thus the hard limit for the performance increase is up to the skinner: if he or she can design a skin which only changes in small areas at any one time, the gain is large. This can be accomplished if animations are few or small.

How does it work?
When you use XBMC, every view or screen you see is called a window. A skinner has the ability to form a window however they see fit. A window is made up of controls or groups of controls. A control could be an inanimate object such as a background image, or something you interact with, for example a button. This means the skinner has full control over what is displayed in every window and how. Under the hood XBMC will process each control, and if something changes it will mark the area covered by that control as dirty. When every control has had the opportunity to mark, XBMC will try to optimize how it should render the possibly overlapping dirty regions. Note that since OpenGL normally flips back and front buffers by pointer, i.e. the backbuffer or drawing board is not the same as the one just presented, XBMC needs to track the dirty regions of the presented buffer as well. Normally OpenGL is doublebuffered, and thus we track dirty regions for 2 renderings.
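The bookkeeping described above can be sketched in a few lines. This is an illustration of the idea only, not the code in XBMC; the `Rect`, `unified` and `DirtyTracker` names are hypothetical. With two buffers, the buffer about to be drawn was last rendered two frames ago, so it is missing both the changes of the current frame and those of the previous frame.

```python
from collections import deque, namedtuple

# Hypothetical types for illustration; not the classes used in XBMC.
Rect = namedtuple("Rect", "x1 y1 x2 y2")


def unified(regions):
    """Bounding rectangle of all regions (a 'unified region'), or None."""
    if not regions:
        return None
    return Rect(min(r.x1 for r in regions), min(r.y1 for r in regions),
                max(r.x2 for r in regions), max(r.y2 for r in regions))


class DirtyTracker:
    def __init__(self, buffers=2):
        # Dirty regions of the frames the current backbuffer has not seen.
        self.history = deque(maxlen=buffers - 1)
        self.current = []

    def mark(self, rect):
        """Called by a control whose covered area has changed."""
        self.current.append(rect)

    def end_frame(self):
        """Return the regions to redraw in the backbuffer for this frame."""
        to_draw = list(self.current)
        for older in self.history:
            to_draw.extend(older)
        self.history.append(self.current)
        self.current = []
        return to_draw


tracker = DirtyTracker(buffers=2)
tracker.mark(Rect(0, 0, 100, 50))
print(len(tracker.end_frame()))  # 1: only this frame's change
tracker.mark(Rect(200, 200, 300, 260))
print(len(tracker.end_frame()))  # 2: this frame's change plus the previous one
print(len(tracker.end_frame()))  # 1: the other buffer still lacks the last change
print(len(tracker.end_frame()))  # 0: both buffers are up to date
```

Passing the returned list through `unified` gives a single rectangle covering everything that needs to be redrawn.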

Leverage dirty region based rendering as a skinner
The core problem dirty region based rendering tries to solve is to render only what has changed. This is only effective if the changing parts are kept small. While without dirty regions a skinner could create a minimal skin which is rather fast, with dirty region based rendering a skinner has the possibility to present a complex skin with the illusion of a fast framerate. The key here is that if a skinner is able to confine changes to small areas, the rendering becomes incremental over the entire screen. During Google Summer of Code a cost reduction algorithm was devised to handle small areas that are separated by large distances. For example, if a small change happens in the lower left corner and another in the upper right corner, the entire screen isn't redrawn.
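One way such a cost reduction could work is sketched below. This is an assumed, simplified greedy scheme for illustration and not necessarily the algorithm implemented in the branch: two regions are merged only when drawing their bounding rectangle is estimated to be cheaper than drawing them separately. The cost model (area in pixels plus a fixed, made-up `per_region_cost` overhead) is hypothetical.

```python
from collections import namedtuple

# Hypothetical types and cost model; not taken from the XBMC source.
Rect = namedtuple("Rect", "x1 y1 x2 y2")


def area(r):
    return max(0, r.x2 - r.x1) * max(0, r.y2 - r.y1)


def union(a, b):
    return Rect(min(a.x1, b.x1), min(a.y1, b.y1),
                max(a.x2, b.x2), max(a.y2, b.y2))


def reduce_regions(regions, per_region_cost=1000):
    """Greedily merge pairs of regions while a merge lowers the estimated cost."""
    result = list(regions)
    merged = True
    while merged:
        merged = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                a, b = result[i], result[j]
                separate = area(a) + area(b) + 2 * per_region_cost
                together = area(union(a, b)) + per_region_cost
                if together <= separate:
                    result[j] = union(a, b)  # keep the merged rectangle
                    del result[i]            # drop the absorbed one (i < j)
                    merged = True
                    break
            if merged:
                break
    return result


# Two small changes in opposite corners of a 1280x720 screen stay separate.
corners = [Rect(0, 680, 40, 720), Rect(1240, 0, 1280, 40)]
print(len(reduce_regions(corners)))  # 2

# Two overlapping changes are cheaper to draw as one rectangle.
close = [Rect(0, 0, 100, 100), Rect(90, 0, 200, 100)]
print(len(reduce_regions(close)))    # 1
```

With a unified region the first case would redraw the whole screen; with the cost based merge only the two 40x40 corners are redrawn.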

OMAP Overlay
In the desktop segment XBMC utilizes the GPU to do the YUV to RGB conversion, both to offload these calculations from the CPU and to limit the upload bandwidth. This is a very optimized approach if the GPU is powerful enough to do this while still presenting the GUI; on the Beagle Board this is sadly not the case. Thankfully the OMAP platform the Beagle Board is based on has a hardware accelerated display driver which is capable of doing this conversion, as well as scaling and positioning, in hardware. The result is that both the CPU and the GPU are relieved of the strain of doing this conversion, which in turn makes it possible to do this at resolutions up to 720p. Sadly the player in xbmc isn't optimized enough to give 720p playback on the Beagle Board using just the CPU for decoding. The driver is also capable of blending the resulting RGB with the other layers to produce the illusion of having rendered things over the video. This is needed when displaying, for example, volume changes or on screen menus for controlling the video.

Technicals
The overlay reads data from the framebuffers found in /dev/fbX. Multiple overlays can read from the same framebuffer, and an overlay can be told to read from only a part of the framebuffer. The display area of the overlay can be larger or smaller than the area in the framebuffer, and the overlay will then scale it accordingly. Since the overlay is able to read only part of the data in a framebuffer, it's possible to create a framebuffer larger than the data it's meant to hold, use it as a front- and backbuffer, and switch the location the overlay reads from before presenting. Overlay 1 (video layer) is able to read YUV422 (or RGB), which will be transformed, scaled and positioned before presenting, all in hardware.

Normally X and SGX render to fb0, which is linked to overlay 0 (graphics layer). The graphics layer is above the video layer, but interestingly enough the display manager by default sets the final alpha of the graphics layer to zero when mixing, making the video layer appear over the graphics. This can be changed, however, making it possible to render with alpha to framebuffer 0 and to choose what will and what will not be above the video layer. Note that this is only possible if the framebuffer is set to 32bit BGRA, which it should be by default in Ångström.

Using the Overlay
The overlay is available for use by any user in the group 'video'. It's controlled through POSIX ioctl calls and used by filling the framebuffer with data. While it's perfectly possible to display video using singlebuffering, doublebuffering will give prettier and sturdier code. With only singlebuffering we would need to make sure we never overwrite something that the overlay hasn't read yet, and it can be very tricky to keep reads and writes from the different processes in sync. Doublebuffering does need more memory, but that's usually no problem since we are dealing with YUV. Since we want to limit unnecessary copying, a smart solution is to use just one framebuffer for doublebuffering. The framebuffer is made twice as large, and we only ever let the overlay display half of it. This means that we dedicate one part of the framebuffer as frontbuffer and the other part as backbuffer. Just before we state that we are ready for displaying, we switch the offset the overlay reads from, and thus we have switched the front- and backbuffers.

Here is a quick walkthrough of how to set it up for use:
 * Open /dev/fb1
 * Set up memory for the overlay with OMAPFB_SETUP_MEM.
 * Map the memory the overlay has created on the framebuffer as rw using mmap.
 * Set up the screeninfo with FBIOPUT_VSCREENINFO. Here both xbmc and omapfb use doublebuffering and as such set up the virtual resolution of the screen to be double the size. The overlay is able to read differently coded data, and we select YUV422 to let the overlay handle the conversion, leaving the CPU free to do decoding.
 * Set up the plane with OMAPFB_SETUP_PLANE. Here we define where and how big the output picture should be. This makes the DSS do both scaling and positioning. This is a call you will want to make any time the picture needs to be scaled or positioned differently.
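The values involved in the memory and screeninfo steps above can be worked out with simple arithmetic. The sketch below only computes them and performs no ioctl calls. The dictionary keys mirror field names of the Linux fb_var_screeninfo structure; stacking the two buffers vertically (doubling yres_virtual) and a line stride without padding are assumptions made for illustration.

```python
# Compute the screeninfo values and memory size for a double-buffered
# YUV422 overlay. No hardware is touched; this is arithmetic only.
# Assumptions: buffers are stacked vertically and lines are not padded.
YUV422_BITS_PER_PIXEL = 16  # packed YUV422 uses 2 bytes per pixel


def double_buffered_screeninfo(width, height,
                               bits_per_pixel=YUV422_BITS_PER_PIXEL):
    """Visible and virtual resolution for one front- and one backbuffer."""
    return {
        "xres": width,
        "yres": height,
        "xres_virtual": width,
        "yres_virtual": height * 2,  # room for front- and backbuffer
        "xoffset": 0,
        "yoffset": 0,                # start by showing the first half
        "bits_per_pixel": bits_per_pixel,
    }


def framebuffer_bytes(info):
    """Memory needed to hold the whole virtual framebuffer."""
    return (info["xres_virtual"] * info["yres_virtual"]
            * info["bits_per_pixel"] // 8)


info = double_buffered_screeninfo(1280, 720)
print(info["yres_virtual"])     # 1440
print(framebuffer_bytes(info))  # 3686400
```

A 720p YUV422 frame is about 1.8 MB, so the double-buffered framebuffer needs roughly 3.7 MB, which is the kind of size that would be requested in the memory setup step.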

Now that we have set up the overlay, what remains is to feed data into the fb which we have memory mapped. If we have made it doublebuffered, we feed only the backbuffer and issue a flip. When the flip has been issued we MUST wait until vsync before filling the new backbuffer, otherwise we can't be sure that the data has been read and displayed, and we could get tearing. When doublebuffering, before each flip we need to set where our new frontbuffer is (what we just called the backbuffer); this is done by calling FBIOPAN_DISPLAY with the correct xoffset and yoffset. To flip the buffer we call OMAPFB_WAITFORGO, and then all we need is to wait for vsync.
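The offset bookkeeping for the flip can be sketched as follows. This is a simulation of the arithmetic only: the `DoubleBuffer` class is hypothetical, no ioctl is issued, and it assumes the two buffers are stacked vertically in one framebuffer with 2 bytes per pixel (packed YUV422) and no line padding.

```python
# Simulate which half of the framebuffer is shown and which is written to.
# Assumptions: vertical stacking, 2 bytes per pixel, no line padding.
class DoubleBuffer:
    def __init__(self, width, height, bytes_per_pixel=2):
        self.width = width
        self.height = height
        self.bytes_per_pixel = bytes_per_pixel
        self.front_yoffset = 0  # the overlay starts by reading the first half

    @property
    def back_yoffset(self):
        """Line offset of the half the overlay is NOT reading."""
        return self.height - self.front_yoffset

    def back_byte_offset(self):
        """Where in the memory mapped framebuffer the next frame is written."""
        return self.back_yoffset * self.width * self.bytes_per_pixel

    def flip(self):
        """Swap roles; returns the yoffset to use when panning the display."""
        self.front_yoffset = self.back_yoffset
        return self.front_yoffset


buf = DoubleBuffer(1280, 720)
print(buf.back_byte_offset())  # 1843200: write the first frame to the second half
print(buf.flip())              # 720: pan the overlay to the second half
print(buf.back_byte_offset())  # 0: the first half is now the backbuffer
print(buf.flip())              # 0: pan back to the first half
```

In a real player the value returned by `flip` would be used as the yoffset of the pan call, followed by the wait for vsync described above, before writing at `back_byte_offset()` again.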

End words
Looking back to the beginning of this summer, XBMC didn't even compile on Ångström, and now it's possible to at least use XBMC and the Beagle Board as a fully working SD box. While there are probably still lots of rendering optimizations to be made, and there is still a need to create a really optimized skin, I would like to say I am happy with the result. Before, xbmc ran at about 10fps (with the default skin); now, with dirty region based rendering turned on, it runs at almost 20fps (default skin). The POC optimized skin, which looks essentially the same as the default skin, runs at 30fps with dirty region based rendering turned on, which hits the goal I was aiming for at 720p!

It's also clear that the greatest bottleneck is still the GPU, which will be significantly faster on the Beagle Board xM; as such it might be a very viable HD choice for XBMC. Much of the profiling suggests XBMC is heavy on texture usage, which is no surprise since every control uses textures. As such it would probably be worth checking out texture compression, which is used on desktop versions but mostly to limit the decode time of the picture. Texturing with the SGX seems to very much dislike large textures, and it probably needs to move lots of data, eating cycles, to handle them. This is the main upside of texture compression for the Beagle Board, but it might be enough to just allow 16bit textures, or even to split up textures and use multiple polygons, perhaps with a more advanced shader so the rendering can still be batched. Tests suggest splitting a 720p picture into 4 pieces yields as much as a 50% increase, which is significant!