BeagleBoard/GSoC/2021 Proposal/GPGPU with GLES



GPGPU with GLES

About Student: Jakub Duchniewicz
Mentors: Hunyue Yau
Code: not yet created!
Wiki: https://elinux.org/BeagleBoard/GSoC/2021_Proposal/GPGPU_with_GLES
GSoC: [1]

Status

Discussing the tentative ideas with Hunyue Yau and others on #beagle-gsoc IRC.

About you

IRC: jduchniewicz
Github: JDuchniewicz
School: University of Turku/KTH Royal Institute of Technology
Country: Finland/Sweden/Poland
Primary language: Polish
Typical work hours: 8AM-5PM CET
Previous GSoC participation: Participating in GSoC, especially with BeagleBoard, would further develop my software and hardware skills and help me apply my current knowledge for the mutual benefit of the open source community. I originally planned to do the YOLO project, but after spending several days researching and preparing the proposal [2], I found that it is impossible on the current BBAI/X15, so I want to pursue another form of heterogeneous computing: GPGPU!

About your project

Project name: GPGPU with GLES computing framework

Description

Since BeagleBoards are heterogeneous platforms, why not use them to their full extent? At present the GPU block lies mostly idle and cannot assist with any computations, due to non-technical reasons. The IP manufacturer provides a library, but it is not open source and thus does not fit the BB's philosophy of open software and hardware. The manufacturer (Imagination Technologies) even offers a series of teaching lectures on the subject, available at [3]. Normally OpenCL would be used for GPU acceleration, but given the limitations discussed above that is impossible. There is, however, another way: GPGPU acceleration with OpenGL ES.

The GPU accelerators from PowerVR (SGX5xx series) are available in all BBs, and a unified acceleration framework for them is the topic of this proposal. The targeted version of OpenGL ES is 2.0, but extending the library to future versions will not be a problem, as newer versions are backwards compatible. Moreover, newer versions could open the way to further kinds of accelerated computation, such as integer and 32-bit floating-point operations. The API of the library will allow various mathematical operations to be performed (limited to a few for the GSoC part of the project) and will return the results to the user in the same format the data was input in.
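
To make this concrete, a rough sketch of what such an API could look like follows. Every name and signature here is hypothetical, since the library has not been written yet:

  // Hypothetical API sketch (C++) -- all names and signatures are illustrative only.
  #include <cstddef>
  #include <vector>

  namespace gles_compute {

  // Single-shot calls: context setup, data packing, shader dispatch and
  // readback all happen behind one function call.
  std::vector<float> add(const std::vector<float>& a, const std::vector<float>& b);
  std::vector<float> multiply(const std::vector<float>& a, const std::vector<float>& b);

  // Streaming mode: keep the GL context and shaders alive between calls and
  // only feed new data each frame.
  class Session {
  public:
      explicit Session(std::size_t elements);  // one-time EGL/shader setup
      std::vector<float> add(const std::vector<float>& a, const std::vector<float>& b);
      ~Session();                              // context teardown
  };

  }  // namespace gles_compute

The one-shot calls match the simple flow described below, while something like Session would correspond to the continuous flow with a reused context.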

Apart from the library, the project will explore the performance gains of using the GPU (compared to the CPU) and provide users with guidelines on when it is feasible to put the BB's GPU accelerator to use and when to refrain from it. This could be a good resource for a training session, as mentioned later in the proposal.

Implementation

The system will comprise two layers: the API layer and the implementation layer. The API will be a set of extensible user calls for accelerating various functions, while the implementation will be the part responsible for mapping the user data to the GPU via a shader program.

[Figure: main components of the system]

The picture above presents the main components of the system.
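
To illustrate the implementation layer's job: in this model the fragment shader effectively plays the role of a compute kernel, with each output pixel corresponding to one element of the result. A minimal GLSL ES 1.00 sketch, embedded as a C string the way the library might carry it (the name and the elementwise-multiply operation are illustrative):

  // Illustrative fragment shader: one invocation per output texel.
  static const char* kMultiplyShader =
      "precision mediump float;\n"
      "varying vec2 uv;\n"                 // texel coordinate from the vertex shader
      "uniform sampler2D texA, texB;\n"    // the two user input buffers as textures
      "void main() {\n"
      "    gl_FragColor = texture2D(texA, uv) * texture2D(texB, uv);\n"
      "}\n";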

The implementation is described along the user flow below; a minimal code sketch of these steps follows the list.

  1. The user creates the data buffers holding the data to be processed and calls the relevant API.
  2. The EGL context is established and a shader program is attached.
  3. The data is translated and loaded to the GPU as a texture.
  4. OpenGL renders the frame.
  5. The data is read out and translated to the form chosen by the user.
  6. The context is closed and cleaned up.
  7. The data is returned to the user.
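
A minimal end-to-end sketch of steps 2-7 for an elementwise add of two buffers is shown below. It is illustrative only: error checking is omitted, and real user data would need an explicit packing step, since baseline OpenGL ES 2.0 only guarantees 8-bit-per-channel textures (float textures require extensions such as OES_texture_float).

  // Minimal sketch of steps 2-7 (C++, OpenGL ES 2.0 + EGL). Illustrative only.
  // Build (assumption): g++ gles_add.cpp -lEGL -lGLESv2
  #include <EGL/egl.h>
  #include <GLES2/gl2.h>
  #include <cstdio>
  #include <vector>

  static GLuint compile(GLenum type, const char* src) {
      GLuint s = glCreateShader(type);
      glShaderSource(s, 1, &src, nullptr);
      glCompileShader(s);                      // status check omitted for brevity
      return s;
  }

  int main() {
      const int W = 64, H = 64;                // one texel per data element

      // Step 2: establish an EGL context on an off-screen pbuffer surface.
      EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
      eglInitialize(dpy, nullptr, nullptr);
      const EGLint cfg_attr[] = {EGL_SURFACE_TYPE, EGL_PBUFFER_BIT,
                                 EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT, EGL_NONE};
      EGLConfig cfg; EGLint n;
      eglChooseConfig(dpy, cfg_attr, &cfg, 1, &n);
      const EGLint pb_attr[] = {EGL_WIDTH, W, EGL_HEIGHT, H, EGL_NONE};
      EGLSurface surf = eglCreatePbufferSurface(dpy, cfg, pb_attr);
      const EGLint ctx_attr[] = {EGL_CONTEXT_CLIENT_VERSION, 2, EGL_NONE};
      EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, ctx_attr);
      eglMakeCurrent(dpy, surf, surf, ctx);

      // Step 2 (cont.): attach a shader program; the fragment shader is the "kernel".
      const char* vs =
          "attribute vec2 pos;\n"
          "varying vec2 uv;\n"
          "void main() { uv = pos * 0.5 + 0.5; gl_Position = vec4(pos, 0.0, 1.0); }\n";
      const char* fs =
          "precision mediump float;\n"
          "varying vec2 uv;\n"
          "uniform sampler2D texA, texB;\n"
          "void main() { gl_FragColor = texture2D(texA, uv) + texture2D(texB, uv); }\n";
      GLuint prog = glCreateProgram();
      glAttachShader(prog, compile(GL_VERTEX_SHADER, vs));
      glAttachShader(prog, compile(GL_FRAGMENT_SHADER, fs));
      glBindAttribLocation(prog, 0, "pos");
      glLinkProgram(prog);
      glUseProgram(prog);

      // Step 3: translate the user buffers into textures (input A, input B, output).
      std::vector<unsigned char> a(W * H * 4, 10), b(W * H * 4, 20);
      GLuint tex[3];
      glGenTextures(3, tex);
      for (int i = 0; i < 3; ++i) {
          glBindTexture(GL_TEXTURE_2D, tex[i]);
          glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
          glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
          glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, W, H, 0, GL_RGBA, GL_UNSIGNED_BYTE,
                       i == 0 ? a.data() : (i == 1 ? b.data() : nullptr));
      }
      glActiveTexture(GL_TEXTURE0); glBindTexture(GL_TEXTURE_2D, tex[0]);
      glActiveTexture(GL_TEXTURE1); glBindTexture(GL_TEXTURE_2D, tex[1]);
      glUniform1i(glGetUniformLocation(prog, "texA"), 0);
      glUniform1i(glGetUniformLocation(prog, "texB"), 1);

      // Render into the output texture through a framebuffer object, not the screen.
      GLuint fbo;
      glGenFramebuffers(1, &fbo);
      glBindFramebuffer(GL_FRAMEBUFFER, fbo);
      glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D,
                             tex[2], 0);

      // Step 4: "render" one full-screen quad = one fragment shader run per element.
      const GLfloat quad[] = {-1, -1, 1, -1, -1, 1, 1, 1};
      glEnableVertexAttribArray(0);
      glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, quad);
      glViewport(0, 0, W, H);
      glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

      // Step 5: read the result back and translate it to the user's format.
      std::vector<unsigned char> out(W * H * 4);
      glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, out.data());
      printf("out[0] = %d (expected 30)\n", out[0]);

      // Steps 6-7: tear the context down; `out` is handed back to the user.
      eglMakeCurrent(dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, EGL_NO_CONTEXT);
      eglTerminate(dpy);
      return 0;
  }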

The library should allow both single-shot computations and reuse of an established GL context, with the shader fed new data every frame. The alternative flow (with a continuous stream of data) establishes the relevant contexts once and asks only for the user input each frame.
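
Continuing the sketch above (same W, H, tex and out), the streaming variant would keep the context and program alive and replace steps 3-5 with a per-frame loop; glTexSubImage2D updates the existing input texture without reallocating it. next_user_buffer() and user_wants_more() are hypothetical stand-ins for the user-facing side:

  // Per-frame path of the streaming mode; setup and teardown happen once, outside.
  do {
      glActiveTexture(GL_TEXTURE0);
      glBindTexture(GL_TEXTURE_2D, tex[0]);
      glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE,
                      next_user_buffer());        // hypothetical: this frame's input
      glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);      // re-run the shader
      glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, out.data());
  } while (user_wants_more());                    // hypothetical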


Alternative ideas

  • There are usually two GPUs; the library may be extended to make use of both of them and work in a streaming fashion
  • Incorporating the DSPs into the acceleration by means of OpenCL
  • Or making this library part of the OpenCL acceleration scheme once it is proven to be stable and performant

Detailed Implementation

API Overview

Expected Performance

Based on previous deployments of YOLOv3 on various embedded systems (Jetson Nano, Raspberry Pi), the expected results are:

  • 1 FPS with full acceleration, or
  • 2-3 FPS with full acceleration if the parallelization can be done in a smart way

The expected result is a non-stalling experience when using the API and processing frames. The user can still use all CPU resources during frame processing, but the EVEs and DSPs are blocked.

Action Items

  • setup the environment
  • run sample models on the BB
  • obtain a deployable YOLOv3 model
  • load the model to TIDL
  • create initial scaffolding
  • write basic implementation using one EO
  • partition the model and create a graph
  • parallelize the implementation
  • research dividing the image into smaller segments for parallelization
  • polish the implementation
  • record video 1 + 2
  • write up about the process on my blog
  • benchmark different approaches
  • polish the API and make it extensible
  • write a sample user application(s)

Timeline

During 25.07.21-08.08.21 I will be attending a summer camp as part of my study programme and will probably be occupied for half of the day. The camp will most likely be held online, though.

Date Milestone Action Items
13.04.21-17.05.21 Pre-work
  • Expand knowledge on YOLOv3 architecture
  • Try parallelizing it in software (I will need help here)
  • Learn some NEON + DSP + OpenCL programming
18.05.21-07.06.21 Community Bonding
  • Familiarize myself with the community
  • Experiment with BB
14.06.21 Milestone #1
  • Introductory YouTube video
  • Setup the environment
  • Write first blog entry - "Part 1: Game plan"
21.06.21 Milestone #2
  • Run sample models on the BB
  • Obtain a deployable YOLOv3 model
  • Load the model to TIDL
28.06.21 Milestone #3
  • Create initial scaffolding
  • Write basic implementation using one EO
  • Write documentation and possibly tests
05.07.21 Milestone #4
  • Partition the model and create a graph
  • Parallelize the implementation (Pipelined EOs)
12.07.21 Milestone #5
  • Research dividing the image into smaller segments for parallelization
  • If possible implement it, if not look for optimizations
  • Write second blog post - "Part 2: Implementation"
  • Evaluate the mentor
19.07.21 Milestone #6
  • Write a sample user application for single image and streaming
26.07.21 Milestone #7
  • Summer camp, mostly collect feedback for the project
  • Less brain-intensive tasks, documentation, benchmarking
31.07.21 Milestone #8
  • Write third blog post - "Part 3: Optimization and Benchmarks"
  • Polish the implementation
07.08.21 Milestone #9
  • Polish the API and make it extensible
  • Prepare materials for the video
  • Start writing up the summary
14.08.21 Milestone #10
  • Finish the project summary
  • Final YouTube video
24.08.21 Feedback time
  • Complete feedback form for the mentor
31.08.21 Results announced
  • Celebrate the ending and rejoice



Experience and approach

I have a strong programming background in the area of embedded Linux/operating systems, having worked as a Junior Software Engineer at Samsung Electronics from December 2017 to March 2020. During this time I also developed a game engine (PolyEngine) in C++ and gave several talks on modern C++ as Vice-President of the game development student group "Polygon".

Apart from that, I completed my Bachelor's degree at the Warsaw University of Technology, successfully defending my thesis titled "FPGA Based Hardware Accelerator for Musical Synthesis for Linux System". For it I created a polyphonic musical synthesizer in Verilog, capable of producing various waveforms, and deployed it on a DE0-Nano-SoC FPGA. Additionally, I wrote two kernel drivers; one exposed an ALSA sound device and was responsible for proper synchronization of DMA transfers.

I am familiar with Deep Learning concepts and the basics of Computer Vision. During my studies at UTU I achieved top grades in my subjects, excelling at Navigation Systems for Robotics and Hardware Accelerators for AI.

I have some experience working with OpenGL, mostly from learning it for the game engine's needs and for personal benefit. This project does not require in-depth knowledge of OpenGL; rather, it requires creating an abstraction over the OpenGL ES bindings and performing the necessary data conversions and extractions inside the API. These are skills I already possess and am proficient in using.

In my professional work I have often had to complete various tasks under time pressure and to scope them properly. Based on this experience, I believe this task is deliverable in the proposed time frame.

Contingency

Since I am used to tackling seemingly insurmountable challenges, I will first of all keep calm and try to come up with an alternative approach if I get stuck along the way. The internet is a vast ocean of knowledge, and time and again I have received help from benevolent strangers on Reddit and other forums. Since I believe humans solve problems best collaboratively, I will turn to #beagle, #beagle-gsoc and relevant subreddits (I have received tremendous help on /r/FPGA, /r/embedded and /r/askelectronics in the past).

If all else fails I may be forced to change my approach and backtrack, but this will not be a big problem: the knowledge won't be lost, and it will only make my future attempts better. Alternatively, I can focus on documenting my progress in the form of blog posts and videos while waiting for my mentor to come back to cyberspace.

I see fewer problems with this project than with the YOLO one, mainly because the track has already been explored by Hunyue and because OpenGL ES is a much more widely used framework than TIDL.

The resources used for this proposal, and for contingency:

  • OpenGL ES problems
  • Shader programming
  • General design

Benefit

As developers use BBs for computationally intensive tasks such as DL inference or ML calculations, a library that allows them to offload part of their computations to otherwise unused parts of the board would be beneficial. This way the work may be parallelized, and the BeagleBoard becomes a truly heterogeneous platform that makes use of all of its components!

Educating developers on when to use the GPU, and when not to, is also a valuable gain for the community, as not all BB developers are familiar with the limitations of CPU-only calculations. This could be a tempting topic for an online training or live Q&A session. Having a reusable library that abstracts the low-level details of OpenGL ES would reduce the hassle programmers have to go through to achieve acceleration.


Due to the same non-technical issues, the GPU is rarely if ever used on the BeagleBoard. Support for current graphics output has been abysmal. This means we have an IP block that can do computations sitting idle. Even if it is not extremely fast, this can offload the ARM CPU, making the Beagle a truly asymmetric multiprocessor SoC.

~ Hunyue Yau aka ds2

Misc

The qualification PR is available here.