Difference between revisions of "BeagleBoard/GSoC/2023 Proposal/OpenGLES acceleration for DL"

From eLinux.org
(Implementation Details)
(Timeline)
 
(16 intermediate revisions by the same user not shown)
Line 13: Line 13:
 
=Proposal=  
 
=Proposal=  
 
* Completed All the requirements listed on the ideas page.
 
* Completed All the requirements listed on the ideas page.
* The PR request for cross-compilation task: [https://github.com/jadonk/gsoc-application/pull/176 task]
+
* The PR request for cross-compilation [https://github.com/jadonk/gsoc-application/pull/176 task].
  
 
=About you=
 
=About you=
Line 29: Line 29:
 
==Description==
 
==Description==
 
====Overview====
 
====Overview====
The aim of the project is to accelerate as many layers as possible in neural network by using OpenGLES-enabled GPU in BeagleBoard X15/AI-64. I will be using '''Shaders''' to run on the GPU. Shaders are the user-defined program that run on the GPU of the board. The use of shaders for computation can result in significant speedup, as GPUs are designed to process large amounts of data in parallel. Out of various shaders that can be used on the GPU, the shaders I will be using are the '''Compute Shaders'''.  
+
Deep learning is a subset of machine learning that uses neural networks with multiple layers. These networks consist of layers of interconnected nodes, each building upon the previous layer to refine and optimise the prediction.
  
Compute Shaders can be used to accelerate the performance of the YOLO model in the Darknet CNN Framework. They can be used to perform '''parallel-processing''', which will eventually help in performing heavy computations in the Deep Learning Algorithm. This will allow multiple calculations to be performed simultaneously by using some features like CUDA and OpenCL. To use Compute Shaders, we will need to identify which type of layers can be accelerated in the '''YOLO model'''.
+
The main goal of the project is to accelerate as many layer types as possible using GPU APIs such as OpenGLES and Vulkan, with Darknet as the deep learning framework.
  
I will be adapting '''convolution layer''' in this project. The reason to target this layer is it has the ability to learn and extract hierarchical representations of the input data, such as images. Additionally convolution layer are computationally efficient and can be highly parallelized, making them ideal for acceleration using OpenGLES shaders. By accelerating convolutional layers using compute shaders, we can significantly improve the performance of deep learning models.
+
====Shaders====
 +
Shaders are user-defined programs that run on the GPU of the board. Using shaders for computation can result in significant speedup, as GPUs are designed to process large amounts of data in parallel. Of the various shader types available, I will be using '''Compute Shaders'''. They can perform parallel computations, such as matrix multiplication and convolution, which are common in deep learning applications. Compute shaders are written in GLSL and are executed on the GPU using the glDispatchCompute function, which is available in OpenGL ES 3.1 and later.
  
Once we have identified the layer types that can be accelerated using compute shaders, we can develop optimized shader programs that perform these computations on the OpenGLES-enabled GPU. These shader programs would need to take into account the specific architecture of the GPU and optimize the computations for maximum parallelism. Next, we would integrate the compute shaders into the Darknet CNN framework, which would require modifying the existing code to support the use of compute shaders for these layer types. We would also need to verify that the implementation is correct and benchmark the performance gains achieved by the compute shader-accelerated layers.
 
  
=====Implementation Details=====
+
====Darknet====
* Implementation of this project involves knowledge of Deep Learning, understanding of Neural Networks, YOLO model, Darknet framework, convolution Neural Network and the OpenGLES API.
+
[https://pjreddie.com/darknet/ Darknet] is an open source neural network framework written in C and CUDA. It is fast, compatible, easy to install, and supports CPU and GPU computation. Darknet is used in the project to implement the YOLO object detection and recognition model.
* Reason to use '''YOLOv3''' is that it is the fastest object detection algorithms with high detection accuracy. It uses Darknet-53 which has 53 convolution layers making it powerful.
 
* Also YOLOv3 is easy to implement and can run on variety of platforms like GPUs. It can detect wide range of objects and can handle Intricate environments.
 
* Next step would be to identify the layer for acceleration using OpenGLES shaders. There are various layers that can build Convolutional Neural Networks as mentioned [https://pyimagesearch.com/2021/05/14/convolutional-neural-networks-cnns-and-layer-types/ here]. As mentioned earlier, I will be targeting convolution layers in this project.
 
* The third step is to develop and optimize compute shader programs for the targeted layers. Compute shaders are a type of shader program that can be executed on the GPU. They are highly parallel and can perform computations in parallel on multiple data.
 
* Then, I will be integrating the optimized shaders into the YOLOv3 model pipeline using OpenGLES APIs.
 
* Finally, I will start by testing and evaluating the performance of the accelerated YOLO model. The performance of the model can be evaluated based on its accuracy, speed, and memory usage. Comparing the performance of the accelerated model with the original model can help determine the effectiveness of the optimization techniques used.
 
  
===Timeline===
+
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm that uses a single neural network to detect objects. The YOLO model consists of multiple convolutional layers that extract features from the input image and several fully connected layers that produce the output of the model. These layers have many parameters that need to be optimised during training to achieve high accuracy in object detection and recognition.
 +
 
 +
In this project, Darknet is used as the deep learning framework to implement the YOLO model and optimise its performance.
 +
 
 +
 
 +
====YOLO Pipeline====
 +
Of the various YOLO versions (YOLO, YOLOv2, YOLOv3, etc.), I will be adapting YOLOv3 in this project. YOLOv3 is extremely fast and accurate. In mAP measured at 0.5 IOU, YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model.
 +
 
 +
The YOLOv3 model consists of various layers such as convolution layers, route layers, up-sampling layers, region layers and maxpool layers. To accelerate the YOLOv3 model, we will run the computations of these layers on the OpenGLES-enabled GPU of the target hardware platform. The GPU can perform the computations required by certain layers in parallel, which can greatly reduce the processing time.
 +
 
 +
====Vulkan====
 +
Vulkan can be used for general-purpose computing, and provides features such as compute shaders, which can be used to perform complex computations in parallel on GPUs. The project will involve implementing and optimizing compute kernels for various layers of neural networks using Vulkan compute shaders. These kernels will include operations such as convolution, pooling and more. The Vulkan API will be used to manage resources such as buffers and images, as well as to schedule compute shader execution on the GPU.
 +
 
 +
Unlike OpenGLES, Vulkan requires the application to describe explicitly what it intends to do, which can lead to better performance and less surprising driver behaviour. Vulkan is a newer API that provides more control and flexibility. It is designed to take advantage of modern GPU hardware and can provide better performance than OpenGLES in some cases.
 +
 
 +
One of the key features of Vulkan is that the compute shader is completely separated from the graphics part of the pipeline.
 +
 
 +
[[File:Vulkan_pipeline_block_diagram.jpg|900px|thumb|center]]
 +
 
 +
With the compute shader stage being detached from the graphics pipeline we'll be able to use it anywhere.
 +
 
 +
* Data type Extension
 +
** By default, the 32 bit floating precision is used for both training and inferencing, which are basically just running a computational graph:
 +
*** Training runs a forward pass, and often times a backward pass to propagate back the gradient
 +
*** Inferencing is just about doing the forward pass
 +
** But both can be done in lower precision types for faster compute time and reduced data storage.
 +
*** The following extensions are available in Vulkan:
 +
**** VK_KHR_shader_float16_int8
 +
**** VK_KHR_8bit_storage / VK_KHR_16bit_storage
 +
*** 8 bit integers data types are used for quantized Neural Nets
 +
*** FP16 data types can be used for faster math with gradient rescaling in training
 +
 
 +
* Improved Compute Shader
 +
** New extensions are devised to improve efficiency
 +
*** VK_KHR_workgroup_memory_explicit_layout
 +
*** Allow more efficient data loading into shared memory for further use with efficient matrix multiplication operations.
 +
*** VK_EXT_ML_primitives: Exposes basic primitives used in the main stream Neural Nets as optimized building blocks
 +
 
 +
* Extension Available:
 +
** VK_NV_cooperative_matrix
 +
** Accelerates large, low-precision matrix multiplies
 +
** Exposes high throughput matrix/vector multiplication  units.
 +
** Typically used by convolution / matmul layers in fp16 format.
 +
** Core compute function for deep learning
 +
** The following code snippet illustrates how you might employ the extension.
 +
 
 +
<source line lang="C">
 +
//This code performs a matrix multiplication operation
 +
// using cooperative matrices loaded from two input matrices A and B.
 +
 
 +
for (uint chunkK = 0; chunkK < K; chunkK += TILE_K) {
 +
    fcoopmatNV<16, gl_ScopeSubgroup, lM, lK> matA[C_ROWS];
 +
    [[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
 +
        uint gi = TILE_M * tileID.y + lM * i;
 +
        uint gk = chunkK;
 +
        coopMatLoadNV(matA[i], inputA.x, strideA * gi + gk, strideA, false);
 +
    }
 +
    fcoopmatNV<16, gl_ScopeSubgroup, lK, lN> matB;
 +
    [[unroll]] for (uint j = 0; j < C_COLS; ++j) {
 +
        uint gj = TILE_N * tileID.x + lN * j;
 +
        uint gk = chunkK;
 +
        coopMatLoadNV(matB, inputB.x, strideB * gk + gj, strideB, false);
 +
        [[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
 +
            result[i][j] = coopMatMulAddNV(matA[i], matB, result[i][j]);
 +
        }
 +
    }
 +
}
 +
</source>
 +
 +
 
 +
 
 +
 
 +
=====Benefits of Vulkan Compute Shaders=====
 +
* Highly parallelized computation: Compute shaders in Vulkan are designed to execute a large number of parallel computations on GPU hardware, which can provide significant performance benefits over CPU-based computations.
 +
* Flexibility: We can write custom shaders to perform a wide range of compute tasks, including machine learning inference. This allows for greater flexibility in optimising performance and achieving better accuracy.
 +
* Memory access: Compute shaders can access memory resources that are shared with graphics shaders, providing more efficient memory utilization and reducing the need for data transfers between the CPU and GPU.
 +
* Synchronization: Vulkan provides synchronization mechanisms for coordinating access to shared memory resources and ensuring that compute shaders execute in the correct order. This will allow us to take advantage of the parallelism of compute shaders while avoiding race conditions and other synchronization issues.
 +
 
 +
==Implementation Details==
 +
 
 +
====1. Identifying the layer that can benefit from GPU acceleration====
 +
 
 +
'''Convolution layer''':
 +
 
 +
The convolution layer is a fundamental building block in deep neural networks. The convolution operation involves sliding a filter or kernel over an input image, computing dot products between the filter and local
 +
patches of the image to produce a feature map. The convolution layer is used extensively in the backbone network to extract high-level features from the input image. By adapting the convolution layer for
 +
acceleration using OpenGLES shaders, we can significantly speed up the computation time and improve the overall performance of the YOLOv3 model on resource-constrained devices.
 +
 
 +
'''Route layer''':
 +
 
 +
The route layer can also be used in the implementation to accelerate the YOLOv3 pipeline using OpenGLES. The route layer is used to concatenate feature maps from different layers. It can concatenate two or more
 +
feature maps along the channel dimension. By doing so, it enables the network to combine features learned from different layers and extract more complex features.
 +
 
 +
'''Up-Sampling layer''':
 +
 
 +
Upsampling layers can be used in the YOLO pipeline to increase the resolution of the feature maps before passing them to subsequent layers. Upsampling can be implemented using various techniques such as bilinear or nearest-neighbor interpolation, or transposed convolution.
 +
 
 +
'''Region layer''':
 +
 
 +
The region layer is an important layer in the YOLOv3 model that is responsible for predicting the object bounding boxes and associated class probabilities.
 +
 
 +
'''Maxpool layer''':
 +
 
 +
The maxpool layer can be used in the YOLO pipeline to downsample the feature maps and reduce their spatial resolution. The maxpool layer can be used to extract the most important features from each local region of the input feature map and reduce its size, thus reducing the computational cost of subsequent layers.
 +
 
 +
 
 +
==== 2. Writing the shader code using the OpenGLES and Vulkan API to perform the computations required by the selected layers on the GPU.====
 +
 
 +
The shader code will need to be optimized for parallel processing.
 +
Here is an example of a fragment shader (GLSL ES) that applies a 3x3 convolution kernel to a texture using the OpenGLES API:
 +
<source line lang="C">
 +
 
 +
uniform float uKernel[9];
 +
uniform sampler2D uSampler;
 +
uniform vec2 uTextureSize;
 +
 +
varying vec2 vTexCoord;
 +
 +
void main(void)
 +
{
 +
    vec4 sum = vec4(0.0);
 +
    vec2 stepSize = 1.0/(uTextureSize);
 +
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y - stepSize.y))
 +
            * uKernel[0];
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y - stepSize.y))
 +
            * uKernel[1];
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y - stepSize.y))
 +
            * uKernel[2];
 +
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y))
 +
            * uKernel[3];
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y))
 +
            * uKernel[4];
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y))
 +
            * uKernel[5];
 +
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y + stepSize.y))
 +
            * uKernel[6];
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y + stepSize.y))
 +
            * uKernel[7];
 +
    sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y + stepSize.y))
 +
            * uKernel[8];
 +
 +
    sum.a = 1.0;
 +
 +
    gl_FragColor = sum;
 +
}
 +
</source>
 +
 
 +
The Vulkan APIs Shader Compute Program:
 +
<source line lang="C">
 +
VkShaderModuleCreateInfo createInfo = {};
 +
createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
 +
createInfo.codeSize = shaderCode.size();
 +
createInfo.pCode = reinterpret_cast<const uint32_t*>(shaderCode.data());
 +
 
 +
VkShaderModule shaderModule;
 +
if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
 +
    throw std::runtime_error("Failed to create shader module!");
 +
}
 +
</source>
 +
 
 +
 
 +
==== 3. Integrate the shader code into the Darknet CNN framework, which is used to build the YOLOv3 model.====
 +
Integrating the shader code into the Darknet CNN framework involves modifying the existing codebase to support the OpenGLES API calls. The modified code allows the selected layers to execute on the GPU using the optimized shader code. The goal is efficient execution of the selected layers on the GPU, resulting in faster object detection with the YOLOv3 model.
 +
 
 +
 
 +
==== 4. Compile and build the modified Darknet code with the integrated OpenGLES shaders ====
 +
 
 +
1. Installing [https://pjreddie.com/darknet/install/ Dependencies] such as CUDA, OpenCV, etc. </br>
 +
2. Modifying and building the Darknet code, which involves adding code to existing Darknet files or creating new files. </br>
 +
3. Test and deploy the modified code. </br>
 +
 
 +
==== 5. Test the performance of the modified YOLOv3 model with and without GPU acceleration to measure the speed-up achieved by the GPU acceleration.====
 +
 
 +
 
 +
==Timeline==
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 54: Line 225:
 
| Apr 4 || Application Deadline||
 
| Apr 4 || Application Deadline||
 
* Submitting Proposal to the mentors
 
* Submitting Proposal to the mentors
* Building the concept of Convolution Neural network
 
* Understanding darknet interface
 
 
|
 
|
 
|-
 
|-
| Apr 4 - May 4 || Selection Phase ||
+
| May 4 || Selection Phase ||
* I would be catching up with the community, getting familiar with work culture.
+
* Proposal accepted or rejected
* Familiarize  with the X15/AI board, the OpenGLES GPU, and the Darknet CNN framework.
 
 
|-
 
|-
| May 4 - May 10 || GSoC Acceptance ||  
+
 
* Community Bonding and discussing implementation details with mentors.
+
| May 4 - May 10 || Community Bonding ||
* Getting all doubts cleared regarding the project.
+
* Discuss implementation idea with mentors.
* Getting familiar with the work culture.
+
* Discuss the Scope of the project
 
|-
 
|-
| May 10 - May 31 ||College Exams ||  
+
 
* There are college exams during this period, so I will focus on exams.
+
| May 10 - June 5 || College Exams ||
 +
* Focus on College Exams
 
|-
 
|-
| June 1 - June 13 || Milestone #1 ||  
+
 
 +
| June 5|| Milestone #1 ||  
 
* Introductory YouTube video
 
* Introductory YouTube video
 
* Develop Conceptual knowledge
 
* Develop Conceptual knowledge
* Identify layers that can be accelerated using OpenGLES and start learning about bench-marking.
+
* Research and familiarize with Vulkan.
 +
* Implementing and benchmarking darknet on the host.
 +
* Research TIDL implementation.
 +
 
 
|-
 
|-
| June 14th - June 25th  || Coding Starts ||
+
 
* Implementing the code base and getting thorough with it.
+
|June 12||Milestone #2 ||  
* Optimising the shader code to improve performance.
+
* Learning Vulkan APIs Implementation.
* Improving the implementation efficiency.  
+
* Porting Darknet to the Target Platform (BeagleBoard)
 +
* Research TIDL APIs
 +
* Edge AI
 +
|-
 +
 
 +
| June 19 || Milestone #3 ||
 +
* Sharing the cross-compiled darknet folder to the BBAI-64.
 +
* Benchmarking and verifying the Implementation.
 
|-
 
|-
| July 26th - August 5th|| Milestone #2 ||  
+
 
* Benchmarking and verifying the Implementation.  
+
| July 10 || Milestone #4, Midterm Evaluation ||
* Integrating the OpenGLES accelerated layers in the darknet CNN Framework
+
* Provide the Mid-Term report to the Mentor.
* Verifying and testing the obtained results
 
 
|-
 
|-
| August 6th - August 15th|| Phase 1 Submission ||  
+
 
* Starting with the documentation and submit the work product to the mentors
+
| July 17th|| Milestone #5 ||  
 +
* Work on the feedback received from the mentors.
 +
* Improving the overall program for better output.
 
|-
 
|-
| August 15th - August 30th|| ||  
+
 
* Present the project to the mentors and receive feedback.
+
| July 24th|| Milestone #6||  
* Work on the Feedback received and make necessary changes.  
+
* Implementing the final layer.
 +
* Working on the darknet framework.
 
|-
 
|-
| August 30th - Sept 15|| Final Submission || 
+
 
* Completing the documentation and summarise the whole project.
+
| July 31st|| Milestone #7 ||  
* Submit the Final Work and Final Mentor Evaluation
+
* Running and Testing on the BeagleBoard X15/AI.
 +
* Work on the feedback received from the mentor.
 
|-
 
|-
|}
 
  
 +
| August 14th || Final Submission  ||
 +
* Completing the final documentation.
 +
* Submit final work product and final mentor evaluation.
 +
* Completion of GSoC
 +
|-
  
 +
|}
  
===Experience and approach===
+
=Experience and approach=
 
This project requires knowledge in Neural Networks, convolution, C/C++, Linux kernel and OpenGLES.  
 
This project requires knowledge in Neural Networks, convolution, C/C++, Linux kernel and OpenGLES.  
 
* I have previously worked on the [https://github.com/Pratham-Bot/GPGPU-with-GLES/tree/main GPGPU-WITH-GLES] project. Hence, I have a good understanding of OpenGLES APIs, shaders and the Linux kernel.
 
* I have previously worked on the [https://github.com/Pratham-Bot/GPGPU-with-GLES/tree/main GPGPU-WITH-GLES] project. Hence, I have a good understanding of OpenGLES APIs, shaders and the Linux kernel.
* I am well-versed in the different types of GPU-capable shaders and I am aware of which of them would be suitable for this project.
+
* Vulkan requires advanced coding knowledge, but since I am already familiar with OpenGLES and its shading language, it should be easy for me to learn Vulkan.
 +
* I am well-versed in the different types of GPU-capable shaders and I am aware of which of them would be suitable for this project.
 +
* I have also implemented operations such as matrix multiplication and matrix transpose.
 
* I have been exploring Neural Networks and Convolutions and have gained sufficient knowledge to start the implementation.  
 
* I have been exploring Neural Networks and Convolutions and have gained sufficient knowledge to start the implementation.  
 
* I also have a BeagleBone (PocketBeagle) and have tried running the Darknet framework on it.
 
* I also have a BeagleBone (PocketBeagle) and have tried running the Darknet framework on it.
Line 109: Line 299:
 
* I will keep contributing to the project after GSoC and will be interacting with the community often.
 
* I will keep contributing to the project after GSoC and will be interacting with the community often.
  
===Contingency===
+
=Contingency=
 
If I run into any difficulties, I will rely on the following:
 
If I run into any difficulties, I will rely on the following:
 
* I have a list of resources available online, so if I get stuck I will refer to those resources.
 
* I have a list of resources available online, so if I get stuck I will refer to those resources.
 
* I will use Beagle Slack to communicate with other mentors.
 
* I will use Beagle Slack to communicate with other mentors.
  
===Benefit===
+
=Benefit=
 
* The Performance of the YOLOv3 model is improved which will lead to better object detection.
 
* The Performance of the YOLOv3 model is improved which will lead to better object detection.
 
* Many layers can be accelerated at a time hence the efficiency of the model is improved.
 
* Many layers can be accelerated at a time hence the efficiency of the model is improved.
 
* Memory Usage is reduced by loading the computations on GPU as discussed [https://stackoverflow.com/questions/13303219/reducing-ram-usage-with-regard-to-textures here].
 
* Memory Usage is reduced by loading the computations on GPU as discussed [https://stackoverflow.com/questions/13303219/reducing-ram-usage-with-regard-to-textures here].

Latest revision as of 07:16, 18 June 2023

Proposal for OpenGLES acceleration for DL

Status

This project is currently just a proposal.

Proposal

  • Completed All the requirements listed on the ideas page.
  • The PR request for cross-compilation task.

About you

About your project

Project name: OpenGLES acceleration for DL

Description

Overview

Deep learning is a subset of machine learning that uses neural networks with multiple layers. These networks consist of layers of interconnected nodes, each building upon the previous layer to refine and optimise the prediction.

The main goal of the project is to accelerate as many layer types as possible using GPU APIs such as OpenGLES and Vulkan, with Darknet as the deep learning framework.

Shaders

Shaders are user-defined programs that run on the GPU of the board. Using shaders for computation can result in significant speedup, as GPUs are designed to process large amounts of data in parallel. Of the various shader types available, I will be using Compute Shaders. They can perform parallel computations, such as matrix multiplication and convolution, which are common in deep learning applications. Compute shaders are written in GLSL and are executed on the GPU using the glDispatchCompute function, which is available in OpenGL ES 3.1 and later.
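
As a small illustration of how a dispatch is sized: glDispatchCompute takes the number of work groups in x, y and z, while the number of invocations per group is declared inside the shader with a layout(local_size_x = ...) qualifier. The helper below is a minimal sketch of the ceiling division I expect to need when the data size is not a multiple of the local size. The name work_group_count is hypothetical and is not part of any OpenGLES header.

```c
/* Hypothetical helper (not an OpenGL ES function): number of work groups
 * needed so that groups * local_size covers at least `items` invocations.
 * Returns 0 when local_size is 0 to avoid dividing by zero. */
unsigned int work_group_count(unsigned int items, unsigned int local_size)
{
    if (local_size == 0u) {
        return 0u;
    }
    /* Ceiling division written so that it cannot overflow. */
    return items / local_size + (items % local_size != 0u ? 1u : 0u);
}
```

For a 416 x 416 feature map and a 16 x 16 local size, the result would be passed as the x and y group counts of the dispatch. The shader itself would still need a bounds check for the invocations that fall outside the data.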


Darknet

Darknet is an open source neural network framework written in C and CUDA. It is fast, compatible, easy to install, and supports CPU and GPU computation. Darknet is used in the project to implement the YOLO object detection and recognition model.

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm that uses a single neural network to detect objects. The YOLO model consists of multiple convolutional layers that extract features from the input image and several fully connected layers that produce the output of the model. These layers have many parameters that need to be optimised during training to achieve high accuracy in object detection and recognition.

In this project, Darknet is used as the deep learning framework to implement the YOLO model and optimise its performance.


YOLO Pipeline

Of the various YOLO versions (YOLO, YOLOv2, YOLOv3, etc.), I will be adapting YOLOv3 in this project. YOLOv3 is extremely fast and accurate. In mAP measured at 0.5 IOU, YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model.
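
For readers unfamiliar with the IOU threshold mentioned above, intersection over union is the overlap area of two boxes divided by the area of their union, and a detection counts as correct at 0.5 IOU when this ratio is at least 0.5. The following is a minimal sketch of the calculation, assuming boxes are stored as centre coordinates plus width and height. The type box_t and the function names are hypothetical and are not copied from Darknet.

```c
/* Hypothetical box type for illustration: centre (x, y), width w, height h. */
typedef struct {
    float x, y, w, h;
} box_t;

/* Length of the overlap of two 1-D intervals given by centre and width.
 * A non-positive result means the intervals do not overlap. */
float overlap_1d(float c1, float w1, float c2, float w2)
{
    float left1 = c1 - w1 / 2.0f, left2 = c2 - w2 / 2.0f;
    float right1 = c1 + w1 / 2.0f, right2 = c2 + w2 / 2.0f;
    float left = left1 > left2 ? left1 : left2;
    float right = right1 < right2 ? right1 : right2;
    return right - left;
}

/* Intersection over union of two boxes, in the range [0, 1]. */
float box_iou_sketch(box_t a, box_t b)
{
    float w = overlap_1d(a.x, a.w, b.x, b.w);
    float h = overlap_1d(a.y, a.h, b.y, b.h);
    if (w <= 0.0f || h <= 0.0f) {
        return 0.0f;                      /* boxes do not intersect */
    }
    float inter = w * h;
    float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}
```

A check like this could also be useful for comparing the boxes produced by the accelerated model against those of the original CPU model.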

The YOLOv3 model consists of various layers such as convolution layers, route layers, up-sampling layers, region layers and maxpool layers. To accelerate the YOLOv3 model, we will run the computations of these layers on the OpenGLES-enabled GPU of the target hardware platform. The GPU can perform the computations required by certain layers in parallel, which can greatly reduce the processing time.

Vulkan

Vulkan can be used for general-purpose computing, and provides features such as compute shaders, which can be used to perform complex computations in parallel on GPUs. The project will involve implementing and optimizing compute kernels for various layers of neural networks using Vulkan compute shaders. These kernels will include operations such as convolution, pooling and more. The Vulkan API will be used to manage resources such as buffers and images, as well as to schedule compute shader execution on the GPU.

Unlike OpenGLES, Vulkan requires the application to describe explicitly what it intends to do, which can lead to better performance and less surprising driver behaviour. Vulkan is a newer API that provides more control and flexibility. It is designed to take advantage of modern GPU hardware and can provide better performance than OpenGLES in some cases.

One of the key features of Vulkan is that the compute shader is completely separated from the graphics part of the pipeline.

Vulkan pipeline block diagram.jpg

With the compute shader stage being detached from the graphics pipeline we'll be able to use it anywhere.

  • Data type Extension
    • By default, the 32 bit floating precision is used for both training and inferencing, which are basically just running a computational graph:
      • Training runs a forward pass, and often times a backward pass to propagate back the gradient
      • Inferencing is just about doing the forward pass
    • But both can be done in lower precision types for faster compute time and reduced data storage.
      • The following extensions are available in Vulkan:
        • VK_KHR_shader_float16_int8
        • VK_KHR_8bit_storage / VK_KHR_16bit_storage
      • 8 bit integers data types are used for quantized Neural Nets
      • FP16 data types can be used for faster math with gradient rescaling in training
  • Improved Compute Shader
    • New extensions are devised to improve efficiency
      • VK_KHR_workgroup_memory_explicit_layout
      • Allow more efficient data loading into shared memory for further use with efficient matrix multiplication operations.
      • VK_EXT_ML_primitives: Exposes basic primitives used in the main stream Neural Nets as optimized building blocks
  • Extension Available:
    • VK_NV_cooperative_matrix
    • Accelerates large, low-precision matrix multiplies
    • Exposes high throughput matrix/vector multiplication units.
    • Typically used by convolution / matmul layers in fp16 format.
    • Core compute function for deep learning
    • The following code snippet illustrates how you might employ the extension.
// This code performs a matrix multiplication operation
// using cooperative matrices loaded from two input matrices A and B.

for (uint chunkK = 0; chunkK < K; chunkK += TILE_K) {
    fcoopmatNV<16, gl_ScopeSubgroup, lM, lK> matA[C_ROWS];
    [[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
        uint gi = TILE_M * tileID.y + lM * i;
        uint gk = chunkK;
        coopMatLoadNV(matA[i], inputA.x, strideA * gi + gk, strideA, false);
    }
    fcoopmatNV<16, gl_ScopeSubgroup, lK, lN> matB;
    [[unroll]] for (uint j = 0; j < C_COLS; ++j) {
        uint gj = TILE_N * tileID.x + lN * j;
        uint gk = chunkK;
        coopMatLoadNV(matB, inputB.x, strideB * gk + gj, strideB, false);
        [[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
            result[i][j] = coopMatMulAddNV(matA[i], matB, result[i][j]);
        }
    }
}
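
To make the reduced-precision point from the data type bullets more concrete, here is a minimal CPU-side sketch of symmetric 8-bit quantization, where a float is mapped to an integer in [-127, 127] through a single scale factor. This is only one common scheme and an assumption on my part, not necessarily the one the project will adopt. All function names are hypothetical.

```c
#include <stdint.h>

/* Scale that maps the largest magnitude in x[0..n-1] to 127.
 * Returns 1.0f for an all-zero (or empty) input so callers never divide by zero. */
float int8_scale(const float *x, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; ++i) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) {
            max_abs = a;
        }
    }
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

/* Quantize one value: divide by the scale, round half away from zero,
 * then clamp to the symmetric int8 range. */
int8_t quantize_int8(float x, float scale)
{
    float q = x / scale;
    q += (q >= 0.0f) ? 0.5f : -0.5f;
    if (q > 127.0f) {
        q = 127.0f;
    }
    if (q < -127.0f) {
        q = -127.0f;
    }
    return (int8_t)q;   /* truncation after the +/-0.5 offset completes the rounding */
}

/* Map a quantized value back to an approximate float. */
float dequantize_int8(int8_t q, float scale)
{
    return (float)q * scale;
}
```

The round trip loses at most about half a scale step per value, which is the accuracy cost traded for smaller storage and faster arithmetic.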



Benefits of Vulkan Compute Shaders
  • Highly parallelized computation: Compute shaders in Vulkan are designed to execute a large number of parallel computations on GPU hardware, which can provide significant performance benefits over CPU-based computations.
  • Flexibility: We can write custom shaders to perform a wide range of compute tasks, including machine learning inference. This allows for greater flexibility in optimising performance and achieving better accuracy.
  • Memory access: Compute shaders can access memory resources that are shared with graphics shaders, providing more efficient memory utilization and reducing the need for data transfers between the CPU and GPU.
  • Synchronization: Vulkan provides synchronization mechanisms for coordinating access to shared memory resources and ensuring that compute shaders execute in the correct order. This will allow us to take advantage of the parallelism of compute shaders while avoiding race conditions and other synchronization issues.

Implementation Details

1. Identifying the layer that can benefit from GPU acceleration

Convolution layer:

The convolution layer is a fundamental building block in deep neural networks. The convolution operation involves sliding a filter or kernel over an input image, computing dot products between the filter and local patches of the image to produce a feature map. The convolution layer is used extensively in the backbone network to extract high-level features from the input image. By adapting the convolution layer for acceleration using OpenGLES shaders, we can significantly speed up the computation time and improve the overall performance of the YOLOv3 model on resource-constrained devices.
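
The sliding-window description above can be sketched as a plain C reference routine, which I would expect to use as a CPU baseline when checking the output of a shader. It is a minimal single-channel version with stride 1 and zero padding, so the output has the same size as the input. The function name conv2d_same is hypothetical. Darknet's own CPU path is, as far as I understand, organised differently (im2col followed by a matrix multiply), so this shows the operation and is not Darknet's code.

```c
/* Hypothetical reference routine: single-channel 2-D convolution as used in
 * CNNs (no kernel flip), stride 1, zero padding of k/2 on each side.
 * in and out are h*w row-major arrays, kernel is k*k row-major, k odd. */
void conv2d_same(const float *in, int h, int w,
                 const float *kernel, int k, float *out)
{
    int pad = k / 2;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int ky = 0; ky < k; ++ky) {
                for (int kx = 0; kx < k; ++kx) {
                    int iy = y + ky - pad;
                    int ix = x + kx - pad;
                    /* Samples outside the image count as zero. */
                    if (iy < 0 || iy >= h || ix < 0 || ix >= w) {
                        continue;
                    }
                    sum += in[iy * w + ix] * kernel[ky * k + kx];
                }
            }
            out[y * w + x] = sum;
        }
    }
}
```

Each output element depends only on the input and the kernel, never on another output element, which is why the two outer loops map naturally onto one GPU invocation per output pixel.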

Route layer:

The route layer is another candidate for acceleration in the YOLOv3 pipeline using OpenGLES. It concatenates two or more feature maps from different layers along the channel dimension, which lets the network combine features learned at different depths and extract more complex features.
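
Channel-wise concatenation can be sketched on the CPU as follows. This is an illustrative sketch under stated assumptions: the `FeatureMap` struct and `route_concat` function are hypothetical names, the data is assumed to be in CHW order with a batch size of 1, and in that layout concatenating along the channel dimension reduces to appending the buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for one feature map in CHW order
// (all of channel 0, then all of channel 1, ...), batch size 1.
struct FeatureMap {
    int channels;
    int height;
    int width;
    std::vector<float> data;  // channels * height * width values
};

// Concatenate feature maps along the channel dimension.
// All inputs must share the same height and width.
FeatureMap route_concat(const std::vector<FeatureMap>& inputs) {
    assert(!inputs.empty());
    FeatureMap out{0, inputs[0].height, inputs[0].width, {}};
    for (const FeatureMap& in : inputs) {
        assert(in.height == out.height && in.width == out.width);
        assert(in.data.size() ==
               static_cast<std::size_t>(in.channels) * in.height * in.width);
        out.channels += in.channels;
        // In CHW order the channels of each input are already contiguous,
        // so appending the whole buffer performs the concatenation.
        out.data.insert(out.data.end(), in.data.begin(), in.data.end());
    }
    return out;
}
```

Because the operation is a pure copy, a GPU version would mainly save time by keeping the data on the device between layers rather than by speeding up arithmetic.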

Up-Sampling layer:

Upsampling layers can be used in the YOLO pipeline to increase the resolution of the feature maps before passing them to subsequent layers. Upsampling can be implemented using various techniques such as bilinear or nearest-neighbor interpolation, or transposed convolution.
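
Of the techniques listed above, nearest-neighbor interpolation is the simplest to express. The sketch below is a hypothetical CPU reference (the name `upsample_nearest`, the CHW layout and the integer `stride` parameter are assumptions): every output pixel copies the input pixel it maps back to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: nearest-neighbor upsampling of a CHW tensor
// by an integer factor `stride` in both spatial dimensions.
std::vector<float> upsample_nearest(const std::vector<float>& input,
                                    int channels, int height, int width,
                                    int stride) {
    assert(channels > 0 && height > 0 && width > 0 && stride > 0);
    assert(input.size() == static_cast<std::size_t>(channels) * height * width);

    const int out_h = height * stride;
    const int out_w = width * stride;
    std::vector<float> output(static_cast<std::size_t>(channels) * out_h * out_w);

    for (int c = 0; c < channels; ++c) {
        for (int y = 0; y < out_h; ++y) {
            for (int x = 0; x < out_w; ++x) {
                // Integer division maps each output pixel to its source pixel.
                const std::size_t src =
                    (static_cast<std::size_t>(c) * height + y / stride) * width + x / stride;
                const std::size_t dst =
                    (static_cast<std::size_t>(c) * out_h + y) * out_w + x;
                output[dst] = input[src];
            }
        }
    }
    return output;
}
```

Each output element depends on exactly one input element, so the loop body maps directly onto one shader invocation per output pixel.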

Region layer:

The region layer is the detection layer of the model: it predicts the object bounding boxes and the associated class probabilities. In Darknet's YOLOv3 configuration files this layer type is named [yolo]; [region] is the YOLOv2 equivalent.
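
To make the bounding-box prediction concrete, the sketch below decodes one raw prediction into a box using the formulation from the YOLOv3 paper. It is an illustration under assumptions, not code from the project: `Box`, `logistic` and `decode_box` are hypothetical names, anchors are assumed to be given in pixels of the network input, and the returned coordinates are relative to the image (0 to 1).

```cpp
#include <cassert>
#include <cmath>

// Box center and size, relative to the image (0..1).
struct Box {
    float x;
    float y;
    float w;
    float h;
};

inline float logistic(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Hypothetical decoder for one prediction at grid cell (col, row).
// tx, ty, tw, th are the raw network outputs for that cell and anchor.
Box decode_box(float tx, float ty, float tw, float th,
               int col, int row, int grid_w, int grid_h,
               float anchor_w, float anchor_h, int net_w, int net_h) {
    assert(grid_w > 0 && grid_h > 0 && net_w > 0 && net_h > 0);
    assert(col >= 0 && col < grid_w && row >= 0 && row < grid_h);

    Box b;
    // The logistic function keeps the center inside its own grid cell.
    b.x = (col + logistic(tx)) / grid_w;
    b.y = (row + logistic(ty)) / grid_h;
    // Width and height scale the anchor box exponentially.
    b.w = std::exp(tw) * anchor_w / net_w;
    b.h = std::exp(th) * anchor_h / net_h;
    return b;
}
```

The arithmetic per prediction is small, so whether this layer is worth moving to the GPU is something the benchmarks would have to show.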

Maxpool layer:

The maxpool layer downsamples the feature maps in the YOLO pipeline. It keeps the maximum value from each local region of the input feature map, which retains the strongest activations while reducing the spatial resolution and therefore the computational cost of subsequent layers.
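
A minimal CPU sketch of max pooling is shown below. The window size of 2x2 with a stride of 2, the single-channel layout, the even input dimensions and the name `maxpool2x2` are assumptions chosen to keep the example short; other window sizes and strides follow the same pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: 2x2 max pooling with stride 2 on one channel.
// Height and width are assumed to be even, so no padding is needed.
std::vector<float> maxpool2x2(const std::vector<float>& input,
                              int height, int width) {
    assert(height > 0 && width > 0);
    assert(height % 2 == 0 && width % 2 == 0);
    assert(input.size() == static_cast<std::size_t>(height) * width);

    const int out_h = height / 2;
    const int out_w = width / 2;
    std::vector<float> output(static_cast<std::size_t>(out_h) * out_w);

    for (int oy = 0; oy < out_h; ++oy) {
        for (int ox = 0; ox < out_w; ++ox) {
            const int y = 2 * oy;
            const int x = 2 * ox;
            // Keep the largest of the four values in the window.
            const float top = std::max(input[y * width + x],
                                       input[y * width + x + 1]);
            const float bottom = std::max(input[(y + 1) * width + x],
                                          input[(y + 1) * width + x + 1]);
            output[oy * out_w + ox] = std::max(top, bottom);
        }
    }
    return output;
}
```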


2. Writing the shader code using the OpenGLES and Vulkan API to perform the computations required by the selected layers on the GPU.

The shader code will need to be optimized for parallel processing. Here is an example of shader code for a convolution operation using the OpenGLES API:

// Fragment shader (GLSL ES 1.00): 3x3 convolution over a single texture.
precision mediump float;      // GLSL ES fragment shaders have no default float precision

uniform float uKernel[9];     // 3x3 filter weights, row by row
uniform sampler2D uSampler;   // input feature map stored as a texture
uniform vec2 uTextureSize;    // input width and height in texels

varying vec2 vTexCoord;       // normalized coordinate of the current output texel

void main(void)
{
    vec4 sum = vec4(0.0);
    vec2 stepSize = 1.0 / uTextureSize;   // size of one texel in normalized coordinates

    // Top row of the 3x3 neighbourhood
    sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y - stepSize.y))
            * uKernel[0];
    sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y - stepSize.y))
            * uKernel[1];
    sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y - stepSize.y))
            * uKernel[2];

    // Middle row
    sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y))
            * uKernel[3];
    sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y))
            * uKernel[4];
    sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y))
            * uKernel[5];

    // Bottom row
    sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y + stepSize.y))
            * uKernel[6];
    sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y + stepSize.y))
            * uKernel[7];
    sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y + stepSize.y))
            * uKernel[8];

    sum.a = 1.0;          // force the output to be opaque

    gl_FragColor = sum;   // write the convolved value for this texel
}

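With the Vulkan API, the compiled compute shader is first wrapped in a shader module:

// shaderCode is assumed to hold the compiled SPIR-V binary of the compute shader.
VkShaderModuleCreateInfo createInfo = {};
createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
createInfo.codeSize = shaderCode.size();   // size in bytes; must be a multiple of 4
createInfo.pCode = reinterpret_cast<const uint32_t*>(shaderCode.data());

// Create the module; it is later referenced when the compute pipeline is built.
VkShaderModule shaderModule;
if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
    throw std::runtime_error("Failed to create shader module!");
}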
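
Before the shader bytes are handed to `vkCreateShaderModule`, a cheap sanity check can catch a truncated or wrong file early. The sketch below assumes `shaderCode` is a `std::vector<char>` read from a `.spv` file in host byte order; `looks_like_spirv` and `spirv_word_count` are hypothetical helpers, not part of Vulkan. It relies on two facts: a SPIR-V module starts with the magic number 0x07230203, and Vulkan requires `codeSize` to be a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// First 32-bit word of every SPIR-V module.
constexpr std::uint32_t kSpirvMagic = 0x07230203u;

// Hypothetical helper: true if the buffer could be a SPIR-V binary
// in host byte order (non-empty, whole number of words, correct magic).
bool looks_like_spirv(const std::vector<char>& shaderCode) {
    if (shaderCode.size() < sizeof(std::uint32_t)) return false;
    if (shaderCode.size() % sizeof(std::uint32_t) != 0) return false;
    std::uint32_t first = 0;
    // memcpy avoids alignment and aliasing problems with the char buffer.
    std::memcpy(&first, shaderCode.data(), sizeof(first));
    return first == kSpirvMagic;
}

// Hypothetical helper: number of 32-bit words in a valid module.
std::size_t spirv_word_count(const std::vector<char>& shaderCode) {
    assert(looks_like_spirv(shaderCode));
    return shaderCode.size() / sizeof(std::uint32_t);
}
```

Calling `looks_like_spirv(shaderCode)` just before filling in `VkShaderModuleCreateInfo` turns a confusing driver error into an early, readable failure.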


3. Integrate the shader code into the Darknet CNN framework, which is used to build the YOLOv3 model.

Integrating the shader code into the Darknet CNN framework involves modifying the existing codebase to support the OpenGLES API calls, so that the selected layers execute on the GPU using the optimized shader code. The goal is efficient execution of those layers on the GPU, resulting in faster object detection with the YOLOv3 model.
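
One possible shape for this integration is a per-layer dispatch that uses the GPU path when a shader exists and falls back to the CPU path otherwise. The sketch below is an assumption about how this could be organized, not Darknet's actual structure: `Tensor`, `LayerBackend` and `run_layer` are hypothetical names, and the GPU function stands in for code that would upload data, run the shader and read the result back.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using Tensor = std::vector<float>;
using ForwardFn = std::function<Tensor(const Tensor&)>;

// Hypothetical description of one layer with two possible implementations.
struct LayerBackend {
    std::string name;
    ForwardFn forward_cpu;  // existing CPU path, always present
    ForwardFn forward_gpu;  // shader path, empty if the layer is not accelerated
};

// Run the layer on the GPU when possible, otherwise on the CPU.
// If ran_on_gpu is non-null it reports which path was taken.
Tensor run_layer(const LayerBackend& layer, const Tensor& input,
                 bool gpu_available, bool* ran_on_gpu = nullptr) {
    assert(static_cast<bool>(layer.forward_cpu));
    const bool use_gpu = gpu_available && static_cast<bool>(layer.forward_gpu);
    if (ran_on_gpu != nullptr) *ran_on_gpu = use_gpu;
    return use_gpu ? layer.forward_gpu(input) : layer.forward_cpu(input);
}
```

Keeping the CPU path as a fallback means layers can be moved to the GPU one at a time, and each accelerated layer can be checked against its CPU result.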


4. Compile and build the modified Darknet code with the integrated OpenGLES shaders

1. Install dependencies such as CUDA, OpenCV, etc.
2. Modify and build the Darknet code, either by adding code to the existing Darknet files or by creating new files.
3. Test and deploy the modified code.

5. Test the performance of the modified YOLOv3 model with and without GPU acceleration to measure the speed-up achieved by the GPU acceleration.
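
The speed-up measurement can be sketched with a small timing harness. This is an illustrative sketch: `average_ms` and `speedup` are hypothetical helpers, and the callable passed in is assumed to run one complete forward pass. For the GPU case the callable should wait for the GPU to finish (for example with `glFinish` or `vkQueueWaitIdle`) before returning, because GPU commands are otherwise queued asynchronously and the timing would be too optimistic.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time in milliseconds of one call to `work`.
// One untimed warm-up call is made first (shader compilation, caches).
double average_ms(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    work();  // warm-up, not timed

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / iterations;
}

// Speed-up factor of the accelerated run relative to the baseline.
double speedup(double baseline_ms, double accelerated_ms) {
    assert(accelerated_ms > 0.0);
    return baseline_ms / accelerated_ms;
}
```

Running `average_ms` once with the CPU-only build and once with the accelerated build, on the same input image, gives the two numbers that `speedup` compares.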

Timeline

Date Status Details
Apr 4 Application Deadline
  • Submitting Proposal to the mentors
May 4 Selection Phase
  • Proposal accepted or rejected
May 4 - May 10 Community Bonding
  • Discuss implementation idea with mentors.
  • Discuss the Scope of the project
May 10 - June 5 College Exams
  • Focus on College Exams
June 5 Milestone #1
  • Introductory YouTube video
  • Develop Conceptual knowledge
  • Research Vulkan and become familiar with it.
  • Implement and benchmark Darknet on the host.
  • Research the TIDL implementation.
June 12 Milestone #2
  • Learn how to implement with the Vulkan APIs.
  • Porting Darknet to the Target Platform (BeagleBoard)
  • Research TIDL APIs
  • Edge AI
June 19 Milestone #3
  • Transfer the cross-compiled Darknet folder to the BBAI-64.
  • Benchmarking and verifying the Implementation.
July 10 Milestone #4, Midterm Evaluation
  • Provide the Mid-Term report to the Mentor.
July 17th Milestone #5
  • Work on the feedback received from the mentors.
  • Improving the overall program for better output.
July 24th Milestone #6
  • Implementing the final layer.
  • Working on the darknet framework.
July 31st Milestone #7
  • Running and Testing on the BeagleBoard X15/AI.
  • Work on the feedback received from the mentor.
August 14th Final Submission
  • Completing the final documentation.
  • Submit final work product and final mentor evaluation.
  • Completion of GSoC

Experience and approach

This project requires knowledge of neural networks, convolution, C/C++, the Linux kernel and OpenGLES.

  • I have previously worked on the GPGPU-WITH-GLES project, so I have a good understanding of the OpenGLES APIs, shaders and the Linux kernel.
  • Vulkan demands more advanced coding knowledge, but since I am already familiar with OpenGLES and its shading language, it should be easy for me to learn.
  • I am well-versed in the different types of shaders that run on the GPU and know which of them are suitable for this project.
  • I have also implemented operations such as matrix multiplication and matrix transposition.
  • I have been exploring neural networks and convolutions and have gained sufficient knowledge to start the implementation.
  • I also own a BeagleBone (PocketBeagle) and have tried running the Darknet framework on it.
  • I am a passionate open source enthusiast and will do the work wholeheartedly. I am committed to GSoC and will do everything in my power to finish the project within the allotted time.
  • I will keep contributing to the project after GSoC and will be interacting with the community often.

Contingency

If I run into difficulties, I will rely on the following resources:

  • I have a list of resources available online, which I will refer to if I get stuck.
  • I will use the Beagle Slack to communicate with the mentors.

Benefit

  • The performance of the YOLOv3 model improves, which leads to faster object detection.
  • Many layers can be accelerated at a time, which improves the efficiency of the model.
  • Memory usage is reduced by offloading the computations to the GPU, as discussed here.