Difference between revisions of "BeagleBoard/GSoC/2023 Proposal/OpenGLES acceleration for DL"
Latest revision as of 07:16, 18 June 2023
Contents
- 1 Proposal for OpenGLES acceleration for DL
- 2 Status
- 3 Proposal
- 4 About you
- 5 About your project
- 5.1 Description
- 5.2 Implementation Details
- 5.2.1 1. Identifying the layer that can benefit from GPU acceleration
- 5.2.2 2. Writing the shader code using the OpenGLES and Vulkan API to perform the computations required by the selected layers on the GPU.
- 5.2.3 3. Integrate the shader code into the Darknet CNN framework, which is used to build the YOLOv3 model.
- 5.2.4 4. Compile and build the modified Darknet code with the integrated OpenGLES shaders
- 5.2.5 5. Test the performance of the modified YOLOv3 model with and without GPU acceleration to measure the speed-up achieved by the GPU acceleration.
- 5.3 Timeline
- 6 Experience and approach
- 7 Contingency
- 8 Benefit
Proposal for OpenGLES acceleration for DL
- Student: Pratham Deshmukh
- Code : darknet
- Mentors: Shreyas Atre
- Proposal: OpenGLES acceleration for DL
- Wiki : NA
- GSoC : Proposal Request
Status
This project is currently just a proposal.
Proposal
- Completed all the requirements listed on the ideas page.
- Submitted the pull request for the cross-compilation task.
About you
- IRC Nickname: Pratham
- Github: Pratham Deshmukh
- College: Veermata Jijabai Technological Institute
- Country: India
- Primary language: English, Hindi, Marathi
- Typical work hours: 9am to 5pm
- Previous GSoC participation: This is my first time participating in GSoC.
About your project
Project name: OpenGLES acceleration for DL
Description
Overview
Deep learning is a subset of machine learning that uses neural networks with multiple layers. Each layer consists of interconnected nodes and builds upon the previous layer to refine and optimise the prediction.
The main goal of the project is to accelerate as many layer types as possible on the GPU using APIs such as OpenGLES and Vulkan, with Darknet as the deep learning framework.
Shaders
Shaders are user-defined programs that run on the GPU of the board. Using shaders for computation can result in a significant speedup, as GPUs are designed to process large amounts of data in parallel. Of the various shader types that can run on the GPU, I will be using compute shaders. They can perform parallel computations such as matrix multiplication and convolution, which are common in deep learning applications. Compute shaders are written in GLSL and are launched on the GPU with the glDispatchCompute function, which is available from OpenGL ES 3.1 onwards.
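To make the parallelism concrete, here is a minimal CPU sketch in C of how a matrix multiplication decomposes into independent work items. The names `matmul_element` and `matmul_reference` are illustrative, not taken from Darknet or any GPU API. Each call to `matmul_element` corresponds to what a single compute shader invocation could compute, under the assumption of one invocation per output element.

```c
#include <stddef.h>

/* Work of one hypothetical shader invocation: a single element of C = A * B.
 * A is M x K and B is K x N, both stored row-major. */
static float matmul_element(const float *a, const float *b,
                            size_t k_dim, size_t n_dim,
                            size_t row, size_t col)
{
    float acc = 0.0f;
    for (size_t k = 0; k < k_dim; ++k)
        acc += a[row * k_dim + k] * b[k * n_dim + col];
    return acc;
}

/* CPU stand-in for dispatching an M x N grid of invocations.
 * No element depends on another, so a GPU could run them all in parallel. */
void matmul_reference(const float *a, const float *b, float *c,
                      size_t m_dim, size_t k_dim, size_t n_dim)
{
    for (size_t row = 0; row < m_dim; ++row)
        for (size_t col = 0; col < n_dim; ++col)
            c[row * n_dim + col] =
                matmul_element(a, b, k_dim, n_dim, row, col);
}
```

A reference like this is also useful later for checking shader output against known-good CPU results.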
Darknet
Darknet is an open source neural network framework written in C and CUDA. It is fast, compatible, easy to install, and supports CPU and GPU computation. Darknet is used in the project to implement the YOLO object detection and recognition model.
You Only Look Once (YOLO) is a real-time object detection algorithm that uses a single neural network to detect objects. The network consists of multiple convolutional layers that extract features from the input image. The original YOLO then used fully connected layers to produce the output, whereas YOLOv3 is fully convolutional. These layers have many parameters that are optimised during training to achieve high accuracy in object detection and recognition.
In this project, Darknet is used as the deep learning framework to implement the YOLO model and optimise its performance.
YOLO Pipeline
Out of the various YOLO pipelines (YOLO, YOLOv2, YOLOv3, etc.), I will be adapting YOLOv3 in this project. YOLOv3 is fast and accurate: in mAP measured at 0.5 IOU, YOLOv3 is on par with Focal Loss (RetinaNet) but about 4x faster. Moreover, the trade-off between speed and accuracy can be adjusted simply by changing the size of the model.
The YOLOv3 model consists of various layers such as convolution, route, up-sampling, region and maxpool layers. To accelerate the model, we will perform the computations of these layers on the OpenGLES-enabled GPU of the target hardware platform. The GPU can execute the computations required by certain layers in parallel, which can greatly reduce the processing time.
Vulkan
Vulkan can be used for general-purpose computing, and provides features such as compute shaders, which can be used to perform complex computations in parallel on GPUs. The project will involve implementing and optimizing compute kernels for various layers of neural networks using Vulkan compute shaders. These kernels will include operations such as convolution, pooling and more. The Vulkan API will be used to manage resources such as buffers and images, as well as to schedule compute shader execution on the GPU.
Unlike OpenGLES, Vulkan is an explicit API: the application describes up front what it intends to do, which can lead to better performance and less surprising driver behaviour compared to existing APIs like OpenGLES. Vulkan is a newer API that provides more control and flexibility. It is designed to take advantage of modern GPU hardware and can provide better performance than OpenGLES in some cases.
One of the key features of Vulkan is that the compute shader is completely separated from the graphics part of the pipeline.
With the compute shader stage detached from the graphics pipeline, we can dispatch compute work without setting up any graphics state.
- Data type extensions
  - By default, 32-bit floating-point precision is used for both training and inference, which both amount to running a computational graph:
    - Training runs a forward pass, and often a backward pass to propagate the gradient
    - Inference runs only the forward pass
  - Both can be done in lower-precision types for faster compute time and reduced data storage.
    - The following extensions are available in Vulkan:
      - VK_KHR_shader_float16_int8
      - VK_KHR_8bit_storage / VK_KHR_16bit_storage
    - 8-bit integer data types are used for quantized neural networks
    - FP16 data types can be used for faster math, with gradient rescaling in training
- Improved compute shaders
  - New extensions are devised to improve efficiency
    - VK_KHR_workgroup_memory_explicit_layout: allows more efficient data loading into shared memory for further use with efficient matrix multiplication operations.
    - VK_EXT_ML_primitives: exposes basic primitives used in mainstream neural networks as optimized building blocks
- Extension available:
  - VK_NV_cooperative_matrix
  - Accelerates large, low-precision matrix multiplies
  - Exposes high-throughput matrix/vector multiplication units.
  - Typically used by convolution / matmul layers in FP16 formats.
  - Core compute function for deep learning
  - The following code snippet illustrates how the extension might be employed.
    // This code performs a matrix multiplication operation
    // using cooperative matrices loaded from two input matrices A and B.

    for (uint chunkK = 0; chunkK < K; chunkK += TILE_K) {
        fcoopmatNV<16, gl_ScopeSubgroup, lM, lK> matA[C_ROWS];
        [[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
            uint gi = TILE_M * tileID.y + lM * i;
            uint gk = chunkK;
            coopMatLoadNV(matA[i], inputA.x, strideA * gi + gk, strideA, false);
        }
        fcoopmatNV<16, gl_ScopeSubgroup, lK, lN> matB;
        [[unroll]] for (uint j = 0; j < C_COLS; ++j) {
            uint gj = TILE_N * tileID.x + lN * j;
            uint gk = chunkK;
            coopMatLoadNV(matB, inputB.x, strideB * gk + gj, strideB, false);
            [[unroll]] for (uint i = 0; i < C_ROWS; ++i) {
                result[i][j] = coopMatMulAddNV(matA[i], matB, result[i][j]);
            }
        }
    }
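The chunked accumulation in the snippet above can be mirrored on the CPU. The following C sketch is only an illustration of the tiling idea: `TILE`, `min_size` and `matmul_tiled` are hypothetical names, and the tile size of 2 is arbitrary, whereas real cooperative-matrix tile sizes are dictated by the hardware. Each output tile accumulates partial products over successive chunks of the K dimension, which is the role `coopMatMulAddNV` plays in the shader.

```c
#include <stddef.h>

/* Hypothetical tile edge for illustration only. */
#define TILE 2

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C (M x N) = A (M x K) * B (K x N), row-major, computed tile by tile.
 * Dimensions do not have to be multiples of TILE. */
void matmul_tiled(const float *a, const float *b, float *c,
                  size_t m_dim, size_t k_dim, size_t n_dim)
{
    for (size_t i = 0; i < m_dim * n_dim; ++i)
        c[i] = 0.0f;

    for (size_t i0 = 0; i0 < m_dim; i0 += TILE)
        for (size_t j0 = 0; j0 < n_dim; j0 += TILE)
            for (size_t k0 = 0; k0 < k_dim; k0 += TILE) {
                size_t i1 = min_size(i0 + TILE, m_dim);
                size_t j1 = min_size(j0 + TILE, n_dim);
                size_t k1 = min_size(k0 + TILE, k_dim);
                /* Multiply one tile of A by one tile of B and add the
                 * partial result into the output tile. */
                for (size_t i = i0; i < i1; ++i)
                    for (size_t j = j0; j < j1; ++j) {
                        float acc = c[i * n_dim + j];
                        for (size_t k = k0; k < k1; ++k)
                            acc += a[i * k_dim + k] * b[k * n_dim + j];
                        c[i * n_dim + j] = acc;
                    }
            }
}
```

On a GPU the benefit comes from keeping each tile in fast local storage while it is reused. On the CPU this version mainly serves as a correctness reference.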
Benefits of Vulkan Compute Shaders
- Highly parallelized computation: Compute shaders in Vulkan are designed to execute a large number of parallel computations on GPU hardware, which can provide significant performance benefits over CPU-based computations.
- Flexibility: We can write custom shaders to perform a wide range of compute tasks, including machine learning inference. This allows for greater flexibility in optimising performance and achieving better accuracy.
- Memory access: Compute shaders can access memory resources that are shared with graphics shaders, providing more efficient memory utilization and reducing the need for data transfers between the CPU and GPU.
- Synchronization: Vulkan provides synchronization mechanisms for coordinating access to shared memory resources and ensuring that compute shaders execute in the correct order. This will allow us to take advantage of the parallelism of compute shaders while avoiding race conditions and other synchronization issues.
Implementation Details
1. Identifying the layer that can benefit from GPU acceleration
Convolution layer:
The convolution layer is a fundamental building block in deep neural networks. The convolution operation slides a filter (kernel) over an input image and computes dot products between the filter and local patches of the image to produce a feature map. The convolution layer is used extensively in the backbone network to extract high-level features from the input image. By adapting the convolution layer for acceleration using OpenGLES shaders, we expect to significantly reduce the computation time and improve the overall performance of the YOLOv3 model on resource-constrained devices.
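As a reference for the sliding-window description above, here is a minimal single-channel convolution in C. It assumes stride 1 and no padding, and `conv2d_valid` is an illustrative name rather than a Darknet function. As in most deep learning frameworks, the kernel is applied without flipping (strictly speaking, a cross-correlation).

```c
#include <stddef.h>

/* Direct 2D convolution, single channel, stride 1, no padding.
 * in:     in_h x in_w, row-major
 * kernel: k_h x k_w, row-major (k_h <= in_h and k_w <= in_w assumed)
 * out:    (in_h - k_h + 1) x (in_w - k_w + 1), row-major */
void conv2d_valid(const float *in, size_t in_h, size_t in_w,
                  const float *kernel, size_t k_h, size_t k_w,
                  float *out)
{
    size_t out_h = in_h - k_h + 1;
    size_t out_w = in_w - k_w + 1;

    for (size_t oy = 0; oy < out_h; ++oy)
        for (size_t ox = 0; ox < out_w; ++ox) {
            /* Dot product between the kernel and one local patch. */
            float acc = 0.0f;
            for (size_t ky = 0; ky < k_h; ++ky)
                for (size_t kx = 0; kx < k_w; ++kx)
                    acc += in[(oy + ky) * in_w + (ox + kx)]
                         * kernel[ky * k_w + kx];
            out[oy * out_w + ox] = acc;
        }
}
```

Every output element is independent of the others, which is what makes this layer a good fit for one-invocation-per-output GPU execution.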
Route layer:
The route layer is another candidate for acceleration in the YOLOv3 pipeline using OpenGLES. It concatenates two or more feature maps from different layers along the channel dimension. By doing so, it enables the network to combine features learned at different layers and extract more complex features.
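A sketch of channel-wise concatenation in C follows. It assumes feature maps stored channel-first (all values of channel 0, then channel 1, and so on) with identical spatial size. The name `route_concat` and its parameters are hypothetical. Under that layout, concatenating along the channel dimension reduces to copying the inputs back to back.

```c
#include <stddef.h>
#include <string.h>

/* Concatenate n_inputs feature maps along the channel dimension.
 * inputs[i] holds channels[i] * plane floats, where plane = height * width
 * and is the same for every input. out must hold the sum of all inputs.
 * Returns the total number of output channels. */
size_t route_concat(const float *const *inputs, const size_t *channels,
                    size_t n_inputs, size_t plane, float *out)
{
    size_t offset = 0;
    size_t total_channels = 0;

    for (size_t i = 0; i < n_inputs; ++i) {
        size_t count = channels[i] * plane;
        memcpy(out + offset, inputs[i], count * sizeof(float));
        offset += count;
        total_channels += channels[i];
    }
    return total_channels;
}
```

Since the operation is a pure copy, a GPU version mainly pays off when the inputs already live in GPU memory and the copy avoids a round trip to the CPU.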
Up-Sampling layer:
Upsampling layers can be used in the YOLO pipeline to increase the resolution of the feature maps before passing them to subsequent layers. Upsampling can be implemented using various techniques such as bilinear or nearest-neighbor interpolation, or transposed convolution.
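Of the techniques listed, nearest-neighbour interpolation is the simplest to express. The C sketch below handles one channel and an integer scale factor. `upsample_nearest` is an illustrative name.

```c
#include <stddef.h>

/* Nearest-neighbour upsampling of one h x w channel by an integer factor.
 * out must hold (h * factor) * (w * factor) floats, row-major. */
void upsample_nearest(const float *in, size_t h, size_t w,
                      size_t factor, float *out)
{
    size_t out_h = h * factor;
    size_t out_w = w * factor;

    for (size_t oy = 0; oy < out_h; ++oy)
        for (size_t ox = 0; ox < out_w; ++ox)
            /* Each output pixel copies the input pixel it falls inside. */
            out[oy * out_w + ox] = in[(oy / factor) * w + (ox / factor)];
}
```

Each output pixel reads exactly one input pixel, so the operation maps directly onto one GPU invocation per output pixel.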
Region layer:
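The region layer is an important layer in the YOLOv3 model that is responsible for predicting the object bounding boxes and associated class probabilities. In Darknet's YOLOv3 configuration files this detection layer is named yolo; region is the name used for YOLOv2.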
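One small part of this layer's post-processing can be sketched in C: combining the objectness score of a box with its class probabilities and keeping the best class if it clears a threshold. The function `best_class` and its parameters are hypothetical and only illustrate the scoring step. Box decoding and non-maximum suppression are left out.

```c
#include <stddef.h>

/* Pick the class with the highest detection score for one predicted box.
 * The score of a class is objectness * class probability.
 * Returns the class index, or -1 if no class reaches the threshold.
 * If score_out is not NULL, it receives the winning score. */
int best_class(const float *class_probs, int n_classes,
               float objectness, float threshold, float *score_out)
{
    int best = -1;
    float best_score = threshold;

    for (int c = 0; c < n_classes; ++c) {
        float score = objectness * class_probs[c];
        if (score >= best_score) {
            best_score = score;
            best = c;
        }
    }
    if (best >= 0 && score_out != NULL)
        *score_out = best_score;
    return best;
}
```

This step is cheap compared to the convolutions, so it may be reasonable to leave it on the CPU and accelerate only the heavier layers.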
Maxpool layer:
The maxpool layer can be used in the YOLO pipeline to downsample the feature maps and reduce their spatial resolution. The maxpool layer can be used to extract the most important features from each local region of the input feature map and reduce its size, thus reducing the computational cost of subsequent layers.
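A minimal C sketch of max pooling over one channel is shown below. It assumes no padding, and `maxpool2d` is an illustrative name. With a window size of 2 and a stride of 2, each spatial dimension is halved.

```c
#include <stddef.h>

/* Max pooling over one h x w channel with a square window and no padding.
 * out must hold ((h - size) / stride + 1) * ((w - size) / stride + 1)
 * floats, row-major. size <= h and size <= w are assumed. */
void maxpool2d(const float *in, size_t h, size_t w,
               size_t size, size_t stride, float *out)
{
    size_t out_h = (h - size) / stride + 1;
    size_t out_w = (w - size) / stride + 1;

    for (size_t oy = 0; oy < out_h; ++oy)
        for (size_t ox = 0; ox < out_w; ++ox) {
            /* Start with the top-left value of the window. */
            float best = in[(oy * stride) * w + (ox * stride)];
            for (size_t ky = 0; ky < size; ++ky)
                for (size_t kx = 0; kx < size; ++kx) {
                    float v = in[(oy * stride + ky) * w + (ox * stride + kx)];
                    if (v > best)
                        best = v;
                }
            out[oy * out_w + ox] = best;
        }
}
```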
2. Writing the shader code using the OpenGLES and Vulkan API to perform the computations required by the selected layers on the GPU.
The shader code will need to be optimized for parallel processing. Here is an example of a GLSL ES fragment shader that applies a 3x3 convolution kernel to a texture using the OpenGLES API:

    // GLSL ES fragment shaders have no default float precision, so declare one.
    precision mediump float;

    uniform float uKernel[9];     // 3x3 kernel weights, row by row
    uniform sampler2D uSampler;   // input image
    uniform vec2 uTextureSize;    // input size in texels

    varying vec2 vTexCoord;       // coordinate of the current output pixel

    void main(void)
    {
        vec4 sum = vec4(0.0);
        vec2 stepSize = 1.0/(uTextureSize);   // distance between neighbouring texels

        // Top row of the 3x3 neighbourhood
        sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y - stepSize.y))
                * uKernel[0];
        sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y - stepSize.y))
                * uKernel[1];
        sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y - stepSize.y))
                * uKernel[2];

        // Middle row
        sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y))
                * uKernel[3];
        sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y))
                * uKernel[4];
        sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y))
                * uKernel[5];

        // Bottom row
        sum += texture2D(uSampler, vec2(vTexCoord.x - stepSize.x, vTexCoord.y + stepSize.y))
                * uKernel[6];
        sum += texture2D(uSampler, vec2(vTexCoord.x, vTexCoord.y + stepSize.y))
                * uKernel[7];
        sum += texture2D(uSampler, vec2(vTexCoord.x + stepSize.x, vTexCoord.y + stepSize.y))
                * uKernel[8];

        sum.a = 1.0;   // keep the output opaque

        gl_FragColor = sum;
    }
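With Vulkan, the compiled compute shader (SPIR-V bytecode) is first wrapped in a shader module. The helper name and parameter types below are assumptions for illustration:

    #include <stdexcept>
    #include <vector>
    #include <vulkan/vulkan.h>

    // shaderCode holds SPIR-V bytecode; its size must be a multiple of 4 bytes.
    VkShaderModule createShaderModule(VkDevice device, const std::vector<char>& shaderCode) {
        VkShaderModuleCreateInfo createInfo = {};
        createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
        createInfo.codeSize = shaderCode.size();   // size in bytes, not in words
        createInfo.pCode = reinterpret_cast<const uint32_t*>(shaderCode.data());

        VkShaderModule shaderModule;
        if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
            throw std::runtime_error("Failed to create shader module!");
        }
        return shaderModule;
    }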
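The shader module snippet above expects the SPIR-V bytecode to be in memory already. Since Darknet itself is written in C, here is a plain C sketch of loading a compiled shader file into a word-aligned buffer. `load_spirv` is a hypothetical helper, not part of Vulkan or Darknet. It rejects files whose size is not a multiple of 4, because `codeSize` must be a multiple of 4 bytes.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole SPIR-V file into a newly allocated uint32_t buffer.
 * On success, returns the buffer (caller frees it) and stores the size
 * in bytes in *size_bytes. Returns NULL if the file cannot be read or
 * its size is zero or not a multiple of 4. */
uint32_t *load_spirv(const char *path, size_t *size_bytes)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return NULL;
    }
    long len = ftell(f);
    if (len <= 0 || len % 4 != 0) {
        fclose(f);
        return NULL;
    }
    rewind(f);

    uint32_t *words = malloc((size_t)len);
    if (words == NULL) {
        fclose(f);
        return NULL;
    }
    if (fread(words, 1, (size_t)len, f) != (size_t)len) {
        free(words);
        fclose(f);
        return NULL;
    }
    fclose(f);

    *size_bytes = (size_t)len;
    return words;
}
```

The returned pointer and size can be passed as `pCode` and `codeSize` when filling in `VkShaderModuleCreateInfo`.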
3. Integrate the shader code into the Darknet CNN framework, which is used to build the YOLOv3 model.
Integrating the shader code into the Darknet CNN framework involves modifying the existing codebase to support the OpenGLES API calls, so that the selected layers execute on the GPU using the optimized shader code. The goal is efficient execution of these layers on the GPU, resulting in faster object detection with the YOLOv3 model. Detection accuracy should remain unchanged as long as the shaders reproduce the CPU results.
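One possible shape for this integration is a per-layer dispatch that falls back to the CPU whenever no shader exists. The C sketch below is only an assumption about how such a hook could look. `layer_ops`, `select_forward` and `leaky_relu_cpu` are hypothetical names and do not describe Darknet's actual structures.

```c
#include <stddef.h>

/* Signature shared by CPU and GPU implementations of a layer. */
typedef void (*forward_fn)(const float *in, float *out, size_t n);

typedef struct {
    forward_fn cpu;   /* always available */
    forward_fn gpu;   /* NULL when no shader exists for this layer type */
} layer_ops;

/* Choose the GPU path only when a GPU context exists and the layer has a
 * shader-backed implementation; otherwise fall back to the CPU path. */
forward_fn select_forward(const layer_ops *ops, int gpu_available)
{
    return (gpu_available && ops->gpu != NULL) ? ops->gpu : ops->cpu;
}

/* Example CPU implementation: leaky ReLU with slope 0.1. */
void leaky_relu_cpu(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] > 0.0f ? in[i] : 0.1f * in[i];
}
```

With a fallback like this, layers can be ported to shaders one at a time while the network as a whole keeps running.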
4. Compile and build the modified Darknet code with the integrated OpenGLES shaders
1. Install dependencies such as CUDA, OpenCV, etc.
2. Modify and build the Darknet code, which involves adding code to existing Darknet files or creating new files.
3. Test and deploy the modified code.
5. Test the performance of the modified YOLOv3 model with and without GPU acceleration to measure the speed-up achieved by the GPU acceleration.
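A minimal timing harness for this comparison could look like the following C sketch. The names `bench_fn`, `mean_seconds` and `speedup` are illustrative. It uses wall-clock time from the C11 function `timespec_get`, and it assumes that the GPU variant of the benchmarked function blocks until the GPU has finished (for example by calling `glFinish`), since otherwise only the submission time would be measured.

```c
#include <time.h>

/* Function under test; ctx carries whatever state it needs. */
typedef void (*bench_fn)(void *ctx);

/* Current wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Mean wall-clock time of one call to fn, averaged over runs (> 0) calls. */
double mean_seconds(bench_fn fn, void *ctx, int runs)
{
    double start = now_seconds();
    for (int i = 0; i < runs; ++i)
        fn(ctx);
    return (now_seconds() - start) / runs;
}

/* Speed-up of the accelerated path relative to the CPU baseline.
 * Returns 0 if the accelerated time is not positive. */
double speedup(double cpu_seconds, double gpu_seconds)
{
    return gpu_seconds > 0.0 ? cpu_seconds / gpu_seconds : 0.0;
}
```

Averaging over several runs, and discarding the first run if shader compilation happens lazily, should give more stable numbers.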
Timeline
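Date | Status | Details
---|---|---
Apr 4 | Application Deadline | Submitting proposal to the mentors
May 4 | Selection Phase | Proposal accepted or rejected
May 4 - May 10 | Community Bonding | Discuss implementation idea with mentors; discuss the scope of the project
May 10 - June 5 | College Exams | Focus on college exams
June 5 | Milestone #1 | Introductory YouTube video; develop conceptual knowledge; research and familiarize with Vulkan; implement and benchmark Darknet on the host; research the TIDL implementation
June 12 | Milestone #2 | Learn the Vulkan API implementation; port Darknet to the target platform (BeagleBoard); research TIDL APIs; Edge AI
June 19 | Milestone #3 | Share the cross-compiled darknet folder to the BBAI-64; benchmark and verify the implementation
July 10 | Milestone #4, Midterm Evaluation | Provide the mid-term report to the mentor
July 17th | Milestone #5 | Work on the feedback received from the mentors; improve the overall program for better output
July 24th | Milestone #6 | Implement the final layer; work on the Darknet framework
July 31st | Milestone #7 | Run and test on the BeagleBoard X15/AI; work on the feedback received from the mentor
August 14th | Final Submission | Complete the final documentation; submit final work product and final mentor evaluation; completion of GSoC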
Experience and approach
This project requires knowledge in Neural Networks, convolution, C/C++, Linux kernel and OpenGLES.
- I have previously worked on the GPGPU-WITH-GLES project. Hence, I have a good understanding of OpenGLES APIs, shaders and Linux kernels.
- Vulkan requires some high-level coding knowledge. Since I am already familiar with OpenGLES, it will be easy for me to learn Vulkan.
- I am well-versed in the different types of GPU-capable shaders and I am aware of which of them would be suitable for this project.
- I have also implemented operations such as matrix multiplication and matrix transpose.
- I have been exploring Neural Networks and Convolutions and have gained sufficient knowledge to start the implementation.
- I also have a BeagleBone (PocketBeagle) and have tried implementing the Darknet framework on it.
- I am a passionate open source enthusiast and I will do the work wholeheartedly. I am committed to GSoC and will do everything in my power to finish the project within the allotted time.
- I will keep contributing to the project after GSoC and will be interacting with the community often.
Contingency
If I run into any difficulties, I will refer to the following resources:
- I have a list of resources available online, so if I get stuck I will refer to those resources.
- I will use Beagle Slack to communicate with other mentors.
Benefit
- The performance of the YOLOv3 model is improved, which will lead to faster object detection.
- Many layers can be accelerated at a time, hence the efficiency of the model is improved.
- Memory usage is reduced by offloading the computations to the GPU, as discussed here.