TensorRT/PerfIssues
Revision as of 19:29, 15 January 2020
Quick Check
Basically, the rules below should be followed to get good performance.
CUDA Perspective
- Only one CUDA context per GPU
  - Multiple CUDA contexts consume extra memory
  - Different CUDA contexts sharing the same GPU are time-sliced
- Do not use the default CUDA stream
  - Any CUDA command issued to the NULL stream will cause an implicit synchronization
  - Run parallel CUDA tasks, e.g. different TensorRT inference instances, on different CUDA streams
- No cudaMalloc() called in the main loop of the application
TensorRT Perspective
- Batching
  - Maximize the parallelism capability on the GPU
  - Save runtime memory consumption
- Lower Precision Mode
  - Lower precision has higher compute capability
  - Lower precision consumes less memory
- Ensure there is enough workspace size for TensorRT inference
builder->setMaxWorkspaceSize(1_GB);  // TensorRT 5.1
config->setMaxWorkspaceSize(1_GiB);  // TensorRT 6.0
Profiler
There are many useful profiler tools that can help TensorRT users find out the performance status.
trtexec
- It's in the TensorRT package (bin: TensorRT/bin/trtexec, code: TensorRT/samples/trtexec/)
- Lots of handy and useful options are supported:
  - build a model using different build options, with or without weight/input/calib data, and save the built TensorRT engine
  - run inference using different inference options, with or without input, or simply run inference with a TensorRT engine
  - inference profiling
- Example of "./trtexec --deploy=ResNet-50-deploy.prototxt --output=prob --int8 --batch=8 --dumpProfile"
- Log - File:Trtexec log.txt
Measure the Inference Time
Users can use CPU Timing or CUDA Events to measure the inference time. Since the GPU needs some milliseconds to warm up, it is better to measure the time over many loops, e.g. 500 inference loops, or to add 100 warm-up loops before profiling, to eliminate the impact of the warm-up.
- Note: the event below can be applied in an application to improve the overlap of inference data preparation and inference
- TensorRT also includes an optional CUDA event in the method IExecutionContext::enqueue that will be signaled once the input buffers are free to be reused. This allows the application to immediately start refilling the input buffer region for the next inference in parallel with finishing the current inference. For example:
cudaEvent_t inputReady;
cudaEventCreate(&inputReady);

context->enqueue(batchSize, &buffers[0], stream, &inputReady);
cudaEventSynchronize(inputReady);

// At this point we can refill the input buffers, but output buffers may not be done
IProfiler
IProfiler is TensorRT's built-in profiling interface. An IProfiler implementation (SimpleProfiler) is provided in the common sample code (common.h); below is a sample change to apply IProfiler to sampleSSD:
--- sampleSSD.cpp.orig	2019-05-27 12:39:14.193521455 +0800
+++ sampleSSD.cpp	2019-05-27 12:38:59.393358775 +0800
@@ -428,8 +428,11 @@
     float* detectionOut = new float[N * kKEEP_TOPK * 7];
     int* keepCount = new int[N];
 
+    SimpleProfiler profiler(" layer time");
+    context->setProfiler(&profiler);
     // Run inference
     doInference(*context, data, detectionOut, keepCount, N);
+    std::cout << profiler;
 
     bool pass = true;
With IProfiler, after inference finishes, the profiler reports the timing for each layer in the network, as in the log below.
log - File:IProfiler.png
VERBOSE Log
Use "gLogger.setReportableSeverity(nvinfer1::ILogger::Severity::kVERBOSE);" to enable the VERBOSE log. After the TensorRT build, it will report TensorRT tactic selection information like
log - File:Verbose log.txt
In such log, there are logs like:
- conv2/3x3 + conv2/relu_3x3 (i8816cudnn) Set Tactic Name: volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_small_nt_v1
Here, "conv2/3x3 + conv2/relu_3x3 (i8816cudnn)" is the layer name, and "volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_small_nt_v1" is the CUDA kernel selected for this layer. If the kernel name includes "i8816", it indicates that Tensor Core is selected.
nvprof
- Users can call cudaProfilerStart() / cudaProfilerStop() as below to limit the profiling region
diff --git a/trtexec.cpp b/trtexec.cpp
index 95d01fb..d0747b6 100644
--- a/trtexec.cpp
+++ b/trtexec.cpp
@@ -63,6 +63,7 @@
 #include <time.h>
 #include <vector>
+#include <cuda_profiler_api.h>
@@ -427,9 +428,11 @@ void doInference(ICudaEngine& engine)
 {
     cudaEventRecord(start, stream);
+    cudaProfilerStart();
     context->enqueue(gParams.batchSize, &buffers[0], stream, nullptr);
     cudaEventRecord(end, stream);
     cudaEventSynchronize(end);
+    cudaProfilerStop();
 
     auto tEnd = std::chrono::high_resolution_clock::now();
- Sample use cases:
- Profile Tensor Core FP16/INT8 Utilization
  - The tensor_precision_fu_utilization and tensor_int_fu_utilization metrics can be used to profile FP16 and INT8 Tensor Core utilization respectively (see the nvprof user guide for more info)
$ nvprof --profile-from-start off -m tensor_precision_fu_utilization,tensor_int_fu_utilization ./trtexec --deploy=ResNet-50-deploy.prototxt --output=prob --int8 --batch=2 --avgRuns=1 --iterations=1
- log - File:Tensor core int8 fp16 utilization.txt
- Profile Tensor Core FP16/INT8 Utilization and GPU trace
- $ /usr/local/cuda/bin/nvprof --profile-from-start off --trace api -m tensor_precision_fu_utilization,tensor_int_fu_utilization ./trtexec --deploy=ResNet-50-deploy.prototxt --output=prob --int8 --batch=2 --avgRuns=1 --iterations=1
- log - File:Tensor core int8 fp16 utilization AND gpu trace.txt
- Profile GPU call trace in sequence
- $ /usr/local/cuda/bin/nvprof --profile-from-start off --print-gpu-trace ./trtexec --deploy=ResNet-50-deploy.prototxt --output=prob --int8 --batch=2 --avgRuns=1 --iterations=1
- log - File:Gpu trace.txt
Nsight Systems
The recommended CUDA profilers are NVIDIA Nsight Compute and NVIDIA Nsight Systems. See 1.5. CUDA Profiling in Best Practices For TensorRT Performance for more information.
Further Measures for Performance Improvement
- Upgrade to the latest TensorRT version
- Enable DLA to offload the GPU (Xavier)
- Layer Fusion
- Design the network to be more Tensor Core friendly (channels % 32 == 0)
- Network Pruning
- Choose faster network under specific accuracy requirement