NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and a runtime that deliver low latency and high throughput for inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms.
1. How do I check the TensorRT version?
There are two ways to check the TensorRT version:
- Symbols from library
$ nm -D /usr/lib/aarch64-linux-gnu/libnvinfer.so | grep "tensorrt"
0000000007849eb0 B tensorrt_build_svc_tensorrt_20181028_25152976
0000000007849eb4 B tensorrt_version_5_0_3_2
NOTE: 20181028 is the build date, 25152976 is the top changelist, and 5_0_3_2 is the version (5.0.3.2).
- Macros from header file
$ cat /usr/include/aarch64-linux-gnu/NvInfer.h | grep "define NV_TENSORRT"
#define NV_TENSORRT_MAJOR 5 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 3 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 2 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 5 //!< Shared object library major version number.
#define NV_TENSORRT_SONAME_MINOR 0 //!< Shared object library minor version number.
#define NV_TENSORRT_SONAME_PATCH 3 //!< Shared object library patch version number.
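If you need the version inside a script, for example to gate a deployment step, both outputs above are easy to parse. The following is a minimal sketch, not an official tool: the function names are our own, and the sample text is copied from the outputs shown above. On a real system you would read NvInfer.h or capture the nm output instead.

```python
import re

# Sample text copied from the command outputs shown in this answer.
HEADER_TEXT = """
#define NV_TENSORRT_MAJOR 5 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 3 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 2 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 5 //!< Shared object library major version number.
"""
SYMBOL_TEXT = "0000000007849eb4 B tensorrt_version_5_0_3_2"


def version_from_header(text):
    """Return (major, minor, patch, build) parsed from NV_TENSORRT_* macros.

    The SONAME macros are ignored because the pattern requires the field
    name to follow NV_TENSORRT_ directly.
    """
    pattern = r"#define\s+NV_TENSORRT_(MAJOR|MINOR|PATCH|BUILD)\s+(\d+)"
    found = dict(re.findall(pattern, text))
    return tuple(int(found[key]) for key in ("MAJOR", "MINOR", "PATCH", "BUILD"))


def version_from_symbol(text):
    """Return the version tuple encoded in a tensorrt_version_* symbol name."""
    match = re.search(r"tensorrt_version_(\d+)_(\d+)_(\d+)_(\d+)", text)
    if match is None:
        raise ValueError("no tensorrt_version_* symbol found")
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    print(version_from_header(HEADER_TEXT))
    print(version_from_symbol(SYMBOL_TEXT))
```

Both functions return a tuple, so versions can be compared directly with == or <.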
2. Is TensorRT thread-safe?
The TensorRT runtime is thread-safe in the sense that parallel threads using different TensorRT execution contexts can execute in parallel without interference.
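The usage pattern this implies is one execution context per thread. The sketch below only illustrates that pattern: StubContext is a hypothetical stand-in that we made up for illustration, not the TensorRT API, and its "inference" is a placeholder computation. The point is that each thread touches only its own context object.

```python
import threading


class StubContext:
    """Hypothetical stand-in for an execution context; NOT the TensorRT API."""

    def __init__(self):
        self.scratch = []  # per-context working memory, never shared

    def execute(self, batch):
        # Placeholder "inference": double every input value.
        self.scratch.clear()
        self.scratch.extend(value * 2 for value in batch)
        return list(self.scratch)


def run_parallel(batches):
    """Run each batch on its own thread, each with its own context."""
    contexts = [StubContext() for _ in batches]
    results = [None] * len(batches)

    def worker(index):
        # Thread `index` only ever uses contexts[index].
        results[index] = contexts[index].execute(batches[index])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(batches))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_parallel([[1, 2], [3]]))
```

Sharing a single context between the threads would let them overwrite each other's scratch state, which is the interference the answer above refers to.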
3. Is an INT8 calibration table compatible across different TensorRT versions or hardware platforms?
An INT8 calibration table is absolutely NOT compatible between different TensorRT versions, because the optimized network graph is likely to differ from one version to the next. If you force TensorRT to use a table generated by another version, it may not find the scaling factor for a given tensor.
Across hardware platforms, the calibration table is compatible as long as the installed TensorRT version is identical. That means you can perform INT8 calibration on a faster platform, such as a V100 or P4, and then deploy the calibration table to Tegra for INT8 inference.
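One way to enforce this rule in a deployment script is to record the TensorRT version next to the calibration table and refuse to reuse the table on a mismatch. The sketch below assumes such a sidecar record; the metadata layout and the function names are our own invention for illustration, not something TensorRT produces.

```python
def make_metadata(trt_version, calibrated_on):
    """Describe a calibration table (hypothetical sidecar record)."""
    return {"trt_version": list(trt_version), "calibrated_on": calibrated_on}


def can_reuse(metadata, installed_version):
    """A table is reusable only when the full TensorRT version matches.

    The platform it was calibrated on is deliberately not checked, since
    the table carries over between platforms with the same version.
    """
    return tuple(metadata["trt_version"]) == tuple(installed_version)


if __name__ == "__main__":
    table_info = make_metadata((5, 0, 3, 2), "V100")
    print(can_reuse(table_info, (5, 0, 3, 2)))  # same version on the target
    print(can_reuse(table_info, (5, 1, 0, 0)))  # different version on the target
```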
4. How do I check GPU utilization?
On Tegra platforms, use tegrastats:
$ sudo /home/nvidia/tegrastats
On desktop platforms, such as Tesla, use nvidia-smi:
$ nvidia-smi --format=csv -lms 500 --query-gpu=index,timestamp,utilization.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.video,clocks.current.memory,utilization.memory,memory.total,memory.free,memory.used,power.limit,power.draw,temperature.gpu,fan.speed,compute_mode,gpu_operation_mode.current,clocks_throttle_reasons.active,pstate,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sync_boost -i 0 | tee log.cs
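The CSV log can then be summarized offline. Here is a minimal sketch that averages the utilization.gpu column. It assumes the usual nvidia-smi CSV layout, where header names carry a unit in brackets and values carry a unit suffix; the two sample rows are made-up illustrative values, not real measurements, and only a few of the queried columns are shown.

```python
import csv
import io

# Illustrative sample in nvidia-smi CSV layout (values are invented).
SAMPLE_LOG = """index, timestamp, utilization.gpu [%], power.draw [W]
0, 2018/11/01 10:00:00.000, 40 %, 55.10 W
0, 2018/11/01 10:00:00.500, 60 %, 61.30 W
"""


def mean_gpu_utilization(log_text):
    """Return the average of the utilization.gpu column, in percent."""
    reader = csv.reader(io.StringIO(log_text), skipinitialspace=True)
    header = next(reader)
    # Locate the column by name prefix so the column order does not matter.
    column = next(i for i, name in enumerate(header)
                  if name.startswith("utilization.gpu"))
    # Values look like "40 %"; keep the number and drop the unit.
    samples = [float(row[column].split()[0]) for row in reader if row]
    return sum(samples) / len(samples)


if __name__ == "__main__":
    print(mean_gpu_utilization(SAMPLE_LOG))
```

To use it on a real log, read the file contents and pass the text to mean_gpu_utilization.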
5. What is kernel auto-tuning?
TensorRT contains various kernel implementations, including those from cuDNN and cuBLAS, to accommodate diverse neural network configurations (batch size, input/output dimensions, filters, strides, padding, dilation rate, etc.). During network building, TensorRT profiles all suitable kernels, finds the one with the lowest latency, and marks it as the final tactic for running that layer. We call this process kernel auto-tuning. Additionally, an INT8 kernel is not always faster than an FP16 kernel, nor an FP16 kernel always faster than an FP32 kernel, so:
- if you run in FP16 precision mode, it profiles all candidates in the FP16 kernel pool and the FP32 kernel pool.
- if you run in INT8 precision mode, it profiles all candidates in the INT8 kernel pool and the FP32 kernel pool.
- if both FP16 and INT8 are enabled (we call it hybrid mode), it profiles all candidates in the INT8 kernel pool, the FP16 kernel pool and the FP32 kernel pool.
If the current layer chooses a different precision mode from its bottom layer or top layer, TensorRT inserts a reformatting layer between them to convert the tensor format, and the time taken by this reformatting layer is counted as part of the cost of the current layer during auto-tuning.
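The selection rule described above can be written down as a toy model. This is only a sketch of the idea, not how TensorRT is implemented: real auto-tuning times kernels on the GPU, while here the latencies, the kernel names, and the single flat reformatting cost are made-up inputs, and only one neighboring layer is considered.

```python
# Kernel pools that are profiled in each precision mode, as listed above.
POOLS_BY_MODE = {
    "fp32": {"fp32"},
    "fp16": {"fp16", "fp32"},
    "int8": {"int8", "fp32"},
    "hybrid": {"int8", "fp16", "fp32"},
}


def pick_tactic(candidates, mode, neighbor_precision, reformat_cost_ms):
    """Return (kernel_name, total_cost_ms) for the cheapest eligible kernel.

    candidates: list of (kernel_name, precision, measured_latency_ms).
    A kernel whose precision differs from the neighboring layer pays the
    reformatting cost on top of its own latency.
    """
    best = None
    for name, precision, latency_ms in candidates:
        if precision not in POOLS_BY_MODE[mode]:
            continue  # this pool is not profiled in the selected mode
        cost = latency_ms
        if precision != neighbor_precision:
            cost += reformat_cost_ms
        if best is None or cost < best[1]:
            best = (name, cost)
    return best


if __name__ == "__main__":
    # Illustrative latencies in milliseconds (invented values).
    kernels = [("int8_k", "int8", 1.0), ("fp16_k", "fp16", 1.2), ("fp32_k", "fp32", 2.0)]
    print(pick_tactic(kernels, "hybrid", "fp16", 0.5))
    print(pick_tactic(kernels, "int8", "fp16", 0.5))
```

In the hybrid run, the INT8 kernel is the fastest in isolation but loses to the FP16 kernel once the reformatting cost is added, which is why the fastest raw kernel is not always the chosen tactic.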