TensorRT/How2Debug
Layer Dump and Analyze
Refer to the page https://elinux.org/TensorRT/LayerDumpAndAnalyze
How to dump the output of a certain layer?
TensorRT doesn’t store the intermediate results of your network, so you have to use the following API to mark the intended layer’s output tensor as a network output, then run inference again and save its result for further analysis (a fuller sketch follows the note below):
network->markOutput(*layer->getOutput(0)); // markOutput() takes an ITensor&, not a layer name
NOTE:
- You can set multiple layers as outputs at the same time, but marking a layer as an output may break network optimization and hurt inference performance, because TensorRT always runs output layers in FP32, no matter which precision mode you have configured.
- Don’t forget to adjust the dimensions or output buffer sizes after you change the output layers.
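For illustration, here is a minimal sketch of marking a middle layer as an output through the C++ API. The helper function and the layer name "conv2" are hypothetical; since markOutput() takes an ITensor&, the layer’s output tensor is looked up first:

#include <cstring>
#include "NvInfer.h"

// Mark the first output tensor of the layer with the given name as a network
// output, so its values can be dumped after inference. Hypothetical helper.
void markLayerAsOutput(nvinfer1::INetworkDefinition* network, const char* layerName)
{
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        if (std::strcmp(layer->getName(), layerName) == 0)
        {
            network->markOutput(*layer->getOutput(0));
            return;
        }
    }
}

// Usage, before building the engine:
//   markLayerAsOutput(network, "conv2");

Remember that the engine must be rebuilt after the network definition changes, and that the extra output needs its own binding buffer at inference time.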
How to debug an ONNX model by setting an extra output layer?
Sometimes we need to debug a model by dumping the output of a middle layer; this FAQ shows a way to set a middle layer as an output when debugging an ONNX model.
The steps below set one middle layer of the mnist.onnx model as an output, using the patch shown at the bottom.
- Download onnx-tensorrt and mnist.onnx
- Get all nodes' info: apply the first section of the patch ("dump all nodes' output"), build onnx2trt, then run the following command to list all nodes:
$ ./onnx2trt mnist.onnx -o mnist.engine
- Set one layer as output: pick the node name from the output of step 2, set it as an output with the second section of the patch ("set one layer as output"), rebuild onnx2trt, and run the command below to regenerate the engine:
$ ./onnx2trt mnist.onnx -o mnist.engine
- Dump output with the engine file:
$ ./trtexec --engine=mnist.engine --input=Input3 --output=Plus214_Output_0 --output=Convolution110_Output_0 --dumpOutput
Here is the patch, based on onnx-tensorrt:
diff --git a/ModelImporter.cpp b/ModelImporter.cpp
index ac4749c..8638add 100644
--- a/ModelImporter.cpp
+++ b/ModelImporter.cpp
@@ -524,6 +524,19 @@ ModelImporter::importModel(::ONNX_NAMESPACE::ModelProto const &model,
     output_names.push_back(model.graph().output(i).name());
   }
 
+  // ======= dump all nodes' output ============
+  int node_size = graph.node_size();
+  cout << "ModelImporter::importModel : graph.node_size() = " << node_size << " *******" << endl;
+  for (int i = 0; i < graph.node_size(); i++) {
+    ::ONNX_NAMESPACE::NodeProto const& node = graph.node(i);
+    if( node.output().size() > 0 ) {
+      cout << "node[" << i << "] = "
+           << node.output(0) << ":"
+           << node.op_type() << endl;
+    }
+  }
+  // =========================================
+
   string_map<TensorOrWeights> tensors;
   TRT_CHECK(importInputs(&_importer_ctx, graph, &tensors,
                          weight_count, weight_descriptors));
@@ -559,10 +572,17 @@ ModelImporter::importModel(::ONNX_NAMESPACE::ModelProto const &model,
     }
   }
   _current_node = -1;
+
+  // =========== set one layer as output; "Convolution110_Output_0" below is from the above dump ==
+  nvinfer1::ITensor* new_output_tensor_ptr = &tensors.at("Convolution110_Output_0").tensor();
+  _importer_ctx.network()->markOutput(*new_output_tensor_ptr);
+  // ==========================================================================================
+
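With the first section applied, onnx2trt prints one line per node in the form node[index] = output_name:op_type. For mnist.onnx you would expect entries like the following (the node count, indices, and op types here are illustrative, not an exact capture):

ModelImporter::importModel : graph.node_size() = 12 *******
...
node[4] = Convolution110_Output_0:Conv
...
node[11] = Plus214_Output_0:Add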
How to analyze network performance?
First of all, we should be aware of the profiling command-line tool that TensorRT provides: trtexec.
If every layer of your network is supported by TensorRT, either natively or through a plugin, you can always use this tool to profile your network very quickly, as in the example below.
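For example, a quick profiling run over a Caffe prototxt might look like the following (--iterations and --avgRuns are assumed here from the 5.x trtexec options; they control how many timed runs are executed and averaged):

$ ./trtexec --deploy=ResNet50_N2.prototxt --output=prob --batch=128 --iterations=100 --avgRuns=10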
Second, you can add profiling metrics to your application manually, on the CPU side (link) or the GPU side (link).
NOTE:
- Time collection should cover only the network enqueue() or execute() call; any context set-up, memory initialization, or refill operations should be excluded.
- Use more iterations for the time collection, in order to average out the GPU warm-up effect; see the timing sketch after this note.
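Here is a minimal host-side timing sketch that follows both notes above (all names are placeholders; it assumes an already-built IExecutionContext with bound device buffers):

#include <chrono>
#include <iostream>
#include "NvInfer.h"

// Time only the synchronous execute() call, averaged over many runs.
// Buffer uploads, context creation, and other set-up stay outside the timer.
void timeInference(nvinfer1::IExecutionContext& context, void** bindings, int batchSize)
{
    const int kWarmUp = 10;   // untimed runs to absorb the GPU warm-up effect
    const int kRuns   = 100;  // timed runs to average over

    for (int i = 0; i < kWarmUp; ++i)
        context.execute(batchSize, bindings);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kRuns; ++i)
        context.execute(batchSize, bindings);
    auto end = std::chrono::high_resolution_clock::now();

    double avgMs = std::chrono::duration<double, std::milli>(end - start).count() / kRuns;
    std::cout << "average inference time: " << avgMs << " ms" << std::endl;
}

execute() is synchronous, so wall-clock timing around it is valid; if you time the asynchronous enqueue() instead, synchronize the stream (e.g. with cudaStreamSynchronize()) inside the timed region before reading the clock.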
Third, if you would like to scope the time consumption of each layer, you can implement the IProfiler interface yourself, or use the SimpleProfiler that TensorRT already provides (refer to the patch below for sampleSSD):
--- sampleSSD.cpp.orig	2019-05-27 12:39:14.193521455 +0800
+++ sampleSSD.cpp	2019-05-27 12:38:59.393358775 +0800
@@ -428,8 +428,11 @@
     float* detectionOut = new float[N * kKEEP_TOPK * 7];
     int* keepCount = new int[N];
 
+    SimpleProfiler profiler(" layer time");
+    context->setProfiler(&profiler);
     // Run inference
     doInference(*context, data, detectionOut, keepCount, N);
+    std::cout << profiler;
 
     bool pass = true;
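If your sample does not ship SimpleProfiler, a bare-bones IProfiler works the same way; the sketch below just prints each layer's time as TensorRT reports it (aggregation and pretty-printing are left out):

#include <iostream>
#include "NvInfer.h"

// TensorRT invokes reportLayerTime() once per layer per inference,
// passing the layer's GPU execution time in milliseconds.
struct LayerProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::cout << layerName << ": " << ms << " ms" << std::endl;
    }
};

// Usage:
//   LayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, bindings);  // profiling uses the synchronous path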
Print the fused layer times in order
1. Apply the change below to make sure the layer times are printed in order:
File:Change to print the layer time in sequence.patch
2. Apply the change below to profile and print the layer times:
File:Change to profile the layer time.patch
3. Profile
TensorRT-5.1.6.0/bin$ ./trtexec --deploy=ResNet50_N2.prototxt --output=prob --int8 --batch=128
&&&& RUNNING TensorRT.trtexec # ./trtexec --deploy=ResNet50_N2.prototxt --output=prob --int8 --batch=128
..
[I] Average over 10 runs is 25.8291 ms (host walltime is 25.8599 ms, 99% percentile time is 26.0829).
========== layertime profile ==========
TensorRT layer name                        Runtime, %   Invocations   Runtime, ms
conv1 + conv1_relu input reformatter 0        1.6%          100          40.94
conv1 + conv1_relu                            9.9%          100         256.82
pool1                                         2.8%          100          72.19
res2a_branch2a + res2a_branch2a_relu          1.2%          100          31.29
res2a_branch2b + res2a_branch2b_relu          2.1%          100          53.49
res2a_branch2c                                3.2%          100          81.77
res2a_branch1 + res2a + res2a_relu            3.7%          100          96.50
res2b_branch2a + res2b_branch2a_relu          2.0%          100          51.35
res2b_branch2b + res2b_branch2b_relu          2.1%          100          53.62
res2b_branch2c + res2b + res2b_relu           3.7%          100          96.38
res2c_branch2a + res2c_branch2a_relu          2.0%          100          51.28
res2c_branch2b + res2c_branch2b_relu          2.1%          100          53.81
...... <Omit some layers> .........
res5c_branch2a + res5c_branch2a_relu          0.8%          100          21.43
res5c_branch2b + res5c_branch2b_relu          1.7%          100          43.53
res5c_branch2c + res5c + res5c_relu           1.0%          100          26.37
pool5                                         0.4%          100          10.40
fc1000 input reformatter 0                    0.1%          100           1.62
fc1000                                        0.6%          100          16.06
prob                                          0.1%          100           1.31
========== layertime total runtime = 2585.28 ms ==========
In the above log you can also see which layers were fused; for example, "res5c_branch2c + res5c + res5c_relu" indicates that the layers res5c_branch2c, res5c, and res5c_relu were fused into one layer.