For targets with NVIDIA GPUs, Neo may use TensorRT to optimize all or part of your model. Using TensorRT enables Neo compiled models to obtain the best possible performance on NVIDIA GPUs. The first inference after loading the model may take a few minutes while TensorRT builds the inference engine(s). After the engines are built, any further inference calls will be fast.
DLR provides several runtime flags to configure the TensorRT components of your optimized model. These flags are all configured through environment variables.
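Since the flags are plain environment variables, they can be set either in the shell when launching your script, or from Python via os.environ before the model is loaded. A minimal sketch (the flag values shown are illustrative):

```python
import os

# DLR reads these flags from the environment, so set them before the
# model is constructed. The values below are examples only.
os.environ["TVM_TENSORRT_USE_FP16"] = "1"               # enable FP16 conversion
os.environ["TVM_TENSORRT_CACHE_DIR"] = "engine_cache/"  # cache built engines

# import dlr
# model = dlr.DLRModel('my_compiled_model/', 'gpu', 0)
```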
The examples below use the following test script:
import dlr
import numpy as np
import time

model = dlr.DLRModel('my_compiled_model/', 'gpu', 0)
x = np.random.rand(1, 3, 224, 224)

# Warmup
model.run(x)

times = []
for i in range(100):
    start = time.time()
    model.run(x)
    times.append(time.time() - start)
print('Latency:', 1000.0 * np.mean(times), 'ms')
$ python3 run.py
Building new TensorRT engine for subgraph tensorrt_0
Finished building TensorRT engine for subgraph tensorrt_0
Latency: 3.320300579071045 ms
TVM_TENSORRT_USE_FP16=1 can be set to automatically convert the TensorRT components of your model to 16-bit floating-point precision. This can greatly improve performance, but may cause a slight loss in model accuracy.
$ TVM_TENSORRT_USE_FP16=1 python3 run.py
Building new TensorRT engine for subgraph tensorrt_0
Finished building TensorRT engine for subgraph tensorrt_0
Latency: 1.7122554779052734 ms
During the first inference, DLR will invoke the TensorRT API to build an engine. This can be time consuming, so you can set TVM_TENSORRT_CACHE_DIR to point to a directory where the built engines will be saved on disk. The next time you load the model with the same directory set, DLR will load the already-built engines and avoid the long warmup time. Note that the cached engine files can only be used on the exact same hardware and software platform on which they were generated.
$ TVM_TENSORRT_CACHE_DIR=. python3 run.py
Building new TensorRT engine for subgraph tensorrt_0
Caching TensorRT engine to ./tensorrt_0.plan
Finished building TensorRT engine for subgraph tensorrt_0
Latency: 4.380748271942139 ms

$ TVM_TENSORRT_CACHE_DIR=. python3 run.py
Loading cached TensorRT engine from ./tensorrt_0.plan
Latency: 4.414560794830322 ms
Please keep in mind that each model must have its own unique cache directory. If you are using multiple models, load a model, perform at least one inference call so its engines are built or loaded from the cache, and only then change the directory and load the next model.
import os
import dlr

# Load first model
os.environ["TVM_TENSORRT_CACHE_DIR"] = "model1_cache/"
model1 = dlr.DLRModel(...)
# Run inference at least once to load the cached engine
model1.run(...)

# Load second model
os.environ["TVM_TENSORRT_CACHE_DIR"] = "model2_cache/"
model2 = dlr.DLRModel(...)
# Run inference at least once to load the cached engine
model2.run(...)

# Now both models can be used at will.
model1.run(...)
model2.run(...)
TensorRT has a parameter to configure the maximum amount of scratch space that each layer in the model can use.
It is generally best to use the highest value which does not cause you to run out of memory.
Neo will automatically set the max workspace size to 256 megabytes for Jetson Nano and Jetson TX1 targets, and 1 gigabyte for all other NVIDIA GPU targets.
You can set TVM_TENSORRT_MAX_WORKSPACE_SIZE to override this by specifying the desired workspace size in bytes.
$ TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647 python3 run.py
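For reference, the default sizes mentioned above work out to the following byte values (a quick sketch using binary megabytes and gigabytes):

```python
# Convert human-readable sizes to the byte counts expected by
# TVM_TENSORRT_MAX_WORKSPACE_SIZE.
MB = 1024 * 1024
GB = 1024 * MB

jetson_default = 256 * MB  # Neo default for Jetson Nano / Jetson TX1
gpu_default = 1 * GB       # Neo default for other NVIDIA GPU targets

print(jetson_default)  # 268435456
print(gpu_default)     # 1073741824
```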