TensorRT-LLM Engine Builder

This talk covers how to optimize large language model inference using TensorRT-LLM, demonstrating automatic engine building for faster token generation and reduced latency.

Overview

TensorRT-LLM is a high-performance model inference framework from NVIDIA, but it’s hard to get started with. After building optimized model serving engines by hand for months, we created an internal tool for automatically compiling TRT-LLM engines.

In this demo, you’ll learn the basics of TensorRT-LLM optimization for higher tokens per second and lower time to first token when serving large language models. I’ll also show how you can build your own TRT-LLM engine in minutes with our engine builder.

Links

https://github.com/basetenlabs/Workshop-TRT-LLM
This workshop demonstrates building, deploying, and benchmarking optimized LLM inference with TensorRT-LLM.
https://www.ai.engineer/worldsfair/2024/schedule/from-model-weights...
AI Engineer World's Fair 2024 schedule entry on high-performance GPU inference with TensorRT-LLM.

Tech stack