You must be an AI Tinkerers active member to view these talks and demos.
TensorRT-LLM Engine Builder
This talk covers how to optimize large language model inference using TensorRT-LLM, demonstrating automatic engine building for faster token generation and reduced latency.
TensorRT-LLM is a high-performance model inference framework from NVIDIA, but it’s hard to get started with. After building optimized model serving engines by hand for months, we created an internal tool for automatically compiling TRT-LLM engines.
In this demo, you’ll learn the basics of TensorRT-LLM optimization for higher tokens per second and lower time to first token when serving large language models. I’ll also show how you can build your own TRT-LLM engine in minutes with our engine builder.
This workshop demonstrates building, deploying, and benchmarking optimized LLM inference with TensorRT-LLM.
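The two metrics named in the abstract, time to first token and tokens per second, can be measured against any streaming endpoint. The following is a minimal sketch, not the benchmarking code from the talk: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in for a real model server that simply sleeps between tokens.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report time to first token and throughput."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # arrival time of the first token
        count += 1
    total = clock() - start
    return {
        "tokens": count,
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "tokens_per_s": count / total if total > 0 else 0.0,
    }


def fake_stream(n: int, first_delay: float = 0.05, step_delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model endpoint (assumption, not a real API)."""
    time.sleep(first_delay)  # simulates prefill before the first token
    for i in range(n):
        if i:
            time.sleep(step_delay)  # simulates per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(20)))
```

To benchmark a real deployment, replace `fake_stream` with an iterator over the tokens your serving client streams back. Note that this measures throughput over the whole request, including the wait for the first token.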
AI Engineer World's Fair demonstration of high-performance TensorRT-LLM GPU inference.