TensorRT-LLM
Model Inference · Open Source · Verified
Open-source library for optimizing large language model inference on NVIDIA GPUs. Combines deep hardware–software co-design with custom kernels, FP8/FP4 quantization, in-flight batching, paged KV caching, and speculative decoding to maximize throughput. Best suited for high-performance LLM serving on NVIDIA hardware in production environments.
Price: Free
License: Apache-2.0