LLMInferenceService Tutorials
This is a set of tutorial guides for deploying the KServe LLMInferenceService in a variety of configurations and with a variety of models.
All of the tutorials are available in the KServe repository under samples/docs.
End-to-end guide: Run GPT-OSS-20B with KServe and llm-d
This guide walks through deploying RedHatAI/gpt-oss-20b on Kubernetes using KServe. Steps are ordered from cluster setup through inference, AI gateway routing, optional prefix caching, and monitoring.
Three alternate deployments are detailed here:
- default - intelligent inference scheduling with vLLM and the llm-d scheduler
- precise prefix cache aware routing - an advanced configuration that takes advantage of vLLM KV-Events
- prefill-decode disaggregation - an advanced configuration that runs separate vLLM pods for the prefill and decode stages of inference.
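The default deployment above can be sketched as a minimal LLMInferenceService manifest. This is an illustration only: the `apiVersion`, field names, and defaults are assumptions based on recent KServe releases, and the manifests shipped in samples/docs are authoritative.

```yaml
# Hypothetical minimal manifest for the default (intelligent scheduling) path.
apiVersion: serving.kserve.io/v1alpha1   # assumed; check your KServe release
kind: LLMInferenceService
metadata:
  name: gpt-oss-20b
spec:
  model:
    uri: hf://RedHatAI/gpt-oss-20b   # model reference from this guide
    name: RedHatAI/gpt-oss-20b
  replicas: 2
  router:
    # Empty structs request the managed defaults: an llm-d inference
    # scheduler, an HTTPRoute, and a Gateway.
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "1"
```

After applying a manifest like this, watch `kubectl get llminferenceservice gpt-oss-20b` until the service reports ready, then send OpenAI-style requests through the gateway.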
Single-Node GPU Deployment Examples
This contains example configurations for deploying LLM inference services on single-node GPU setups, ranging from basic load balancing to advanced prefill-decode separation with KV cache transfer.
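The prefill-decode separation mentioned above can be sketched as a `prefill` section alongside the main (decode) template, with KV blocks handed off through a vLLM KV-transfer connector. The `spec.prefill` shape and the connector choice here are assumptions for illustration; `--kv-transfer-config` is the vLLM flag that selects the connector, and the tutorial manifests show the exact configuration used.

```yaml
# Hypothetical prefill-decode split: a dedicated prefill pod spec next to the
# main (decode) template, both running vLLM with a KV-transfer connector.
spec:
  prefill:
    replicas: 1
    template:
      containers:
        - name: main
          args:
            # Connector name/role are illustrative; see the tutorial manifests.
            - --kv-transfer-config
            - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
  template:
    containers:
      - name: main
        args:
          - --kv-transfer-config
          - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```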
DeepSeek-R1 Multi-Node Deployment Examples
This contains example configurations for deploying the DeepSeek-R1-0528 model using data parallelism (DP) and expert parallelism (EP) across multiple nodes with GPU acceleration.
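Data parallelism and expert parallelism are configured through vLLM's engine flags. As a rough sketch (the flag values are illustrative, and the real multi-node manifests also wire up per-node ranks and addresses, which is elided here):

```yaml
# Illustrative vLLM container args for DP + EP across nodes.
    containers:
      - name: main
        args:
          - --data-parallel-size=16      # total DP ranks across all nodes (example value)
          - --enable-expert-parallel     # shard MoE experts across the DP ranks
          - --tensor-parallel-size=1
```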
Precise Prefix KV Cache Routing
This contains an example configuration demonstrating advanced KV cache routing with precise prefix matching to optimize inference performance by routing requests to instances with matching cached content.
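Precise routing works because each vLLM replica can publish KV-cache events, letting the scheduler track exactly which prefix blocks each instance holds rather than guessing. A hedged sketch of the relevant container args (the JSON schema accepted by `--kv-events-config` varies by vLLM version, so treat the payload below as an assumption and defer to the example configuration):

```yaml
# Illustrative: enable prefix caching and KV-cache event publishing so the
# scheduler can match requests to replicas with the cached prefix.
        args:
          - --enable-prefix-caching
          - --kv-events-config
          - '{"enable_kv_cache_events": true, "publisher": "zmq"}'
```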