Version: Next

LLMInferenceService Tutorials

This is a set of tutorials for deploying the KServe LLM Inference Service in a variety of configurations with a variety of models.

All of the tutorials are available in the KServe repository under `samples/docs`.

End-to-end guide: Run GPT-OSS-20B with KServe and llm-d

E2E GPT OSS

This guide walks through deploying RedHatAI/gpt-oss-20b on Kubernetes using KServe. Steps are ordered from cluster setup through inference, AI gateway routing, optional prefix caching, and monitoring.

Three alternative deployments are detailed here:

  1. default - intelligent inference scheduling with vLLM and the llm-d scheduler
  2. precise prefix cache aware routing - an advanced configuration that takes advantage of vLLM KV-Events
  3. prefill-decode disaggregation - an advanced configuration that runs separate vLLM pods for the prefill and decode stages of inference
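The default variant above can be sketched as a minimal `LLMInferenceService` manifest. This is an illustrative sketch modeled on the KServe samples, not a definitive spec: treat the exact field names (`router`, `replicas`, the `hf://` URI scheme) as assumptions to verify against the manifests in the repository.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: gpt-oss-20b
spec:
  model:
    # Model pulled from Hugging Face; the URI scheme is illustrative.
    uri: hf://RedHatAI/gpt-oss-20b
    name: RedHatAI/gpt-oss-20b
  replicas: 2
  router:
    # Empty objects request the defaults: the llm-d inference
    # scheduler plus gateway-based routing.
    scheduler: {}
    route: {}
    gateway: {}
```

Applied with `kubectl apply -f`, this would stand up load-balanced vLLM replicas behind the llm-d scheduler; the full guide covers the gateway, caching, and monitoring steps around it.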

Single-Node GPU Deployment Examples

Single Node GPU

Contains example configurations for deploying LLM inference services on single-node GPU setups, ranging from basic load balancing to advanced prefill-decode separation with KV cache transfer.
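As a hedged sketch of the prefill-decode separation mentioned above: the idea is to declare a dedicated prefill section so that prefill and decode run in separate vLLM pods with KV cache transfer between them. The `prefill` block, its shape, and the model name below are assumptions for illustration only; consult the sample configurations for the actual schema.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: single-node-pd
spec:
  model:
    # Hypothetical model for illustration.
    uri: hf://RedHatAI/gpt-oss-20b
  # Decode pods are described by the top-level spec...
  replicas: 2
  # ...while an illustrative `prefill` section requests separate
  # pods for the prefill stage, which hand their KV cache to the
  # decode pods.
  prefill:
    replicas: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
```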

DeepSeek-R1 Multi-Node Deployment Examples

DeepSeek R1 GPU RDMA RoCE

This contains example configurations for deploying the DeepSeek-R1-0528 model using data parallelism (DP) and expert parallelism (EP) across multiple nodes with GPU acceleration.
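A multi-node DP/EP layout might be expressed along the following lines. Every field name here (`parallelism`, `data`, `dataLocal`, `expert`, the RDMA resource name) is an assumption sketched from the sample pattern, not the authoritative schema; the actual manifests in the repository are the reference.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1
spec:
  model:
    uri: hf://deepseek-ai/DeepSeek-R1-0528
  # Illustrative parallelism layout: 16 data-parallel ranks total,
  # 8 per node (one per GPU), with expert parallelism enabled for
  # the MoE layers.
  parallelism:
    data: 16
    dataLocal: 8
    expert: true
```

The point of the sketch is the split between a cluster-wide rank count and a per-node rank count, which is what lets the controller schedule the service across multiple GPU nodes over RDMA/RoCE.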

Precise Prefix KV Cache Routing

Precise Prefix KV Cache Routing

This contains an example configuration demonstrating advanced KV cache routing with precise prefix matching, which improves inference performance by routing each request to an instance that already holds matching cached content.