# llm-d v0.5.0
Released: February 3, 2026
Full Release Notes: View on GitHub
The llm-d ecosystem consists of multiple interconnected components that work together to provide distributed inference capabilities for large language models.
## Components
| Component | Description | Repository | Version |
|---|---|---|---|
| Inference Scheduler | The scheduler that makes optimized routing decisions for inference requests to the llm-d inference framework. | llm-d/llm-d-inference-scheduler | v0.5.0 |
| Model Service | A Helm chart that simplifies LLM deployment on llm-d by declaratively managing the Kubernetes resources needed to serve base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, the Gateway API Inference Extension, and LeaderWorkerSet). | llm-d-incubation/llm-d-modelservice | llm-d-modelservice-v0.4.5 |
| Inference Simulator | A lightweight vLLM simulator that emulates responses to vLLM's HTTP REST endpoints. | llm-d/llm-d-inference-sim | v0.7.1 |
| Infrastructure | A Helm chart for deploying the gateway and related infrastructure assets for llm-d. | llm-d-incubation/llm-d-infra | v1.3.6 |
| KV Cache | The libraries for tokenization, KV-events processing, and KV-cache indexing and offloading. | llm-d/llm-d-kv-cache | v0.5.0 |
| Benchmark Tools | An automated workflow for benchmarking LLM inference with the llm-d stack, including tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles. | llm-d/llm-d-benchmark | v0.3.0 |
| Workload Variant Autoscaler | Graduated from experimental to core component. Provides saturation-based autoscaling for llm-d deployments. | llm-d-incubation/workload-variant-autoscaler | v0.5.0 |
| Gateway API Inference Extension | A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets. | kubernetes-sigs/gateway-api-inference-extension | v1.3.0 |
## Container Images
Container images are published to the GitHub Container Registry.
```
ghcr.io/llm-d/<image-name>:<version>
```
| Image | Description | Version | Image Reference |
|---|---|---|---|
| llm-d-cuda | CUDA-based inference image for NVIDIA GPUs | v0.5.0 | ghcr.io/llm-d/llm-d-cuda:v0.5.0 |
| llm-d-xpu | Intel XPU inference image | v0.5.0 | ghcr.io/llm-d/llm-d-xpu:v0.5.0 |
| llm-d-cpu | CPU-only inference image (New in v0.5.0) | v0.5.0 | ghcr.io/llm-d/llm-d-cpu:v0.5.0 |
| llm-d-inference-scheduler | Inference scheduler for optimized routing | v0.5.0 | ghcr.io/llm-d/llm-d-inference-scheduler:v0.5.0 |
| llm-d-routing-sidecar | Routing sidecar for request redirection | v0.5.0 | ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0 |
| llm-d-inference-sim | Lightweight vLLM simulator | v0.7.1 | ghcr.io/llm-d/llm-d-inference-sim:v0.7.1 |
Note: The `llm-d-aws` image is deprecated in this release.
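Assuming a standard container runtime is available, an image from the table above can be pulled by substituting the image name and version into the registry template. This is a minimal sketch; the `docker` CLI is illustrative, and any OCI-compatible client (e.g. `podman`) works the same way:

```shell
#!/bin/sh
# Compose a full image reference from the registry template:
#   ghcr.io/llm-d/<image-name>:<version>
IMAGE_NAME="llm-d-cuda"
VERSION="v0.5.0"
IMAGE_REF="ghcr.io/llm-d/${IMAGE_NAME}:${VERSION}"
echo "${IMAGE_REF}"

# Pull the image (requires network access and a container runtime):
# docker pull "${IMAGE_REF}"
```

Running this prints `ghcr.io/llm-d/llm-d-cuda:v0.5.0`, the same reference shown in the table's first row.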
## Getting Started
Each component has its own detailed documentation page accessible from the sidebar. For a comprehensive view of how these components work together, see the main Architecture Overview.
## Quick Links
- Main llm-d Repository - Core platform and orchestration
- llm-d-incubation Organization - Experimental and supporting components
- Full Release Notes - Release v0.5.0
- All Releases - Complete release history
## Previous Releases
For information about previous versions and their features, visit the GitHub Releases page.
## Contributing
To contribute to any of these components, visit their respective repositories and follow their contribution guidelines. Each component maintains its own development workflow and contribution process.