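I. Overview

1.1 Background

Once a large-model service is live, deployment is rarely what hurts operations: there are docs to follow for that. The real pain is the stream of errors that comes afterwards: users report "the API timed out", the alert channel shows "GPU OOM", and the API logs fill up with 429 Too Many Requests.

These three classes of problems (timeouts, rate limiting, OOM) account for more than 80% of day-to-day incidents on large-model services. Measured on a vLLM cluster serving Qwen/Qwen3.5-35B-A3B-FP8, of the 200-plus alerts in the first two months after launch, timeouts made up 45%, OOM 30%, rate limiting 15%, and everything else 10%.

This article lays out both the troubleshooting methodology and the concrete steps. It skips theory and goes straight to commands, configuration, and real cases. Each error type is walked through end to end: what symptom you see, how to locate the root cause, and how to fix it.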

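1.2 Technical Features

How this article is organized:

Systematic troubleshooting methodology: instead of treating symptoms one at a time, build a decision chain from symptom to root cause
Classified by error type: timeouts, rate limiting, OOM, CUDA errors, and model loading errors each get their own section
From symptom to root cause: what you see first, what to check next, what to change last, with concrete commands at every step
Multi-layer coverage: client → API gateway → inference engine → GPU/system, narrowing down layer by layer

1.3 Applicable Scenarios

vLLM 0.8.x inference service: currently the most widely used inference engine in production, and the main focus of this article
TGI (Text Generation Inference): from HuggingFace; some of the troubleshooting methods carry over
Ollama: common in development and test environments, where OOM is the most frequent problem
API gateway layer: timeout and rate-limit configuration for reverse proxies such as Nginx, Envoy, and Kong
Client layer: timeout handling in Python requests / aiohttp / OpenAI SDK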

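II. Detailed Steps

2.1 Error Taxonomy

Errors from a large-model service fall into five broad categories. Get the overall picture first, then tackle them one by one.

Large-model service errors
├── Timeouts
│   ├── TTFT timeout (first-token latency too high)
│   ├── TPS drop (generation slows down)
│   ├── Connection timeout (TCP handshake / HTTP connection fails)
│   └── Network timeout (request lost at the network layer)
├── Rate limiting
│   ├── HTTP 429 Too Many Requests
│   ├── Inference engine internal queue full
│   └── Token quota exhausted
├── OOM
│   ├── GPU OOM (out of GPU memory)
│   ├── CPU OOM (out of host memory)
│   └── Container OOM Killed (cgroup limit hit)
├── CUDA errors
│   ├── CUDA OOM (not the same as GPU OOM)
│   ├── NCCL communication errors (multi-GPU setups)
│   ├── CUDA device-side assert
│   └── CUDA version incompatibility
└── Model loading
    ├── Weight files corrupted / checksum mismatch
    ├── Not enough GPU memory to load
    ├── safetensors format errors
    └── Not enough disk space

Each is covered in detail below.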

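2.2 Troubleshooting Timeouts

Timeouts are the most common problem and also the easiest to misdiagnose. When a user says "it timed out", the cause may be slow TTFT, low TPS, or a connection that never got established. The first step is to tell which kind of timeout it is.

2.2.1 Quickly Telling Timeout Types Apart

# Use curl to measure end-to-end latency and see the time spent in each phase.
# "stream": true is required so that the first byte arrives with the first token;
# without streaming, the response only starts after generation has finished.
curl -o /dev/null -s -w "\
  DNS:        %{time_namelookup}s\n\
  Connect:    %{time_connect}s\n\
  TLS:        %{time_appconnect}s\n\
  TTFB:       %{time_starttransfer}s\n\
  Total:      %{time_total}s\n" \
  -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3.5-35B-A3B-FP8","messages":[{"role":"user","content":"hello"}],"max_tokens":50,"stream":true}'

Sample output:

DNS:        0.001s
Connect:    0.002s
TLS:        0.000s
TTFB:       1.234s     ← this is the TTFT
Total:      3.456s

Rules of thumb:

Connect > 1s → connection timeout, check the network
TTFB > 5s → TTFT timeout, check the inference engine
Total - TTFB too large → low TPS, the generation phase is slow
Request fails outright → look at the error code: connection refused or timed out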
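The rules of thumb above can be turned into a small helper for scripts that collect curl timings. This is a minimal sketch: the function name and the `gen_limit` default of 30 seconds are assumptions for illustration (the text only says "too large"), while the 1-second and 5-second thresholds come from the rules above.

```python
def classify_timeout(connect_s, ttfb_s, total_s,
                     connect_limit=1.0, ttfb_limit=5.0, gen_limit=30.0):
    """Map curl timing fields (seconds) to a timeout category.

    gen_limit is a hypothetical threshold for the generation phase;
    tune it to your own max_tokens and expected TPS.
    """
    if connect_s > connect_limit:
        return "connection"          # check the network
    if ttfb_s > ttfb_limit:
        return "ttft"                # check the inference engine (prefill, queueing)
    if total_s - ttfb_s > gen_limit:
        return "tps"                 # generation phase is slow
    return "ok"


# The sample output above falls within all thresholds
print(classify_timeout(0.002, 1.234, 3.456))
```

A request that fails outright never produces these timings, so that case still has to be handled by looking at the error code.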

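2.2.2 Troubleshooting TTFT Timeouts

TTFT (Time To First Token) is the time from the user sending a request to receiving the first token. This phase is dominated by prefill, whose compute cost grows with prompt length (faster than linearly for long prompts, see below).

Normal baseline (vLLM 0.7.x, single A100 80G, Qwen/Qwen3.5-35B-A3B-FP8):

Prompt length	Normal TTFT	Watch	Alert
100 tokens	100-300ms	500ms+	1s+
1000 tokens	300-800ms	1.5s+	3s+
4000 tokens	1-3s	5s+	10s+
16000 tokens	3-8s	15s+	30s+

Cause analysis: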

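KV Cache pressure

vLLM manages the KV Cache in blocks with PagedAttention. When usage stays close to full after long uptime or under heavy load, new requests have to wait for free blocks (or trigger preemption), and TTFT goes up.

# Check vLLM's KV Cache usage
curl http://localhost:8000/metrics 2>/dev/null | grep "vllm:gpu_cache_usage_perc"

# Sample output:
# vllm:gpu_cache_usage_perc 0.87
# Pay attention once it goes above 0.9

Fix: give the KV Cache more room by raising gpu-memory-utilization (if the GPU has headroom), or reduce demand by lowering max-num-seqs (the maximum number of concurrent requests).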

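Slow prefill on long prompts

The attention part of prefill scales with the square of the prompt length. A 16K-token prompt is more than 16 times slower than a 1K-token prompt.

# Monitor TTFT, which is dominated by prefill time (histogram: _sum / _count gives the average)
curl http://localhost:8000/metrics 2>/dev/null | grep "vllm:time_to_first_token_seconds"

Fix:

Enable chunked prefill (on by default in some vLLM versions; otherwise pass --enable-chunked-prefill explicitly)
For very long prompts, consider splitting or summarizing them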

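Not enough GPU compute

Too many concurrent requests are running and the GPU is saturated.

# Watch GPU utilization in real time
nvidia-smi dmon -s u -d 1

# Output fields:
# sm: SM (Streaming Multiprocessor) utilization
# mem: memory bandwidth utilization
# enc/dec: encoder/decoder utilization

# sm staying above 95% means GPU compute is maxed out

Fix: scale out (add GPUs), or lower max-num-seqs to cap concurrency.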

Scheduling delay

Requests wait too long in vLLM's scheduler queue.

# Check the length of the waiting queue
curl http://localhost:8000/metrics 2>/dev/null | grep "vllm:num_requests_waiting"

# waiting > 10 means the queue is backing up
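The metric checks in this section can be scripted instead of eyeballed. The sketch below parses the Prometheus text format returned by `/metrics` and applies the two thresholds mentioned above (0.9 for cache usage, 10 for waiting requests). The function names are hypothetical, and the parser makes simplifying assumptions: it ignores labels and trailing timestamps, and when several series share a metric name the last one wins.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {metric_name: value}, dropping labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the {label="..."} section
        try:
            values[name] = float(value_part)
        except ValueError:
            continue  # not a "name value" line, ignore it
    return values


def check_thresholds(metrics, cache_limit=0.9, waiting_limit=10):
    """Return warning codes for the thresholds used in this section."""
    warnings = []
    if metrics.get("vllm:gpu_cache_usage_perc", 0.0) > cache_limit:
        warnings.append("kv_cache_high")
    if metrics.get("vllm:num_requests_waiting", 0.0) > waiting_limit:
        warnings.append("queue_backlog")
    return warnings


sample = (
    "# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage\n"
    'vllm:gpu_cache_usage_perc{model_name="m"} 0.93\n'
    'vllm:num_requests_waiting{model_name="m"} 12.0\n'
)
print(check_thresholds(parse_metrics(sample)))
```

In practice the `sample` string would be replaced by the body fetched from the metrics endpoint.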

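2.2.3 Troubleshooting TPS Drops

TPS (Tokens Per Second) is the number of tokens produced per second during generation. A drop in TPS means users wait longer for the full reply.

Normal baseline (single A100 80G, Qwen/Qwen3.5-35B-A3B-FP8):

Concurrency	Per-request TPS	Total TPS	Notes
1	60-80	60-80	Single-request performance
10	30-50	300-500	Batching pays off clearly
50	15-25	750-1250	Close to saturation
100	8-15	800-1500	Past the sweet spot, per-request experience degrades

Cause analysis:

Batch queue full

# Check the number of requests in the current batch
curl http://localhost:8000/metrics 2>/dev/null | grep "vllm:num_requests_running"

# running close to max-num-seqs means the batch is full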

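GPU memory shortage triggers swapping

vLLM can swap KV Cache out to CPU memory, but TPS collapses once that happens.

# Check CPU-side KV Cache usage (non-zero means blocks have been swapped out)
curl http://localhost:8000/metrics 2>/dev/null | grep "vllm:cpu_cache_usage_perc"

# cpu_cache_usage > 0 means swapping is happening

Fix: lower max-num-seqs or max-model-len to reduce KV Cache demand.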

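GPU thermal throttling

An A100 lowers its clocks automatically once it reaches its slowdown temperature (in the mid-80s °C; `nvidia-smi -q -d TEMPERATURE` shows the exact thresholds for your card), and TPS drops with it.

# Check GPU temperature
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits

# Normal operating temperature: 60-80°C
# > 83°C: throttling begins
# > 90°C: heavy throttling or protective shutdown

Fix: check the cooling system, fan speed, and room temperature. This rarely happens in a data center, but it is common on workstations with an RTX 4090.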

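NUMA affinity

On multi-socket servers, if the vLLM process and the GPU are not on the same NUMA node, PCIe transfers cross the inter-socket link and effective bandwidth can drop by about half.

# Find the NUMA node each GPU is attached to
nvidia-smi topo -m

# Show the NUMA policy of the current shell (inherited by processes launched from it)
numactl -s

# Start vLLM bound to the GPU's NUMA node
numactl --cpunodebind=0 --membind=0 python -m vllm.entrypoints.openai.api_server ...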

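2.2.4 Troubleshooting Connection Timeouts

A connection timeout means the request never reached the inference engine. Check layer by layer from the client to the server.

# Step 1: check whether the service port is listening
ss -tlnp | grep 8000

# Expected output:
# LISTEN  0  128  0.0.0.0:8000  0.0.0.0:*  users:(("python",pid=12345,fd=7))

# No output means vLLM is not running or the port is wrong

# Step 2: check the firewall
iptables -L -n | grep 8000
# or, with firewalld
firewall-cmd --list-all

# Step 3: check network connectivity
# Test from the client machine
telnet <server_ip> 8000
# or
nc -zv <server_ip> 8000

# Step 4: if a reverse proxy is in the path, check Nginx/Envoy
curl -I http://localhost:80/v1/models
# Look at the status code and upstream information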

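Common reasons the vLLM service is not ready:

# Check whether the vLLM process exists
ps aux | grep "vllm.entrypoints" | grep -v grep

# If the process exists but the port is not listening, the model is still loading
# Follow the vLLM logs
journalctl -u vllm -f
# or (show the last 50 lines, then keep following)
docker logs --tail 50 -f vllm-server 2>&1

# Typical log lines while the model loads:
# INFO: Loading model weights...
# INFO: Loading model weights took 45.23 seconds
# INFO: Uvicorn running on http://0.0.0.0:8000
# The service is ready only once "Uvicorn running" appears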

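2.2.5 Timeout Tuning Guide

Timeout configuration for a large-model service spans three layers: client, gateway, and inference engine. The three have to agree, otherwise you get the wasteful situation where the client has already given up while the server is still computing.

Timeout chain:

Client timeout ≥ gateway timeout ≥ inference engine timeout (timeouts shrink from the outside in)

Each outer layer's timeout must be greater than or equal to the layer inside it. Otherwise the outer layer drops the connection before the inner layer has had a chance to time out on its own.

A correct configuration:

Inference engine internal timeout: 120s (maximum allowed execution time)
Nginx proxy_read_timeout: 150s (30s of buffer)
Client timeout: 180s (another 30s of buffer)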
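The layering rule is easy to break when the three values live in different config files, so a quick consistency check can help. A minimal sketch, assuming the timeouts have already been read from each layer's configuration; the function name and the input format are hypothetical.

```python
def validate_timeout_chain(layers):
    """Check that each outer layer's timeout exceeds the layer inside it.

    layers: list of (name, timeout_seconds), ordered from innermost to outermost.
    Returns a list of human-readable problems (empty when the chain is consistent).
    """
    problems = []
    for (inner_name, inner_t), (outer_name, outer_t) in zip(layers, layers[1:]):
        if outer_t <= inner_t:
            problems.append(
                f"{outer_name} ({outer_t}s) should exceed {inner_name} ({inner_t}s)"
            )
    return problems


# The configuration recommended above passes the check
print(validate_timeout_chain([("engine", 120), ("nginx", 150), ("client", 180)]))

# A gateway timeout shorter than the engine timeout is flagged
print(validate_timeout_chain([("engine", 120), ("nginx", 60), ("client", 180)]))
```

The check is strict (`<=` counts as a problem) so that every layer keeps some buffer over the one inside it, matching the 30-second margins used above.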

vLLM timeout parameters:

# vLLM startup parameters
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 64 \
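  --request-timeout 120        # maximum processing time per request, in seconds (confirm with --help that your vLLM version accepts this flag)

Nginx timeout configuration: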

# /etc/nginx/conf.d/llm-proxy.conf
upstream vllm_backend {
    server 127.0.0.1:8000;
    keepalive 32;
}

server {
    listen 80;
    server_name llm.example.com;

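    # Client-side timeouts
    client_body_timeout 30s;
    client_header_timeout 10s;

    # Proxy timeouts
    proxy_connect_timeout 10s;     # timeout for connecting to the backend
    proxy_send_timeout 30s;        # timeout for sending the request to the backend
    proxy_read_timeout 150s;       # max gap between two reads from the backend (must exceed vLLM's request-timeout)

    # SSE (Server-Sent Events) support, required for stream mode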
    proxy_buffering off;
    proxy_cache off;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding on;
    }
}

Python client timeout configuration:

import openai

client = openai.OpenAI(
    base_url="http://llm.example.com/v1",
    api_key="not-needed",
    timeout=openai.Timeout(
        connect=10.0,   # connect timeout
        read=180.0,     # read timeout (waiting for the response)
        write=30.0,     # write timeout (sending the request)
        pool=10.0,      # timeout for acquiring a connection from the pool
    ),
    max_retries=3,       # number of automatic retries
)

# Timeout for a single streaming request (overrides the client default)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    messages=[{"role": "user", "content": "Write a poem"}],
    max_tokens=500,
    stream=True,
    timeout=180,
)

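2.3 Troubleshooting Rate Limiting

2.3.1 Troubleshooting HTTP 429 Too Many Requests

When you see a 429, first work out which layer returned it.

# Send a test request and inspect the response headers
curl -v -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3.5-35B-A3B-FP8","messages":[{"role":"user","content":"hi"}],"max_tokens":10}' \
  2>&1 | grep -E "^< (HTTP|X-|Retry|RateLimit)"

# Key response headers:
# Retry-After: 5                 → retry after this many seconds
# X-RateLimit-Limit: 100         → rate ceiling
# X-RateLimit-Remaining: 0       → remaining quota
# X-RateLimit-Reset: 1710300000  → when the quota resets

Identifying the source of the limit:

Response signature	Source	Action
`X-RateLimit-*` headers	API gateway layer	Adjust the gateway rate-limit configuration
Response body contains `vllm`	vLLM engine	Adjust vLLM concurrency parameters
`Server: nginx` or `Server: envoy`	Reverse proxy layer	Adjust the proxy rate-limit configuration
Response body contains `token quota`	Token billing layer	Top up or contact the administrator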
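The table above can be expressed as a lookup that a client or log processor applies to each 429 response. This is a sketch under assumptions: the function name and return labels are hypothetical, and the order in which the rules are tried (gateway headers first, then body keywords, then the `Server` header) is a choice made here, not something the table prescribes.

```python
def classify_429_source(headers, body):
    """Guess which layer produced a 429 from its headers and body text."""
    lowered = {k.lower(): str(v) for k, v in headers.items()}
    body_l = body.lower()

    if any(name.startswith("x-ratelimit-") for name in lowered):
        return "api_gateway"
    if "token quota" in body_l:
        return "token_billing"
    if "vllm" in body_l:
        return "vllm_engine"
    server = lowered.get("server", "").lower()
    if "nginx" in server or "envoy" in server:
        return "reverse_proxy"
    return "unknown"


print(classify_429_source({"Server": "nginx/1.24.0"}, "<html>429 Too Many Requests</html>"))
```

Because a gateway response usually also carries a `Server` header, checking the more specific `X-RateLimit-*` headers first avoids misattributing gateway limits to the proxy.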

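2.3.2 vLLM Internal Queue Limits

A few vLLM parameters directly control how many requests it can handle at once:

# Show the startup parameters of the running vLLM
ps aux | grep vllm | grep -v grep

# Key parameters:
# --max-num-seqs 256        : maximum number of sequences processed concurrently
# --max-num-batched-tokens 32768 : maximum number of tokens per batch
# --max-model-len 32768     : maximum context length per request

Tuning max-num-seqs:

This parameter caps the number of requests vLLM processes concurrently. Too small and requests get throttled; too large and you risk OOM.

Suggested formula:
max-num-seqs = available GPU memory / (average KV Cache per request)

A100 80G, gpu-memory-utilization=0.9, Qwen3.5-35B-A3B-FP8 weights take 18GB:
Available KV Cache = 80 × 0.9 - 18 = 54 GB
Assuming each request uses 200MB of KV Cache on average (average context of 2048 tokens):
max-num-seqs ≈ 54000 / 200 = 270

Leave a safety margin in practice; 200-256 is a reasonable setting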
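The formula above is simple enough to keep as a helper when comparing GPU models or utilization settings. A minimal sketch that reproduces the worked example; the function name is hypothetical, and it follows the text's arithmetic of 1 GB = 1000 MB.

```python
def estimate_max_num_seqs(gpu_mem_gb, utilization, weights_gb, kv_per_request_mb):
    """Upper bound on concurrent sequences: KV Cache budget / average KV Cache per request."""
    kv_budget_mb = round((gpu_mem_gb * utilization - weights_gb) * 1000)
    if kv_budget_mb <= 0:
        return 0  # the weights alone exceed the memory budget
    return kv_budget_mb // kv_per_request_mb


# A100 80G, gpu-memory-utilization=0.9, 18 GB of weights, 200 MB of KV Cache per request
print(estimate_max_num_seqs(80, 0.9, 18, 200))  # 270
```

The result is an upper bound before any safety margin, which is why the text recommends setting the real value somewhat lower.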

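Tuning max-num-batched-tokens:

# This parameter caps the total number of tokens across all requests in one scheduling step
# Too small: low GPU utilization and poor throughput
# Too large: high prefill latency, because one long request can fill the whole batch

# Suggested values:
# Short conversations (average < 2K tokens): max-num-batched-tokens = 16384
# Long documents (average > 4K tokens): max-num-batched-tokens = 32768
# RAG (long prompt, short response): max-num-batched-tokens = 65536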

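2.3.3 Nginx Rate-Limit Configuration

# /etc/nginx/conf.d/llm-ratelimit.conf

# Define the rate-limit zones
# rate=50r/s means 50 requests per second
# zone=llm_limit:10m stores state in 10 MB of shared memory (enough to track roughly 160,000 IPs)
limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=50r/s;

# Per-API-key rate limiting (finer-grained)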
map $http_authorization $api_key {
    default "";
    "~Bearer (.+)" $1;
}
limit_req_zone $api_key zone=llm_key_limit:10m rate=20r/s;

server {
    listen 80;

    location /v1/chat/completions {
        # Allow a burst of 20 requests without delaying them
        limit_req zone=llm_limit burst=20 nodelay;

        # Per-API-key rate limiting
        limit_req zone=llm_key_limit burst=10 nodelay;

        # Return 429 when limited instead of the default 503
        limit_req_status 429;

        proxy_pass http://vllm_backend;
    }
}

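2.3.4 How the Token Bucket Works

The most common rate-limiting algorithm is the token bucket, and understanding it makes tuning easier. (Nginx's limit_req uses the closely related leaky bucket method, but rate and burst can be reasoned about in the same way.)

Token bucket mechanics:
1. The bucket is refilled with tokens at a fixed rate
2. The bucket holds at most burst tokens (burst capacity)
3. Each request consumes one token
4. When the bucket is empty, new requests are rejected (429)

Example: rate=50r/s, burst=20
- Normal load: a steady 50 requests per second are processed
- Burst: a full bucket absorbs 20 requests at once, so up to about 50+20=70 requests can pass within one second
- Sustained overload: throughput settles at 50r/s and the excess gets 429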
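To make the mechanics concrete, here is a minimal token bucket in Python. It is an illustrative sketch of the generic algorithm described above, not Nginx's implementation; the class name is hypothetical and time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second, holds at most `burst` tokens."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # timestamp of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond the bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # bucket empty: the caller should answer 429


bucket = TokenBucket(rate=50, burst=20)
first_wave = sum(bucket.allow(0.0) for _ in range(100))   # 100 requests at the same instant
second_wave = sum(bucket.allow(1.0) for _ in range(100))  # another 100 one second later
print(first_wave, second_wave)
```

With 100 simultaneous requests only the 20 tokens in the bucket are available, and after a one-second pause the bucket has refilled to its capacity of 20 again, which is why `burst` bounds the size of any instantaneous spike.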

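2.3.5 Rate-Limit Tuning Recommendations

Scenario	Suggested rate	Suggested burst	Notes
Internal RAG system	100r/s per IP	50	Few users, but a single user may issue queries back to back
Public API service	10r/s per API Key	20	Keeps a single user from overwhelming the service
Batch processing	200r/s global	100	80% of backend capacity
Development and testing	No limit	-	Left open in test environments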

2.4 Troubleshooting OOM

OOM is the most painful problem: it usually takes the service down outright, and if the configuration is not changed, the service goes down again after a restart.

2.4.1 Troubleshooting GPU OOM

Symptom: the vLLM log shows torch.cuda.OutOfMemoryError or CUDA error: out of memory, and the service crashes or hangs.

Step 1: look at GPU memory usage

# Monitor GPU memory in real time
nvidia-smi -l 1

# Or a more detailed view of memory allocation
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free,utilization.gpu --format=csv

# Sample output:
# 0, NVIDIA A100-SXM4-80GB, 81920 MiB, 72448 MiB, 9472 MiB, 95 %

# GPU memory used by each process
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv

# Sample output:
# 12345, python, 71680 MiB

Step 2: check vLLM's memory allocation

# The vLLM startup log prints its memory allocation
docker logs vllm-server 2>&1 | grep -E "GPU|memory|cache"

# Typical output:
# INFO: GPU 0: NVIDIA A100-SXM4-80GB (80 GiB)
# INFO: Model weights: 18.2 GiB
# INFO: KV cache: 54.0 GiB (gpu_memory_utilization=0.9)
# INFO: Available KV cache blocks: 27648

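Step 3: analyze the cause of the OOM

What GPU memory is made of:

Total GPU memory = model weights + KV Cache + activations + CUDA context + fragmentation

Typical allocation on an A100 80GB (Qwen/Qwen3.5-35B-A3B-FP8) with gpu-memory-utilization=0.9:
  vLLM budget:   80 × 0.9 = 72 GB
    Model weights: ~18 GB (after FP8 quantization)
    KV Cache:      up to 72 - 18 = 54 GB (less whatever peak activations need)
  Outside the budget: 80 × 0.1 = 8 GB (CUDA context ~2 GB, fragmentation, other processes)

OOM scenarios:

Scenario	Cause	Typical log
OOM at startup	Model too large to fit in GPU memory	`OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB`
OOM at runtime	KV Cache exhausted	`No available memory for the request`
OOM on long conversations	A single request's context is too long	`Request exceeds max_model_len`
OOM under high concurrency	Too many simultaneous requests	`All cache blocks are allocated`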

Step 4: fixes

OOM at startup: model too large

# Option A: lower the precision (if the model supports it)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --dtype auto \
  --quantization fp8   # FP8 quantization

# Option B: tensor parallelism across GPUs
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 2   # use 2 GPUs

# Option C: lower gpu-memory-utilization
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --gpu-memory-utilization 0.85   # down from 0.9 to 0.85

OOM at runtime: not enough KV Cache

# Lower the maximum concurrency
--max-num-seqs 128          # down from 256 to 128

# Lower the maximum context length
--max-model-len 16384       # down from 32768 to 16384

# Or lower both
--max-num-seqs 128 --max-model-len 16384

OOM on long conversations: cap the per-request length

# Limit it with a vLLM startup parameter
--max-model-len 8192

# Enforce a limit at the application layer too; do not rely on vLLM alone
# Truncate or reject over-long requests at the API gateway

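2.4.2 Troubleshooting CPU OOM

Symptom: the server becomes sluggish or the process is killed by the OOM Killer, and dmesg contains Out of memory: Killed process entries.

# Step 1: check the OOM log
dmesg | grep -i "oom\|killed" | tail -20

# Sample output:
# [123456.789] Out of memory: Killed process 12345 (python) total-vm:98765432kB, anon-rss:65432100kB, file-rss:1234kB
# anon-rss is the physical memory the process was actually using

# Step 2: check the current memory state
free -h

# Sample output:
#               total        used        free      shared  buff/cache   available
# Mem:           125Gi       98Gi       2.1Gi       1.2Gi        25Gi        24Gi
# Swap:            0B          0B          0B

# Step 3: find the biggest memory consumers
ps aux --sort=-%mem | head -10

# Step 4: look at the detailed memory breakdown
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|Cached|Buffers"

Common causes of CPU OOM:

Memory peak while loading the model

When vLLM loads a model, it first reads the weight files into CPU memory and then copies them to the GPU. A 35B-parameter FP8 model needs about 35GB on the CPU side. If other services are running on the same host, OOM is easy to hit.

# Check the model file size (to estimate the CPU memory needed)
du -sh /models/Qwen3.5-35B-A3B-FP8/

# CPU memory to reserve = model file size × 1.5 + what the OS and other services need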

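Tokenizer memory leak

Some tokenizers show slow memory growth after processing a large number of requests. We hit this in production: after 3 days of uptime, CPU memory had grown from 20GB to 45GB.

# Track the process's memory over time (RSS in KB, sampled every 60s)
while true; do
    echo "$(date) $(ps aux | grep 'vllm' | grep -v grep | awk '{print $6}') KB"
    sleep 60
done >> /var/log/vllm_mem_monitor.log

Fix: restart the vLLM service on a schedule (for example, every night), or upgrade to the latest version and check whether the leak is fixed.

Swap not configured

Production GPU servers are often set up without swap, so the OOM Killer fires as soon as memory fills up.

# Add swap temporarily (emergency measure)
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Keep in mind that swap hurts performance badly. It is a safety net, not a normal operating mode.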

2.4.3 Troubleshooting Container OOM Killed

Symptom: kubectl describe pod or docker inspect shows OOMKilled.

# Kubernetes
kubectl describe pod vllm-pod-xxxxx | grep -A5 "Last State"

# Sample output:
#     Last State:     Terminated
#       Reason:       OOMKilled
#       Exit Code:    137
#       Started:      Mon, 10 Mar 2026 14:00:00 +0800
#       Finished:     Mon, 10 Mar 2026 16:32:15 +0800

# Show the container's memory limits
kubectl get pod vllm-pod-xxxxx -o jsonpath='{.spec.containers[*].resources}'

# Docker
docker inspect vllm-server --format='{{.State.OOMKilled}}'
docker inspect vllm-server --format='{{.HostConfig.Memory}}'

Root cause: the container memory limit (resources.limits.memory) is set lower than what the workload actually needs.

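Estimating container memory:

Container memory = CPU memory (model loading + tokenizer + request buffers) + system overhead

Measured CPU memory needs for Qwen/Qwen3.5-35B-A3B-FP8:
  Model loading peak:   ~35 GB
  Runtime steady state: ~20 GB
  tokenizer:            ~2 GB
  Request buffers:      ~3 GB
  System overhead:      ~3 GB
  Total:                ~63 GB (peak) / ~28 GB (steady state)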
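When choosing a memory limit it helps to see how much headroom a candidate value leaves over the breakdown above. The sketch below sums the listed components and reports the headroom of a given limit; the function name and dictionary keys are hypothetical, and the figures are the measured values from the text.

```python
def memory_plan(components_gb, limit_gb):
    """Return (peak, steady, headroom) where headroom is the fraction of the limit above peak."""
    peak = sum(components_gb.values())
    # Steady state excludes the one-off model loading peak
    steady = peak - components_gb["model_loading_peak"]
    headroom = limit_gb / peak - 1
    return peak, steady, headroom


components = {
    "model_loading_peak": 35,
    "runtime_steady_state": 20,
    "tokenizer": 2,
    "request_buffers": 3,
    "system_overhead": 3,
}
peak, steady, headroom = memory_plan(components, limit_gb=72)
print(f"peak={peak} GB, steady={steady} GB, headroom={headroom:.1%}")
```

A negative headroom means the candidate limit is below the measured peak and the container can be expected to be OOM-killed during model loading.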

Fix:

# Kubernetes Pod resource configuration
resources:
  requests:
    memory: "32Gi"        # steady-state need
    cpu: "8"
    nvidia.com/gpu: "1"
  limits:
    memory: "72Gi"        # peak need (~63 GB) plus roughly 15% headroom
    cpu: "16"
    nvidia.com/gpu: "1"

# Docker
docker run -d \
  --gpus '"device=0"' \
  --memory=72g \
  --memory-swap=72g \
  --shm-size=8g \
  --name vllm-server \
  vllm/vllm-openai:v0.7.6 \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --gpu-memory-utilization 0.9

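2.4.4 OOM Root-Cause Cheat Sheet

Symptom	Check command	Root cause	Fix
vLLM log `OutOfMemoryError`	`nvidia-smi`	Not enough GPU memory	Lower gpu-memory-utilization / reduce max-num-seqs
dmesg `Killed process`	`dmesg \| grep oom`	Not enough CPU memory	Add memory / add swap / reduce concurrency
Pod OOMKilled	`kubectl describe pod`	Container memory limit too small	Raise resources.limits.memory
vLLM `All cache blocks are allocated`	Check metrics	KV Cache used up	Lower max-num-seqs or max-model-len
Memory keeps growing	Monitor RSS over time	Memory leak	Scheduled restarts / upgrade
OOM at startup	`du -sh model/`	Model too large	Use a quantized version / multi-GPU parallelism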

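2.5 Other Common Errors

2.5.1 CUDA Error: out of memory vs GPU OOM

Many people cannot tell CUDA error: out of memory apart from torch.cuda.OutOfMemoryError.

torch.cuda.OutOfMemoryError: an allocation failure at the PyTorch level, usually because the KV Cache or activations could not get GPU memory. vLLM can often catch this error and degrade gracefully (rejecting new requests instead of crashing).
CUDA error: out of memory: an error at the CUDA driver level, which is lower down. Once it appears the whole CUDA context may be corrupted, and the process has to be restarted.

# If the CUDA context is corrupted, the GPU may need a reset.
# A reset only succeeds when no process is using the GPU, so first list what is running on the card
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv -i 0

# After stopping those processes (make sure nothing important is among them), reset the GPU
nvidia-smi --gpu-reset -i 0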

2.5.2 CUDA Error: device-side assert triggered

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

This error usually means that a numerical anomaly (NaN/Inf) or an out-of-range index during inference tripped an assert inside a CUDA kernel.

Common causes:

Corrupted model weight files
Abnormal values in the input data
FP16 overflow

# Troubleshooting steps

# 1. Check the integrity of the model files (loads every tensor, so it takes a while)
python -c "
from safetensors import safe_open
import os
model_dir = '/models/Qwen3.5-35B-A3B-FP8'
for f in os.listdir(model_dir):
    if f.endswith('.safetensors'):
        try:
            with safe_open(os.path.join(model_dir, f), framework='pt') as sf:
                for key in sf.keys():
                    tensor = sf.get_tensor(key)
                    if tensor.isnan().any() or tensor.isinf().any():
                        print(f'WARNING: {f}/{key} contains NaN/Inf!')
            print(f'OK: {f}')
        except Exception as e:
            print(f'ERROR: {f}: {e}')
"

# 2. Set CUDA_LAUNCH_BLOCKING to get an accurate stack trace
CUDA_LAUNCH_BLOCKING=1 python -m vllm.entrypoints.openai.api_server ...

# 3. If FP16 overflow is the cause, try BF16
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --dtype bfloat16

2.5.3 RuntimeError: NCCL communicator was aborted

NCCL communication fails during multi-GPU (tensor parallel) inference.

RuntimeError: NCCL communicator was aborted on rank 1. Original reason: ...

Common causes and fixes:

# 1. NVLink/PCIe communication fault between GPUs
# Check the GPU topology
nvidia-smi topo -m

# 2. NCCL version does not match the CUDA version
python -c "import torch; print(torch.cuda.nccl.version())"

# 3. NCCL picks the wrong network interface or transport
# Pin NCCL to a specific interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1    # disable InfiniBand (if there is no IB NIC)

# 4. Not enough shared memory
# Check the size of /dev/shm
df -h /dev/shm

# In Docker, increase shm-size
docker run --shm-size=8g ...

# In Kubernetes
# volumes:
#   - name: dshm
#     emptyDir:
#       medium: Memory
#       sizeLimit: 8Gi

2.5.4 Model Loading Failures

Corrupted weight files:

# Compute the file hashes
sha256sum /models/Qwen3.5-35B-A3B-FP8/*.safetensors

# Compare them with the hashes shown on HuggingFace
# If they differ, download again

# Download with huggingface-cli (supports resuming and verification)
pip install huggingface_hub
huggingface-cli download Qwen/Qwen3.5-35B-A3B-FP8 \
  --local-dir /models/Qwen3.5-35B-A3B-FP8 \
  --local-dir-use-symlinks False

Not enough disk space:

# Check free space on the disk holding the model directory
df -h /models/

# Model file sizes for reference:
# Qwen3.5-35B-A3B-FP8:  ~35 GB
# Qwen3-Coder-30B-A3B-FP8: ~30 GB
# Including tokenizer and config files, reserve 40-50 GB

# Find old models to clean up
du -sh /models/*/  | sort -h

safetensors format errors:

# Validate a safetensors file
python -c "
from safetensors import safe_open
try:
    with safe_open('/models/Qwen3.5-35B-A3B-FP8/model-00001-of-00008.safetensors', framework='pt') as f:
        print('Keys:', list(f.keys())[:5])
        print('File is valid')
except Exception as e:
    print(f'File is corrupted: {e}')
"

2.5.5 Handling torch.cuda.OutOfMemoryError

This is the most common CUDA error. The full triage flow:

#!/bin/bash
# End-to-end triage script for GPU out-of-memory errors

echo "=== GPU Memory Status ==="
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv

echo ""
echo "=== GPU Processes ==="
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv

echo ""
echo "=== vLLM Config Check ==="
ps aux | grep vllm | grep -v grep | sed 's/.*python/python/'

echo ""
echo "=== Recent OOM Logs ==="
if [ -f /var/log/vllm/vllm.log ]; then
    grep -i "OutOfMemory\|out of memory\|OOM" /var/log/vllm/vllm.log | tail -5
else
    journalctl -u vllm --since "1 hour ago" | grep -i "OutOfMemory\|out of memory\|OOM" | tail -5
fi

echo ""
echo "=== System Memory ==="
free -h

echo ""
echo "=== Recommendation ==="
# Memory usage of GPU 0 as an integer percentage
GPU_USED=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
GPU_TOTAL=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0)
USAGE_PCT=$((GPU_USED * 100 / GPU_TOTAL))

if [ "$USAGE_PCT" -gt 95 ]; then
    echo "GPU memory usage ${USAGE_PCT}% - CRITICAL"
    echo "Suggestions:"
    echo "  1. Reduce --max-num-seqs"
    echo "  2. Reduce --max-model-len"
    echo "  3. Reduce --gpu-memory-utilization"
    echo "  4. Use tensor parallelism (--tensor-parallel-size 2)"
elif [ "$USAGE_PCT" -gt 85 ]; then
    echo "GPU memory usage ${USAGE_PCT}% - WARNING"
    echo "Consider reducing max-num-seqs by 20%"
else
    echo "GPU memory usage ${USAGE_PCT}% - OK"
fi

III. Example Code and Configuration

3.1 OOM Auto-Recovery Script

#!/bin/bash
# vllm_watchdog.sh - automatic vLLM OOM recovery plus alerting
# Deploy as a systemd service or under supervisor

VLLM_CMD="python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.88 \
  --max-num-seqs 200 \
  --max-model-len 16384 \
  --dtype auto"

HEALTH_URL="http://localhost:8000/health"
ALERT_WEBHOOK="http://alertmanager:9093/api/v1/alerts"
LOG_FILE="/var/log/vllm/watchdog.log"
MAX_RESTARTS=5
RESTART_COUNT=0
COOLDOWN=60

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

send_alert() {
    local severity=$1
    local message=$2
    curl -s -X POST "$ALERT_WEBHOOK" \
      -H "Content-Type: application/json" \
      -d "[{
        \"labels\": {
          \"alertname\": \"VLLMWatchdog\",
          \"severity\": \"$severity\",
          \"instance\": \"$(hostname)\"
        },
        \"annotations\": {
          \"summary\": \"$message\"
        }
      }]" > /dev/null 2>&1
    log "ALERT [$severity]: $message"
}

check_gpu_health() {
    # Check that the GPU responds at all
    if ! nvidia-smi > /dev/null 2>&1; then
        log "GPU health check failed - nvidia-smi error"
        return 1
    fi

    # Check the GPU temperature
    local temp
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0)
    if [ "$temp" -gt 90 ]; then
        log "GPU temperature too high: ${temp}°C"
        send_alert "critical" "GPU temperature ${temp}°C exceeds 90°C threshold"
        return 1
    fi

    return 0
}

start_vllm() {
    log "Starting vLLM..."
    $VLLM_CMD >> /var/log/vllm/vllm.log 2>&1 &
    VLLM_PID=$!
    log "vLLM started with PID $VLLM_PID"

    # Wait for the service to become ready (model loading takes time)
    local wait_count=0
    local max_wait=300  # wait at most 5 minutes
    while [ $wait_count -lt $max_wait ]; do
        # -f makes curl fail on HTTP error codes, not only on connection errors
        if curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
            log "vLLM service is ready"
            return 0
        fi
        sleep 2
        wait_count=$((wait_count + 2))
    done

    log "vLLM failed to start within ${max_wait}s"
    kill "$VLLM_PID" 2>/dev/null
    return 1
}

# Main loop
log "=== vLLM Watchdog Started ==="

while true; do
    # Check GPU health
    if ! check_gpu_health; then
        log "GPU unhealthy, waiting for recovery..."
        sleep 30
        continue
    fi

    # Start vLLM
    if start_vllm; then
        RESTART_COUNT=0
        send_alert "info" "vLLM service started successfully"

        # Health-check loop
        while true; do
            sleep 15

            if ! kill -0 "$VLLM_PID" 2>/dev/null; then
                # The process is gone
                log "vLLM process $VLLM_PID is gone"

                # Check whether it was an OOM kill
                if dmesg | tail -20 | grep -qi "oom.*$VLLM_PID\|killed.*$VLLM_PID"; then
                    log "OOM detected!"
                    send_alert "critical" "vLLM OOM Killed, PID=$VLLM_PID, restarting..."
                else
                    log "vLLM exited unexpectedly"
                    send_alert "warning" "vLLM process died, PID=$VLLM_PID, restarting..."
                fi
                break
            fi

            # HTTP health check (two consecutive failures trigger a restart)
            if ! curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
                log "Health check failed, checking process..."
                sleep 5
                if ! curl -sf --max-time 10 "$HEALTH_URL" > /dev/null 2>&1; then
                    log "Health check failed twice, killing vLLM"
                    send_alert "warning" "vLLM health check failed, restarting..."
                    kill -9 "$VLLM_PID" 2>/dev/null
                    sleep 5
                    break
                fi
            fi
        done
    fi

    # Cap the number of restarts
    RESTART_COUNT=$((RESTART_COUNT + 1))
    if [ $RESTART_COUNT -ge $MAX_RESTARTS ]; then
        send_alert "critical" "vLLM exceeded max restarts ($MAX_RESTARTS), giving up"
        log "Max restarts reached, exiting watchdog"
        exit 1
    fi

    log "Restarting in ${COOLDOWN}s (attempt $RESTART_COUNT/$MAX_RESTARTS)"
    sleep $COOLDOWN
done

The matching systemd service:

# /etc/systemd/system/vllm-watchdog.service
[Unit]
Description=vLLM Watchdog Service
After=network.target nvidia-persistenced.service

[Service]
Type=simple
User=root
ExecStart=/opt/scripts/vllm_watchdog.sh
Restart=on-failure
RestartSec=30
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

# Enable the service
sudo systemctl daemon-reload
sudo systemctl enable vllm-watchdog
sudo systemctl start vllm-watchdog

# Check its status
sudo systemctl status vllm-watchdog
journalctl -u vllm-watchdog -f

3.2 Rate-Limit Configuration Templates

Full Nginx rate-limit template:

# /etc/nginx/conf.d/llm-gateway.conf

# Rate-limit zones
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=30r/s;
limit_req_zone $api_key zone=per_key:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=per_ip_conn:10m;

# Extract the API key from the Authorization header
map $http_authorization $api_key {
    default "anonymous";
    "~*Bearer\s+(.+)" $1;
}

# Request size limit (guards against oversized prompts)
client_max_body_size 10m;

upstream vllm_servers {
    server 10.0.1.10:8000 weight=1;
    server 10.0.1.11:8000 weight=1;
    keepalive 64;
}

server {
    listen 443 ssl;
    server_name llm-api.example.com;

    ssl_certificate /etc/nginx/ssl/cert.pem;
    ssl_certificate_key /etc/nginx/ssl/key.pem;

    # Global timeouts
    proxy_connect_timeout 10s;
    proxy_send_timeout 30s;
    proxy_read_timeout 180s;

    # SSE support
    proxy_buffering off;

    # Chat Completions (most used, so limited most tightly)
    location /v1/chat/completions {
        limit_req zone=per_ip burst=20 nodelay;
        limit_req zone=per_key burst=5 nodelay;
        limit_conn per_ip_conn 10;    # at most 10 concurrent connections per IP
        limit_req_status 429;
        limit_conn_status 429;

        proxy_pass http://vllm_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
    }

    # Completions
    location /v1/completions {
        limit_req zone=per_ip burst=20 nodelay;
        limit_req zone=per_key burst=5 nodelay;
        limit_req_status 429;

        proxy_pass http://vllm_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
    }

    # Embeddings (cheap to compute, so the limit can be looser)
    location /v1/embeddings {
        limit_req zone=per_ip burst=50 nodelay;
        limit_req_status 429;

        proxy_pass http://vllm_servers;
        proxy_set_header Host $host;
    }

    # Models list (not rate limited)
    location /v1/models {
        proxy_pass http://vllm_servers;
    }

    # Health check (not rate limited)
    location /health {
        proxy_pass http://vllm_servers;
    }

    # Custom 429 response body (matches the OpenAI API error format)
    error_page 429 @rate_limited;
    location @rate_limited {
        default_type application/json;
        return 429 '{"error":{"message":"Rate limit exceeded. Please retry after a short wait.","type":"rate_limit_error","code":"rate_limit_exceeded"}}';
    }
}

vLLM tuning profiles:

# Medium-load profile (single A100 80G, Qwen3.5-35B-A3B-FP8)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.88 \
  --max-num-seqs 200 \
  --max-num-batched-tokens 32768 \
  --max-model-len 16384 \
  --enable-chunked-prefill \
  --disable-log-requests

# High-throughput profile (trades per-request latency for total throughput)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 65536 \
  --max-model-len 8192 \
  --enable-chunked-prefill \
  --disable-log-requests

# Low-latency profile (trades throughput for per-request speed)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192 \
  --max-model-len 32768 \
  --disable-log-requests

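3.3 Timeout Tuning Configuration

A full configuration template for the client side of the three-layer timeout chain:

# client_config.py - recommended Python client configuration

import openai
import httpx

# Standard configuration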
client = openai.OpenAI(
    base_url="https://llm-api.example.com/v1",
    api_key="your-api-key",
    timeout=openai.Timeout(
        connect=10.0,
        read=180.0,
        write=30.0,
        pool=10.0,
    ),
    max_retries=3,
    http_client=httpx.Client(
        limits=httpx.Limits(
            max_connections=100,
            max_keepalive_connections=20,
            keepalive_expiry=30,
        ),
    ),
)

# Async client
async_client = openai.AsyncOpenAI(
    base_url="https://llm-api.example.com/v1",
    api_key="your-api-key",
    timeout=openai.Timeout(
        connect=10.0,
        read=180.0,
        write=30.0,
        pool=10.0,
    ),
    max_retries=3,
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=200,
            max_keepalive_connections=50,
            keepalive_expiry=30,
        ),
    ),
)


# Call wrapper with retries and model fallback
import time
import logging

logger = logging.getLogger(__name__)


def call_llm_with_fallback(
    messages: list,
    model: str = "Qwen/Qwen3.5-35B-A3B-FP8",
    max_tokens: int = 2048,
    max_retries: int = 3,
    fallback_model: str = "Qwen/Qwen3-Coder-30B-A3B-FP8",
) -> str:
    """Call the LLM with retries and fallback to a backup model."""

    last_error = None

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                timeout=120,
            )
            return response.choices[0].message.content

        except openai.RateLimitError as e:
            # 429 rate limited: wait, then retry
            retry_after = int(e.response.headers.get("Retry-After", 5))
            logger.warning(f"Rate limited, retry after {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
            last_error = e

        except openai.APITimeoutError as e:
            # Timeout: retry with exponential backoff
            wait_time = 2 ** attempt
            logger.warning(f"Timeout, retry after {wait_time}s (attempt {attempt + 1})")
            time.sleep(wait_time)
            last_error = e

        except openai.APIConnectionError as e:
            # Connection error: retry quickly
            logger.warning(f"Connection error, retry (attempt {attempt + 1}): {e}")
            time.sleep(1)
            last_error = e

        except openai.InternalServerError as e:
            # 5xx error, possibly an OOM: switch to the backup model
            logger.error(f"Server error: {e}")
            if fallback_model and model != fallback_model:
                logger.info(f"Falling back to {fallback_model}")
                model = fallback_model
            last_error = e

    # Every attempt failed
    logger.error(f"All retries failed: {last_error}")
    raise last_error

3.4 Case 1: Full Investigation of a Production GPU OOM

Timeline:

At 3 a.m. on Saturday, February 15, 2026, the alert channel received:

[CRITICAL] vLLM service down on gpu-server-03
Process exited with code 137 (OOM Killed)

Investigation:

03:05 - Log in to the server and confirm the service state

ssh gpu-server-03

# Check the process
ps aux | grep vllm
# No vLLM process

# Check the docker state
docker ps -a | grep vllm
# STATUS: Exited (137) 3 minutes ago

# Confirm it was an OOM
docker inspect vllm-server --format='{{.State.OOMKilled}}'
# true

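03:08 - Check the system log for OOM details

dmesg | tail -30
# [Sat Feb 15 02:58:32 2026] oom-kill: constraint=CONSTRAINT_MEMCG, ...
# [Sat Feb 15 02:58:32 2026] Memory cgroup out of memory: Killed process 28456 (python)
#   total-vm:156789012kB, anon-rss:73400320kB

# anon-rss is about 70 GiB (73.4 GB); together with the other memory charged to the cgroup it hit the 72 GiB container limit

03:10 - Check GPU memory (it should have been released by now)

nvidia-smi
# GPU 0: A100 80GB, 0MiB / 81920MiB used
# As expected, the process is already dead

03:12 - Read the vLLM log for clues from before the OOM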

docker logs vllm-server 2>&1 | tail -100

# 关键日志：
# [02:55:10] INFO: Received request cmpl-abc123, prompt_tokens=28672, max_tokens=4096
# [02:55:11] INFO: Received request cmpl-abc124, prompt_tokens=31000, max_tokens=2048
# [02:55:12] WARNING: Preempting 3 sequences due to insufficient KV cache
# [02:57:45] INFO: Received request cmpl-abc189, prompt_tokens=30500, max_tokens=8192
# [02:58:30] ERROR: torch.cuda.OutOfMemoryError: CUDA out of memory
# [02:58:31] CRITICAL: Fatal error, shutting down

03:15 - Identify the root cause

Between 02:55 and 02:58 a burst of very long prompts arrived. Each prompt was 28K-31K tokens, close to the max-model-len=32768 limit.

Check the startup arguments:

docker logs vllm-server 2>&1 | head -20
# --max-model-len 32768
# --max-num-seqs 256
# --gpu-memory-utilization 0.9

The problem is the combination of max-model-len=32768 and max-num-seqs=256. Most requests carry 2K-4K token prompts, so the KV cache is normally sufficient. That night, however, a batch job was summarizing long documents with 30K+ token prompts, and each of those requests needed roughly 10 times the usual KV cache.

KV cache per token for Qwen3.5-35B-A3B-FP8 (FP8 weights, FP16 KV cache):
= num_layers × 2 (K and V) × num_kv_heads × head_dim × 2 bytes
= 64 × 2 × 8 × 128 × 2 = 262,144 bytes ≈ 0.25 MB/token

Total KV cache available: 54 GB
Typical request (2K-4K prompt): 2,000-4,000 × 0.25 MB ≈ 0.5-1 GB
Long-document request (30K prompt): 30,000 × 0.25 MB = 7.5 GB
8 long-document requests at once: 8 × 7.5 GB = 60 GB > 54 GB available

Root cause confirmed: the batch job issued 8 or more 30K-token requests concurrently, and their combined KV cache demand exceeded the available GPU memory.
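The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the layer count, KV head count, head dimension and the 54 GB budget are the figures quoted in this case study (not values read from the model's config.json), and the function names are made up for illustration.

```python
# Figures quoted in the case study above; treat them as assumptions.
NUM_LAYERS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2            # FP16 KV cache
KV_BUDGET_BYTES = 54 * 1024**3   # 54 GB of KV cache


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, elem_bytes: int) -> int:
    """KV cache bytes for one token; the factor 2 covers keys and values."""
    return layers * 2 * kv_heads * head_dim * elem_bytes


def max_concurrent_requests(tokens_per_request: int, budget_bytes: int, per_token_bytes: int) -> int:
    """How many requests of the given length fit in the KV cache budget."""
    return budget_bytes // (tokens_per_request * per_token_bytes)


per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_ELEMENT)
print(per_token)                                                    # 262144
print(max_concurrent_requests(30_000, KV_BUDGET_BYTES, per_token))  # 7
```

With these numbers only 7 requests of 30K tokens fit at once, which matches the observation that the eighth one pushes demand past the budget.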

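03:25 - Fix and restart

# Remove the exited container so its name can be reused
docker rm vllm-server

# Stopgap: cap per-request length by lowering max-model-len from 32768 to 16384
docker run -d \
  --gpus '"device=0"' \
  --memory=72g \
  --shm-size=8g \
  --name vllm-server \
  vllm/vllm-openai:v0.7.6 \
  --model Qwen/Qwen3.5-35B-A3B-FP8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 200 \
  --max-model-len 16384

# Verify that the service is up
sleep 120
curl http://localhost:8000/v1/models
# Returns the model list as expected

03:40 - Long-term fixes (handled on the next working day)

Run batch jobs through a queue instead of firing them concurrently
Add a prompt-length check at the API gateway; prompts over 16K tokens go to a dedicated queue
Raise the container memory limit from 72G to 96G
Add monitoring and alerting on KV cache usage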
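The prompt-length check in the second item could look roughly like the sketch below. It assumes the gateway already knows the prompt's token count (for example from a tokenizer call); the queue names and the function are hypothetical and not part of any gateway product.

```python
LONG_PROMPT_THRESHOLD = 16_384  # tokens; the 16K cut-off from the list above


def choose_queue(prompt_tokens: int, threshold: int = LONG_PROMPT_THRESHOLD) -> str:
    """Pick a queue name based on prompt length (queue names are hypothetical)."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Long prompts are isolated so they cannot starve regular traffic of KV cache.
    return "long-context" if prompt_tokens > threshold else "default"


print(choose_queue(3_000))    # default
print(choose_queue(30_000))   # long-context
```

Keeping long prompts in their own queue lets that queue run at a low concurrency without slowing down short requests.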

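3.5 Case 2: Tracking down intermittent timeouts

Symptoms:

Every afternoon between 2:00 and 2:30 p.m. the API timeout rate jumped from 0.1% to 15%. The rest of the day was normal. This went on for a week.

Investigation:

Step 1: confirm the timeout pattern

# Pull the timeout rate from Prometheus at 5-minute resolution (step=300)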
curl -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(vllm_request_failure_total{error="timeout"}[5m])) / sum(rate(vllm_request_success_total[5m])) * 100' \
  --data-urlencode 'start=2026-03-06T00:00:00Z' \
  --data-urlencode 'end=2026-03-12T23:59:59Z' \
  --data-urlencode 'step=300' | python -m json.tool

This confirmed a clear timeout spike between 14:00 and 14:30 every day.

Step 2: look at GPU metrics for that window

# GPU utilization
curl -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=nvidia_gpu_utilization{gpu="0"}' \
  --data-urlencode 'start=2026-03-12T13:50:00Z' \
  --data-urlencode 'end=2026-03-12T14:40:00Z' \
  --data-urlencode 'step=30'

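GPU utilization jumped from 60% to 100% at 14:00 and stayed there for 30 minutes.

Step 3: look at request volume

# QPS over time
# Before 14:00: steady at 30 QPS
# 14:00-14:30: spikes to 150 QPS
# After 14:30: back down to 30 QPS

QPS rose fivefold and saturated the GPU, so regular requests waited in the queue until they timed out.

Step 4: find where the requests come from

# Count requests per source IP in the Nginx access log for 14:00-14:30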
awk '$4 ~ /12\/Mar\/2026:14:0/ || $4 ~ /12\/Mar\/2026:14:1/ || $4 ~ /12\/Mar\/2026:14:2/' \
  /var/log/nginx/llm-access.log | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -10

# Output:
# 3847 10.0.2.55
# 3201 10.0.2.56
#  245 10.0.3.12
#  189 10.0.3.15
#   ...

10.0.2.55 and 10.0.2.56 account for more than 90% of the requests.
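When the time window does not line up with a neat regex, the same per-IP count can be done in Python. The sketch below assumes the default Nginx "combined" log format, where the first field is the client IP and the fourth is the timestamp; `top_clients` is a name chosen for this example.

```python
from collections import Counter
from datetime import datetime


def top_clients(lines, start, end, limit=10):
    """Count requests per client IP for log lines with start <= timestamp < end."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # not an access-log line
        try:
            # Field 4 looks like "[12/Mar/2026:14:03:27"
            ts = datetime.strptime(parts[3].lstrip("["), "%d/%b/%Y:%H:%M:%S")
        except ValueError:
            continue  # unparseable timestamp, skip the line
        if start <= ts < end:
            counts[parts[0]] += 1
    return counts.most_common(limit)
```

A typical call would pass an open file object for the access log plus `datetime(2026, 3, 12, 14, 0)` and `datetime(2026, 3, 12, 14, 30)` as the window bounds.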

Step 5: identify the culprit

# Find out which services these two IPs belong to
nslookup 10.0.2.55
# batch-worker-01.internal

nslookup 10.0.2.56
# batch-worker-02.internal

They are batch-processing workers. Check their crontab:

ssh batch-worker-01 "crontab -l"
# 0 14 * * * /opt/scripts/daily_summary.sh

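Every day at 2 p.m. both batch workers start the daily summary job at the same time, summarize 1,000+ articles, and send the requests to the LLM service concurrently. That saturates the inference engine almost instantly.

Solution:

Stagger the batch jobs

# Edit the crontab so the two machines run 30 minutes apart, off-peak
# batch-worker-01: moved to 2:00 a.m.
0 2 * * * /opt/scripts/daily_summary.sh

# batch-worker-02: moved to 2:30 a.m.
30 2 * * * /opt/scripts/daily_summary.sh

Add concurrency and rate limits to the batch job

# daily_summary.py: add concurrency and rate limits
import asyncio
import aiohttp

CONCURRENCY = 5   # at most 5 requests in flight
RATE_LIMIT = 10   # pacing target, in requests per second
MAX_RETRIES = 3   # retries after an HTTP 429

semaphore = asyncio.Semaphore(CONCURRENCY)

async def summarize_article(session, article):
    async with semaphore:
        for attempt in range(MAX_RETRIES + 1):
            await asyncio.sleep(1.0 / RATE_LIMIT)  # crude per-task pacing
            async with session.post(
                "http://llm-api.example.com/v1/chat/completions",
                json={
                    "model": "Qwen/Qwen3.5-35B-A3B-FP8",
                    "messages": [{"role": "user", "content": f"Summarize: {article}"}],
                    "max_tokens": 500,
                },
                timeout=aiohttp.ClientTimeout(total=120),
            ) as resp:
                if resp.status == 429 and attempt < MAX_RETRIES:
                    # Honor Retry-After (assumed to be in seconds), then retry
                    retry_after = int(resp.headers.get("Retry-After", 5))
                    await asyncio.sleep(retry_after)
                    continue
                resp.raise_for_status()
                return await resp.json()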
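The sleep-based pacing in the script above delays each task but does not cap the combined request rate across tasks. If a hard cap matters, one possible approach is a shared pacer that hands out evenly spaced start slots. This is a sketch, not the job's actual implementation; the `Pacer` class and its methods are invented for the example.

```python
import asyncio
import time


class Pacer:
    """Hands out start slots spaced 1/rate seconds apart, shared by all tasks."""

    def __init__(self, rate_per_sec: float):
        self.interval = 1.0 / rate_per_sec
        self.next_slot = 0.0

    def reserve(self, now: float) -> float:
        """Book the next free slot and return how long the caller must wait."""
        slot = max(now, self.next_slot)
        self.next_slot = slot + self.interval
        return slot - now

    async def wait(self) -> None:
        # reserve() contains no await, so it is atomic within one event loop.
        delay = self.reserve(time.monotonic())
        if delay > 0:
            await asyncio.sleep(delay)
```

Each task would call `await pacer.wait()` on a single shared `Pacer(10)` right before sending, which keeps the total rate at or below 10 requests per second regardless of how many tasks are running.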

Add a dedicated rate limit for batch jobs at the API gateway

# Rate-limit the batch-worker subnet separately
geo $is_batch_worker {
    default 0;
    10.0.2.0/24 1;
}

map $is_batch_worker $batch_limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $batch_limit_key zone=batch_limit:1m rate=10r/s;

location /v1/chat/completions {
    limit_req zone=batch_limit burst=5 nodelay;
    # ... other settings
}

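After the fix the timeout rate dropped below 0.05%, and the problem did not recur over the following month.

4. Best practices and caveats

4.1 Best practices

4.1.1 Preventive configuration checklist

Run through this checklist before deploying an LLM service: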

#!/bin/bash
# pre_deploy_check.sh - pre-deployment check script

echo "=== Pre-deployment Checklist ==="

# 1. GPU driver version (compare the major version only)
echo -n "[1] GPU Driver: "
DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
DRIVER_MAJOR=${DRIVER%%.*}
if [[ "$DRIVER_MAJOR" =~ ^[0-9]+$ ]] && (( DRIVER_MAJOR >= 560 )); then
    echo "OK ($DRIVER)"
else
    echo "FAIL ($DRIVER < 560)"
fi

# 2. CUDA version
echo -n "[2] CUDA Version: "
CUDA=$(nvcc --version 2>/dev/null | grep "release" | awk '{print $5}' | tr -d ',')
echo "$CUDA"

# 3. GPU memory size
echo -n "[3] GPU Memory: "
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# 4. Host memory
echo -n "[4] System Memory: "
free -h | awk '/Mem:/{print $2}'

# 5. Disk space (model directory)
echo -n "[5] Model Dir Space: "
df -h /models/ | awk 'NR==2{print $4, "available"}'

# 6. Model files present
echo -n "[6] Model Files: "
MODEL_DIR="/models/Qwen3.5-35B-A3B-FP8"
if [ -f "$MODEL_DIR/config.json" ] && [ -f "$MODEL_DIR/tokenizer.json" ]; then
    SAFETENSORS=$(ls "$MODEL_DIR"/*.safetensors 2>/dev/null | wc -l)
    echo "OK ($SAFETENSORS safetensors files)"
else
    echo "FAIL (missing config/tokenizer)"
fi

# 7. Docker version
echo -n "[7] Docker: "
docker --version 2>/dev/null | awk '{print $3}'

# 8. Network port
echo -n "[8] Port 8000: "
if ss -tlnp | grep -q ':8000 '; then
    echo "IN USE (conflict!)"
else
    echo "Available"
fi

# 9. /dev/shm size
echo -n "[9] Shared Memory: "
df -h /dev/shm | awk 'NR==2{print $2}'

# 10. GPU/NUMA topology
echo "[10] NUMA Topology:"
nvidia-smi topo -m 2>/dev/null | head -5

echo "=== Check Complete ==="

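4.1.2 Sanitizing error messages

A production API must not expose internal error details to clients.

# Nginx: intercept upstream error responses and replace them
location /v1/ {
    proxy_pass http://vllm_backend;

    # Intercept upstream 5xx responses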
    proxy_intercept_errors on;

    error_page 500 502 503 504 @internal_error;
}

location @internal_error {
    default_type application/json;
    return 500 '{"error":{"message":"Internal server error. Please retry later.","type":"server_error","code":"internal_error"}}';
}

# Error sanitization at the Python API layer
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse
import logging

app = FastAPI()
logger = logging.getLogger(__name__)

@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    # Log the full error, including the traceback, internally
    logger.error("Unhandled exception", exc_info=exc)

    # Return a sanitized message to the client
    if "CUDA" in str(exc) or "OutOfMemory" in str(exc):
        return JSONResponse(
            status_code=503,
            content={"error": {"message": "Service temporarily unavailable. Please retry.", "type": "server_error"}}
        )
    elif "timeout" in str(exc).lower():
        return JSONResponse(
            status_code=504,
            content={"error": {"message": "Request timeout. Please reduce input length or retry.", "type": "timeout_error"}}
        )
    else:
        return JSONResponse(
            status_code=500,
            content={"error": {"message": "Internal server error.", "type": "server_error"}}
        )

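4.1.3 Self-healing

Three layers of protection: health checks, automatic restarts, and graceful degradation:

# Kubernetes health-check configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  # apps/v1 requires a selector and matching pod labels; the label value is an assumed example
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.6
          args:
            - "--model"
            - "Qwen/Qwen3.5-35B-A3B-FP8"
            - "--gpu-memory-utilization"
            - "0.88"
            - "--max-num-seqs"
            - "200"

          # Liveness probe: is the process still alive?
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300   # model loading takes time
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3        # restart after 3 consecutive failures

          # Readiness probe: is the service ready to receive traffic?
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 2        # remove from the Service after 2 consecutive failures

          # Startup probe: how long to wait for the model to load
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30       # wait at most 60 + 30×10 = 360 seconds

          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "1"
            limits:
              memory: "96Gi"
              cpu: "16"
              nvidia.com/gpu: "1"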

4.1.4 Graceful degradation

When the primary model is unavailable, switch automatically to a lighter-weight model:

# graceful_degradation.py

import time
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ModelEndpoint:
    name: str
    url: str
    model: str
    priority: int       # lower value = higher priority
    max_tokens: int
    healthy: bool = True
    last_check: float = 0
    failure_count: int = 0


class ModelRouter:
    def __init__(self):
        self.endpoints = [
            ModelEndpoint(
                name="primary",
                url="http://gpu-server-01:8000",
                model="Qwen/Qwen3.5-35B-A3B-FP8",
                priority=1,
                max_tokens=4096,
            ),
            ModelEndpoint(
                name="secondary",
                url="http://gpu-server-02:8000",
                model="Qwen/Qwen3.5-35B-A3B-FP8",
                priority=2,
                max_tokens=4096,
            ),
            ModelEndpoint(
                name="fallback",
                url="http://gpu-server-03:8000",
                model="Qwen/Qwen3-Coder-30B-A3B-FP8",
                priority=3,
                max_tokens=2048,
            ),
        ]

    def get_healthy_endpoint(self) -> ModelEndpoint:
        """Return the healthy endpoint with the highest priority."""
        healthy = [ep for ep in self.endpoints if ep.healthy]
        if not healthy:
            # Every endpoint is unhealthy: force a retry on the primary
            logger.critical("All endpoints unhealthy, forcing primary")
            self.endpoints[0].healthy = True
            return self.endpoints[0]
        return sorted(healthy, key=lambda x: x.priority)[0]

    def mark_unhealthy(self, endpoint: ModelEndpoint):
        endpoint.healthy = False
        endpoint.failure_count += 1
        logger.warning(f"Marked {endpoint.name} as unhealthy (failures: {endpoint.failure_count})")

    def mark_healthy(self, endpoint: ModelEndpoint):
        endpoint.healthy = True
        endpoint.failure_count = 0

    def health_check_all(self):
        """Periodic health check that restores endpoints once they recover."""
        import requests
        for ep in self.endpoints:
            try:
                resp = requests.get(f"{ep.url}/health", timeout=5)
                if resp.status_code == 200:
                    if not ep.healthy:
                        logger.info(f"Endpoint {ep.name} recovered")
                    self.mark_healthy(ep)
                else:
                    self.mark_unhealthy(ep)
            except Exception:
                self.mark_unhealthy(ep)
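The router above takes an endpoint out of rotation on the first failed check. A common variation, shown here only as a sketch and not as part of the original router, is to wait for several consecutive failures and then allow a retry after a cooldown. The class name, thresholds and method names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointHealth:
    """Tracks consecutive failures for one endpoint (illustrative only)."""
    failure_threshold: int = 3      # failures in a row before the endpoint is skipped
    cooldown_seconds: float = 30.0  # how long to skip it before trying again
    consecutive_failures: int = 0
    opened_at: Optional[float] = None

    def record_failure(self, now: float) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = now  # start (or restart) the cooldown

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def is_available(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let one request through to probe the endpoint.
        return now - self.opened_at >= self.cooldown_seconds
```

Passing the current time in explicitly (for example `time.monotonic()`) keeps the class easy to test and avoids marking an endpoint down because of a single slow health check.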

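4.2 Caveats

4.2.1 Common errors at a glance

Error message	Status code	Cause	Fix
`torch.cuda.OutOfMemoryError: CUDA out of memory`	-	Not enough GPU memory	Lower gpu-memory-utilization / max-num-seqs / max-model-len
`CUDA error: device-side assert triggered`	-	Corrupted model weights or numerical anomalies	Re-download the model; debug with CUDA_LAUNCH_BLOCKING=1
`RuntimeError: NCCL communicator was aborted`	-	Multi-GPU communication failure	Check NVLink/PCIe; increase shm-size
`HTTP 429 Too Many Requests`	429	Request was rate limited	Raise the rate limit or burst; add client-side retries
`HTTP 503 Service Unavailable`	503	Service unavailable (loading or crashed)	Check the service process and logs
`HTTP 504 Gateway Timeout`	504	Request timed out	Raise the timeout settings; check prompt length
`Connection refused`	-	Service not started or wrong port	Check the process state and listening port
`OSError: [Errno 28] No space left on device`	-	Disk is full	Free up disk space or expand the volume
`ValueError: The model's max seq len (X) is larger than...`	-	max-model-len is larger than the number of tokens that fit in the KV cache	Lower max-model-len or raise gpu-memory-utilization
`KeyError: 'model'`	400	Request body is missing the model field	Check the client request format
`Model not found: xxx`	404	Model name does not match	Call /v1/models to confirm the model name
`Tokenizer initialization error`	-	Tokenizer files missing or corrupted	Re-download tokenizer.json and tokenizer_config.json
`AssertionError: assert googly_batch_size > 0`	-	vLLM internal error (empty batch)	Upgrade vLLM; this is usually a known bug
`RuntimeError: expected scalar type Float but found Half`	-	Precision (dtype) mismatch	Pass --dtype float16 or --dtype bfloat16
`CUDA error: no kernel image is available`	-	GPU architecture not supported	Check that the GPU is Ampere or newer; update PyTorch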

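5. Troubleshooting and monitoring

5.1 A systematic troubleshooting decision tree

A text version of the decision tree, working from the symptom toward the root cause:

User reports "the LLM service is misbehaving"
│
├── Symptom: requests return errors
│   ├── HTTP 429 → rate limiting
│   │   ├── Check the Nginx rate-limit config → raise rate/burst
│   │   ├── Check the vLLM queue → raise max-num-seqs
│   │   └── Check the token quota → top up or raise the quota
│   │
│   ├── HTTP 500/502/503 → server-side error
│   │   ├── Is the process alive?
│   │   │   ├── No → check for an OOM kill (dmesg) → add memory or reduce load
│   │   │   └── Yes → read the vLLM logs → handle by error type
│   │   └── Multiple instances? Check the load balancer health checks
│   │
│   ├── HTTP 504 → timeout
│   │   ├── High TTFB? → TTFT timeout
│   │   │   ├── GPU utilization > 95%? → scale out or rate limit
│   │   │   ├── Long waiting queue? → lower concurrency
│   │   │   └── Prompt too long? → cap input length
│   │   └── High Total but normal TTFB? → low TPS
│   │       ├── GPU temperature > 83°C? → cooling problem
│   │       └── KV cache swapping? → lower max-num-seqs
│   │
│   └── Connection refused → service not running
│       ├── Is the port listening? → ss -tlnp
│       ├── Firewall? → iptables -L
│       └── Model still loading? → read the startup logs
│
├── Symptom: requests are slow (no errors, but high latency)
│   ├── All requests are slow → global problem
│   │   ├── GPU utilization → load too high
│   │   ├── segment compaction → wait for it to finish
│   │   └── Host memory swapping → add memory
│   └── Some requests are slow → problem with individual requests
│       ├── Long prompt → expected; optimize prefill
│       └── Specific user → check whether rate limiting was triggered
│
└── Symptom: the service process disappeared
    ├── Exit code 137 → OOM Killed
    │   ├── Container OOM → raise the memory limit
    │   └── System OOM → add memory or swap
    ├── Exit code 134 → SIGABRT (assertion)
    │   └── CUDA error → check GPU state and driver
    └── Exit code 139 → SIGSEGV (segmentation fault)
        └── Usually a bug → upgrade and file an issue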
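The exit codes in the last branch follow the shell convention that a process killed by signal N exits with code 128 + N. A small helper can decode them; this is a sketch, and the hint strings simply restate the branches of the tree above.

```python
# Signal numbers as used on Linux; the hints restate the decision tree.
SIGNAL_HINTS = {
    6: ("SIGABRT", "assertion failure, often a CUDA or NCCL error"),
    9: ("SIGKILL", "likely OOM Killed; confirm with dmesg or docker inspect"),
    11: ("SIGSEGV", "segmentation fault, usually a bug"),
}


def describe_exit_code(code: int) -> str:
    """Translate a process exit code into a short diagnosis hint."""
    if code == 0:
        return "clean exit"
    if code > 128:
        signum = code - 128
        name, hint = SIGNAL_HINTS.get(signum, (f"signal {signum}", "no hint available"))
        return f"killed by {name}: {hint}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

Note that SIGKILL can also come from a manual `kill -9` or an orchestrator, so exit code 137 alone is a hint rather than proof of an OOM kill.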

5.2 Log keyword quick reference

Grepping the vLLM logs for these keywords quickly narrows down the type of problem:

# OOM
grep -i "OutOfMemory\|out of memory\|OOM\|oom_kill" /var/log/vllm/vllm.log

# CUDA errors
grep -i "CUDA error\|cuda.*error\|RuntimeError.*CUDA" /var/log/vllm/vllm.log

# Timeouts
grep -i "timeout\|timed out\|deadline exceeded" /var/log/vllm/vllm.log

# Rate limiting
grep -i "rate limit\|too many\|queue full\|preempting" /var/log/vllm/vllm.log

# Model loading
grep -i "loading model\|model loaded\|weight.*load\|safetensors" /var/log/vllm/vllm.log

# KV cache
grep -i "cache.*block\|kv cache\|cache_usage\|swap" /var/log/vllm/vllm.log

# NCCL (multi-GPU)
grep -i "NCCL\|nccl\|communicator\|all_reduce" /var/log/vllm/vllm.log

# Process exit
grep -i "fatal\|shutting down\|exit\|signal\|killed" /var/log/vllm/vllm.log

What each keyword points to:

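Keyword	Problem area	Urgency
`OutOfMemory`	GPU/CPU memory exhausted	High - the service may crash
`CUDA error`	GPU fault	High - needs a restart
`timeout`	Request timeout	Medium - affects user experience
`preempting`	KV cache shortage, preemption	Medium - some requests are interrupted
`queue full`	Concurrency too high	Medium - new requests are rejected
`NCCL`	Multi-GPU communication	High - may hang the service
`swap`	KV cache swapped out to CPU	Low - slower, but correctness is unaffected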
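For a first pass over a large log file, the keyword groups can be counted in one go. The sketch below uses a subset of the patterns from the grep commands; the category labels and the `classify` function are names chosen for this example.

```python
import re
from collections import Counter

# A subset of the keyword groups above, compiled case-insensitively.
PATTERNS = {
    "oom": re.compile(r"OutOfMemory|out of memory|oom_kill", re.I),
    "cuda": re.compile(r"CUDA error", re.I),
    "timeout": re.compile(r"timeout|timed out|deadline exceeded", re.I),
    "kv_cache": re.compile(r"preempting|kv cache|swap", re.I),
    "nccl": re.compile(r"NCCL|communicator|all_reduce", re.I),
}


def classify(lines):
    """Count how many log lines match each problem category."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1  # a line may count toward several categories
    return counts
```

Calling `classify(open("/var/log/vllm/vllm.log"))` gives a quick histogram that shows which section of this guide to read first.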

5.3 Monitoring metrics and alerting

5.3.1 Prometheus metrics

Key metrics exposed by vLLM 0.7.x (via the /metrics endpoint):

# Request latency
histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))

# TTFT (Time To First Token)
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# TPOT (Time Per Output Token); per-request TPS is its inverse
histogram_quantile(0.50, sum(rate(vllm:time_per_output_token_seconds_bucket[5m])) by (le))

# GPU KV cache usage
vllm:gpu_cache_usage_perc

# CPU cache usage (swap)
vllm:cpu_cache_usage_perc

# Waiting queue length
vllm:num_requests_waiting

# Requests currently running
vllm:num_requests_running

# Request success/failure counters
rate(vllm:request_success_total[5m])
rate(vllm:request_failure_total[5m])

# Prompt token throughput
rate(vllm:prompt_tokens_total[5m])

# Generation token throughput
rate(vllm:generation_tokens_total[5m])
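For a quick check without Prometheus, the gauges can be read straight from the /metrics text. The parser below is a simplified sketch: it assumes each sample line ends with the value (no trailing timestamp) and keeps only the last sample seen per metric name. The sample text and the `read_gauges` name are made up for illustration.

```python
def read_gauges(metrics_text: str, wanted: set) -> dict:
    """Extract selected metric values from Prometheus text exposition output."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = float(value)
    return values


# Made-up sample in the shape of a /metrics response
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
vllm:num_requests_waiting{model_name="demo"} 12.0
vllm:num_requests_running{model_name="demo"} 200.0
"""

print(read_gauges(SAMPLE, {"vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting"}))
```

In practice the text would come from an HTTP GET on the /metrics endpoint of the inference server.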

5.3.2 Prometheus alerting rules

# /etc/prometheus/rules/vllm_alerts.yml
groups:
  - name: vllm_critical
    rules:
      # GPU OOM risk
      - alert: VLLMGPUCacheNearFull
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU KV cache usage above 95%"
          description: "Instance {{ $labels.instance }}: usage is {{ $value | humanizePercentage }}, at risk of OOM"
          runbook: "Lower max-num-seqs or restart the service"

      # Service unavailable
      - alert: VLLMServiceDown
        expr: up{job="vllm"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "vLLM service is down"
          description: "Instance {{ $labels.instance }} has been unreachable for 1 minute"

      # Request failure rate
      - alert: VLLMHighErrorRate
        expr: |
          sum(rate(vllm:request_failure_total[5m]))
          /
          (sum(rate(vllm:request_success_total[5m])) + sum(rate(vllm:request_failure_total[5m])))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "vLLM request failure rate above 5%"
          description: "Current failure rate: {{ $value | humanizePercentage }}"

  - name: vllm_warning
    rules:
      # TTFT too high
      - alert: VLLMHighTTFT
        expr: histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TTFT P99 above 5 seconds"
          description: "Current TTFT P99: {{ $value }}s"

      # Waiting queue too long
      - alert: VLLMQueueBacklog
        expr: vllm:num_requests_waiting > 50
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "More than 50 requests waiting in the queue"
          description: "Currently waiting: {{ $value }}; consider scaling out or rate limiting"

      # GPU temperature
      - alert: GPUTemperatureHigh
        expr: nvidia_smi_temperature_gpu > 83
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 83°C"
          description: "GPU {{ $labels.gpu }}: {{ $value }}°C, may trigger thermal throttling"

      # KV cache swapping
      - alert: VLLMKVCacheSwap
        expr: vllm:cpu_cache_usage_perc > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KV cache is being swapped to CPU"
          description: "CPU cache usage: {{ $value | humanizePercentage }}; TPS will drop"

      # End-to-end latency too high
      - alert: VLLMHighLatency
        expr: histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "End-to-end P99 latency above 30 seconds"
          description: "Current P99: {{ $value }}s"

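5.3.3 Suggested Grafana dashboard panels

Recommended panels, arranged by row:

Row 1: Overview

Service status (Up/Down indicator)
Current QPS (live number)
Request success rate (percentage)
Waiting queue length

Row 2: Latency

TTFT P50/P95/P99 time series
E2E latency P50/P95/P99 time series
Per-request token output speed

Row 3: Resources

GPU memory usage
GPU KV cache usage
GPU utilization and temperature
Host memory usage

Row 4: Throughput

Prompt tokens/s
Generation tokens/s
Running requests vs. waiting requests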

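6. Summary

6.1 Key takeaways

About 80% of LLM service incidents fall into three categories: timeouts, rate limiting, and OOM. A systematic method is more reliable than experience: start from the symptom, check layer by layer, and stop at the root cause
For timeouts, first establish whether TTFT is slow (prefill phase), TPS is low (decode phase), or the connection itself times out (network layer). The three have completely different investigation paths
GPU OOM is usually caused by an unreasonable combination of max-num-seqs, max-model-len, and gpu-memory-utilization. Do not tune by feel; calculate the actual KV cache demand
Rate limiting should be layered: Nginx provides the first, coarse-grained layer, vLLM's max-num-seqs is the second layer of protection, and clients handle retries and fallback. Only the three together protect the service without hurting users
In containers, OOM Killed and GPU OOM are two different things: the former means host memory ran out (cgroup limit), the latter means GPU memory ran out. Tell them apart before choosing a fix
Prevention beats cure: run the checklist before deploying, watch the core metrics at runtime, and rely on self-healing as the safety net during incidents

6.2 Where to go next

vLLM source code: understanding how PagedAttention, Continuous Batching, and Speculative Decoding are implemented helps explain where performance bottlenecks come from. vllm/core/scheduler.py is a good place to start reading.
GPU performance tuning: using the CUDA profilers (nsys/ncu) and kernel-level performance analysis. This is very helpful for problems where GPU utilization is high but throughput will not go up.
Distributed inference: operational differences between Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism, and NCCL troubleshooting in multi-node, multi-GPU setups.

6.3 References

vLLM official documentation - configuration parameters and API reference
vLLM GitHub - the issue tracker is a good place to look up known bugs
NVIDIA GPU troubleshooting guide - diagnosing CUDA errors
NCCL official documentation - troubleshooting multi-GPU communication
Prometheus Alerting Rules - alerting rule syntax reference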

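Appendix

A. Error quick reference


Error code / message                           → Cause                               → Fix
─────────────────────────────────────────────────────────────────────────────────────────────
HTTP 429 Too Many Requests                     → Request rate exceeds the limit      → Raise the rate limit or burst
HTTP 500 Internal Server Error                 → Internal server-side error          → Read the vLLM logs to find the specific error
HTTP 502 Bad Gateway                           → Backend service unavailable         → Check whether the vLLM process is alive
HTTP 503 Service Unavailable                   → Service overloaded or in maintenance → Check load and health checks
HTTP 504 Gateway Timeout                       → Request processing timed out        → Raise timeouts or shorten the prompt
torch.cuda.OutOfMemoryError                    → Not enough GPU memory               → Lower gpu-memory-utilization/max-num-seqs
CUDA error: out of memory                      → CUDA memory allocation failed       → Restart the process, adjust memory parameters
CUDA error: device-side assert triggered       → GPU computation fault               → Check the model files, use BF16
RuntimeError: NCCL communicator was aborted    → Multi-GPU communication dropped     → Check GPU topology and shm-size
OSError: No space left on device               → Disk is full                        → Clean up logs and temporary files
ConnectionRefusedError                         → Service not started                 → Check the process and port
ValueError: max_model_len too large            → Context length exceeds model support → Lower max-model-len
OOM Killed (exit code 137)                     → Container/system memory limit       → Raise the memory limit
SIGABRT (exit code 134)                        → Fatal assertion failure             → Read the logs; usually a CUDA/NCCL error
KeyError: 'choices'                            → Unexpected response format          → The server may have returned an error; inspect the full response
asyncio.TimeoutError                           → Client-side async timeout           → Raise the aiohttp/httpx timeout
ssl.SSLError                                   → TLS certificate problem             → Check certificate validity and configuration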

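B. vLLM key parameter quick reference

# Flag availability and defaults differ between vLLM versions; confirm with `vllm serve --help`

# Model
--model <path_or_name>           # model path or Hugging Face name
--dtype auto|float16|bfloat16    # inference precision; auto picks one automatically
--quantization fp8|awq|gptq      # quantization method
--max-model-len <int>            # maximum context length (prompt + generation)
--trust-remote-code              # allow custom code from the model repository to run

# GPU memory and cache
--gpu-memory-utilization <0-1>   # fraction of GPU memory to use (default 0.9)
--swap-space <GB>                # CPU swap space size (default 4)
--kv-cache-dtype auto|fp8        # KV cache precision

# Concurrency and scheduling
--max-num-seqs <int>             # maximum number of concurrent sequences (default 256)
--max-num-batched-tokens <int>   # maximum tokens per batch
--enable-chunked-prefill         # enable chunked prefill (recommended)
--max-paddings <int>             # maximum number of paddings

# Multi-GPU
--tensor-parallel-size <int>     # tensor parallel degree (number of GPUs to use)
--pipeline-parallel-size <int>   # pipeline parallel degree

# Serving
--host <ip>                      # listen address (default localhost)
--port <int>                     # listen port (default 8000)
--request-timeout <seconds>      # per-request timeout
--disable-log-requests           # do not log every request (recommended in production)

# Monitoring
--enable-metrics                 # enable Prometheus metrics (enabled by default)

Common parameter combinations:

# Small model (< 10B), single GPU, optimized for low latency
--gpu-memory-utilization 0.85 --max-num-seqs 32 --max-model-len 32768

# Medium model (10-30B), single A100 80G, balanced
--gpu-memory-utilization 0.88 --max-num-seqs 128 --max-model-len 16384 --enable-chunked-prefill

# Large model (30-70B), two A100 80G
--tensor-parallel-size 2 --gpu-memory-utilization 0.9 --max-num-seqs 200 --max-model-len 16384

# Very large model (> 70B), four A100 80G
--tensor-parallel-size 4 --gpu-memory-utilization 0.92 --max-num-seqs 256 --max-model-len 8192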