llama.cpp 部署 Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

优质文章学习记录

06 Apr 2026 — 11 min read

模型：Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

"model": "Qwen3-14B"

显存：21~25GB

max-model-len ：40960

并发: 4

部署服务器：DGX-Spark-GB10 120GB

生成速率：13 tokens/s （慢的原因分析可见https://blog.ZEEKLOG.net/weixin_69334636/article/details/158497823?spm=1001.2014.3001.5501）

部署GGUF格式的模型有3种方法

对比项	Ollama	llama.cpp	LM Studio/OpenWebUI
上手难度	⭐ 最简单	⭐⭐⭐ 需编译	⭐ 图形界面
推理性能	🔶 中等	🥇 最强	🔶 中等
GPU控制	有限	完全可控	有限
API服务	开箱即用	需手动启动	内置
适合场景	快速部署/生产	性能调优/研究	本地体验

第1种：使用Ollama

前提：已经安装了ollama

第一步：Huggingface 或modelscope下载模型

<https://huggingface.co/TeichAI/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/tree/main>

第二步：修改Modelfile:使用Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf模型

FROM ./Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf TEMPLATE """{{ if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}{{ if .Prompt }}<|im_start|>user {{ .Prompt }}<|im_end|> {{ end }}<|im_start|>assistant {{ .Response }}<|im_end|>""" PARAMETER temperature 0.6 PARAMETER top_p 0.95 PARAMETER repeat_penalty 1.0

第三步：创建ollama实例

ollama create qwen3-claude-distill -f Modelfile

第四步：测试

注意：模型的思考模板有些问题”\u003cthink\u003e\n“，需要修改

Ollama API 访问 Ollama 默认端口是 11434，直接用： curl <http://localhost:11434/api/chat> \ -H "Content-Type: application/json" \ -d '{ "model": "qwen3-claude-distill", "messages": [ {"role": "user", "content": "你好，介绍一下你自己"} ], "stream": false }' # 响应： {"model":"qwen3-claude-distill","created_at":"2026-02-24T10:10:02.627171372Z","message":{"role":"assistant","content":"\\u003cthink\\u003e\\n用户让我介绍一下自己，这是一个很好的机会让我展示我的功能和特点，同时保持友好和自然的对话风格。\\n\\n我要介绍的内容应该包括：\\n1. 我是Qwen，是阿里巴巴集团旗下的通义实验室研发的超大规模语言模型\\n2. 我的中文名\\"通义千问\\"，英文名\\"Qwen\\"\\n3. 我的训练数据截止时间是2024年\\n4. 我的功能和应用场景（回答问题、创作、编程等）\\n5. 我的性格特点（友好、有帮助、诚实）\\n6. 我支持多语言交流\\n\\n我需要用自然的口语化中文表达，避免使用Markdown格式，保持段落简短，适当换行。同时要表达出我是AI助手的身份，但用词要亲切自然。\\n\\n让我组织一下语言：\\n\\u003c/think\\u003e\\n\\n 你好呀！我是Qwen，是阿里巴巴集团旗下的通义实验室研发的超大规模语言模型。你可以叫我通义千问或者Qwen。\\n\\n我主要负责回答各种问题、创作文字、编程协助，还有日常聊天陪伴。我的训练数据截止到2024年，所以对最新的信息可能了解得不够全面，但我会尽力提供有用的信息。\\n\\n我努力做到友好、有帮助，同时保持诚实。如果有不确定的地方，我也会坦率地告诉你。\\n\\n支持中文和英文交流，如果你有其他语言需求，也可以试试看哦！\\n\\n有什么我可以帮你的吗？"},"done":true,"done_reason":"stop","total_duration":21788815174,"load_duration":95605294,"prompt_eval_count":12,"prompt_eval_duration":74178850,"eval_count":301,"eval_duration":21564386933}

第2种：llama.cpp

第一步：下载 llama

git clone <https://github.com/ggerganov/llama.cpp>

第二步：GPU构建

cd ./llama.cpp cmake -B build \ -DGGML_CUDA=ON \ -DLLAMA_BUILD_SERVER=ON \ -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j 8 这将以8个并行编译任务来构建程序。 结果将存于 ./build/bin/ 。 # 构建失败可直接删除build目录即可 rm -rf build # 参数说明： 参数 作用 -DLLAMA_BUILD_SERVER=ON 强制构建 llama-server -DGGML_CUDA=ON 启用 GPU Release 性能更好 # 验证安装成功 ./build/bin/llama-server --help

第三步：部署模型(使用下载好的gguf模型)

# 简化命令 ./build/bin/llama-server \ -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \ -ngl 999 \ -c 40960 \ --host 0.0.0.0 \ --port 8908

后台运行部署

nohup ./build/bin/llama-server \ -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \ -ngl 999 \ --batch-size 1024 \ --threads 16 \ --parallel 4 \ --jinja \ --reasoning-format deepseek \ --reasoning-budget -1 \ -c 40960 \ --host 0.0.0.0 \ --port 8908 \ >> /home/admin/models/logs/llama_Qwen3-14B_Distill.log 2>&1 & # 查看 tail -f ~/models/logs/llama_Qwen3-14B_Distill.log

参数说明：

 --n-gpu-layers：指定有多少 transformer 层放到 GPU 上执行 0 全部 CPU 20 前 20 层 GPU 999 尽可能全部 GPU -c 40960: 即--ctx-size ，上下文长度（最大 token 数） --host 0.0.0.0：是否可远程访问：使用此参数，可以局域网可访问 -port 8908：HTTP 监听端口 --threads 16:CPU 线程数量 但你只有 16 核： → 线程抢占 → 反而性能下降 --batch-size 1024: GPU 每一步最多算多少 token --parallel 4:允许同时处理多少个请求（并发会话数） --reasoning-format deepseek：思考模板 --reasoning-budget N：思考模式控制 值 含义 -1 不限制思考（默认，开启） 0 禁用思考模式 >0 限制思考token数量（部分模型支持）

重要提醒（关于 40K）

Qwen3-14B q8_0：

模型权重 ≈ 15~16GB
40K KV cache 可能占 10GB+
总显存可能 > 28GB

如果你 GPU 只有 24GB，会爆显存。

第四步：测试

对话端点

<http://localhost:8908/v1/chat/completions> <http://服务器IP:8908/v1/chat/completions>

默认开启思考模式

# 请求(默认开启思考模式) curl <http://192.168.0.254:8908/v1/chat/completions> \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen3-14B", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "介绍一下新加坡"} ], "temperature": 0.7, "max_tokens": 500 }' # 响应 { "choices": [ { "finish_reason": "stop", "index": 0, "message": { "role": "assistant", "content": "# 新加坡简介\\n\\n## 基本概况\\n- **全称**：新加坡共和国（Republic of Singapore）\\n- **人口**：约580万（2024年）\\n- **面积**：728.6平方公里\\n- **首都**：新加坡市（无正式首都，行政中心）\\n- **国家元首**：哈莉玛·雅各布总统（2023年起）\\n- **政府首脑**：李显龙总理\\n- **国家象征**：红白旗、国狮\\n\\n---\\n\\n## 地理位置\\n新加坡位于东南亚马来半岛南端，扼守马六甲海峡要冲，是连接太平洋与印度洋的航运枢纽。这个被柔佛海峡环抱的热带岛国，战略位置堪称\\"东方十字路口\\"。\\n\\n---\\n\\n## 历史沿革\\n| 时期 | 重大事件 |\\n|------|---------|\\n| 1819年 | 斯坦福·莱佛士建立贸易站 |\\n| 1824年 | 英国东印度公司正式接管 |\\n| 1955年 | 实行自治 |\\n| 1965年8月9日 | 正式独立 |\\n| 1967年 | 新加坡-马来西亚分家 |\\n\\n从英国殖民地到独立国家，新加坡在短短几十年间实现了惊人的转型。\\n\\n---\\n\\n## 政治体制\\n- **政体**：议会共和制\\n- **执政党**：人民行动党（PAP，自1959年以来）\\n- **特色**：以\\"廉洁政府\\"著称，多次被透明国际评为清廉国家\\n- **选举制度**：混合选举制，兼顾选区与集团代表制\\n\\n---\\n\\n## 经济实力\\n- **GDP总量**：约4400亿美元（2023年）\\n- **人均GDP**：约8.5万美元\\n- **支柱产业**：金融、电子制造、生物医药、物流\\n- **全球排名**：世界银行《营商环境报告》常年榜首\\n- **货币**：新加坡元（SGD）\\n\\n作为全球四大国际金融中心之一，新加坡拥有亚洲最自由的经济体系。\\n\\n---\\n\\n## 社会特色\\n- **多语言社会**：四种官方语言（英语、华语、马来语、泰米尔语）\\n- **宗教多元**：佛教、伊斯兰教、基督教、印度教等和谐共存\\n- **教育水平**：识字率100%，PISA测试成绩常居亚洲前列\\n- **医疗体系**：全民医保，人均寿命83.5岁（2023年）\\n\\n---\\n\\n## 城市亮点\\n- **花园城市**：树木覆盖率超过50%，拥有1800多个公园\\n- **摩天大楼**：全球最高楼之一的滨海湾金沙酒店\\n- **美食天堂**：2020年联合国教科文组织认证\\"美食之都\\"\\n- **交通**：世界上最准时的地铁系统之一\\n- **安全指数**：全球最安全城市前列\\n\\n---\\n\\n## 独特之处\\n- 世界上唯一拥有内河港口的国家\\n- 全球唯一拥有热带雨林的城市国家\\n- 人均GDP超过瑞士的发达国家\\n- 华人占比74%的国家中保持英语优势\\n\\n新加坡将东方传统与西方现代性完美融合，创造了独特的\\"新加坡模式\\"。这个弹丸小国用智慧和勤奋，在国际舞台上绽放出耀眼光芒。", "reasoning_content": "The user wants me to introduce Singapore. Let me provide a comprehensive overview of this city-state.\\n\\n**Geography and Location:**\\n- Located at the southern tip of the Malay Peninsula\\n- At the crossroads of important shipping routes (Sunda Strait, Malacca Strait)\\n- Part of the Malay archipelago\\n- Strategically positioned between Southeast Asia, Oceania, and Asia\\n\\n**Basic Facts:**\\n- Official name: Republic of Singapore\\n- Population: ~5.8 million (2024 estimate)\\n- Area: 728.6 km² (land area) - very densely populated\\n- Capital: Singapore City (no official capital, administrative center)\\n- Status: City-state, sovereign nation\\n\\n**History:**\\n- Originally a fishing village\\n- Founded by Sir Stamford Raffles in 1819 as a trading post for the British East India Company\\n- Became Crown Colony in 1824\\n- Self-governance began in 1955\\n- Independence achieved on August 9, 1965\\n\\n**Government:**\\n- Parliamentary representative democracy\\n- Head of State: President (elected, ceremonial role)\\n- Current President: Halimah Yacob (2023-2029 term)\\n- Prime Minister: Lee Hsien Loong\\n- Single dominant party: People's Action Party (PAP) - though there are other parties\\n- Known for clean government and low corruption\\n\\n**Economy:**\\n- Global financial hub\\n- Major industries: financial services, electronics manufacturing, petroleum refining, biomedical sciences\\n- One of the world's freest economies\\n- High GDP per capita (~$70,000+)\\n- Currency: Singapore Dollar (SGD)\\n- Key ports and infrastructure\\n\\n**Culture:**\\n- Four official languages: English, Malay, Mandarin, Tamil\\n- Multicultural society with Chinese, Malay, Indian, and Eurasian communities\\n- Religious diversity: Buddhism, Islam, Christianity, Hinduism, Sikhism, Taoism\\n\\n\\n- Unique cultural blend evident in food, festivals, and traditions\\n- Vibrant street food culture with hawker centers\\n- Colorful festivals like Chinese New Year, Hari Raya, Deepavali, and Christmas\\n\\n**Tourism Highlights:**\\n- Iconic landmarks such as Marina Bay Sands and Gardens by the Bay\\n- Historical sites like the Sultan Mosque and Kampong Glam\\n- Modern attractions including the Singapore Flyer and Universal Studios Singapore\\n- Natural beauty with parks and nature reserves\\n\\n**Notable Characteristics:**\\n- Exceptionally clean and safe city\\n- Advanced infrastructure and efficient public services\\n- Strict laws with unique penalties\\n- High standard of living\\n- Strategic global business location" } } ], "created": 1772009362, "model": "Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf", "system_fingerprint": "b8148-244641955", "object": "chat.completion", "usage": { "completion_tokens": 1315, "prompt_tokens": 21, "total_tokens": 1336 }, "id": "chatcmpl-hFRSoVL8IHJmGl5GrVfFSMvSuA86JMrT", "timings": { "cache_n": 20, "prompt_n": 1, "prompt_ms": 75.532, "prompt_per_token_ms": 75.532, "prompt_per_second": 13.2394217020601, "predicted_n": 1315, "predicted_ms": 94887.726, "predicted_per_token_ms": 72.1579665399239, "predicted_per_second": 13.8584836567798 } }

关闭思考模式

# 关闭思考模式的请求 curl <http://192.168.0.254:8908/v1/chat/completions> \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen3-14B", "messages": [ {"role": "system", "content": "你是一个只回答用户问题的助手"}, {"role": "user", "content": "你好"} ], "temperature": 0.7, "max_tokens": 200, "chat_template_kwargs": { "enable_thinking": false } }' # 响应（"content"为空，是为"max_tokens": 200） { "choices": [ { "finish_reason": "stop", "index": 0, "message": { "role": "assistant", "content": " 你好！有什么我可以帮你的吗？" } } ], "created": 1772174720, "model": "Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf", "system_fingerprint": "b8148-244641955", "object": "chat.completion", "usage": { "completion_tokens": 10, "prompt_tokens": 26, "total_tokens": 36 }, "id": "chatcmpl-S52Gwewoh96JRRrdm5KdB21KGrnuVODJ", "timings": { "cache_n": 16, "prompt_n": 10, "prompt_ms": 87.113, "prompt_per_token_ms": 8.7113, "prompt_per_second": 114.793429224111, "predicted_n": 10, "predicted_ms": 648.929, "predicted_per_token_ms": 64.8929, "predicted_per_second": 15.4100063335126 } }

工具的调用

# 请求 curl <http://192.168.0.254:8908/v1/chat/completions> \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen3-14B", "messages": [ { "role": "system", "content": "你是一个只回答用户问题的助手" }, { "role": "user", "content": "新加坡现在几点？" } ], "temperature": 0.7, "max_tokens": 200, "tools": [ { "type": "function", "function": { "name": "get_current_time", "description": "获取指定城市的当前时间", "parameters": { "type": "object", "properties": { "city": { "type": "string", "description": "城市名称" } }, "required": ["city"] } } } ], "tool_choice": "auto", "chat_template_kwargs": { "enable_thinking": false } }' # 响应 { "choices": [ { "finish_reason": "tool_calls", "index": 0, "message": { "role": "assistant", "content": "", "tool_calls": [ { "type": "function", "function": { "name": "get_current_time", "arguments": "{\\"city\\":\\"\\\\u65b0\\\\u52a0\\\\u5761\\"}" }, "id": "UJgwQ1xfUUARN2axHjgTz9U6waRJ2xD3" } ] } } ], "created": 1772174881, "model": "Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf", "system_fingerprint": "b8148-244641955", "object": "chat.completion", "usage": { "completion_tokens": 21, "prompt_tokens": 171, "total_tokens": 192 }, "id": "chatcmpl-cqENt08ZMWZLZfzlB3pV0S8RGKJtZ9Z0", "timings": { "cache_n": 0, "prompt_n": 171, "prompt_ms": 196.58, "prompt_per_token_ms": 1.14959064327485, "prompt_per_second": 869.874860107844, "predicted_n": 21, "predicted_ms": 1455.213, "predicted_per_token_ms": 69.2958571428571, "predicted_per_second": 14.4308771293275 } }

🔥nohup 服务停止

假设你这样启动：

nohup ./build/bin/llama-server ... > llama.log 2>&1 &

✅ 方法 1（推荐）

1 查看进程 ps aux | grep llama-server ---显示---- admin 12345 ... 12345 就是 PID。 2 杀掉进程 kill -9 12345

✅ 方法 2（最快）

pkill llama-server

⚠️ 会杀掉所有 llama-server 进程。

✅ 方法 3（精确杀端口）

如果你知道端口是 8908：

lsof -i:8908 kill 进程号

🏆 推荐做法（生产环境）使用： systemctl

管理服务，而不是 nohup

nohup	systemd
手动管理	自动重启
无状态管理	可开机启动
无健康检测	有状态监控

llama.cpp 部署 Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

优质文章学习记录

第1种：使用Ollama

第2种：llama.cpp

第四步：测试

🔥nohup 服务停止

✅ 方法 1（推荐）

✅ 方法 2（最快）

✅ 方法 3（精确杀端口）

🏆 推荐做法（生产环境）使用： systemctl

Read more

Java LLM开发框架全面解析：从Spring AI到Agents-Flex

AI 模型高效化：推理加速与训练优化的技术原理与理论解析

【前沿解析】2026年3月15日：微软BitNet.cpp突破AI推理硬件枷锁——单CPU运行100B大模型，无损推理与能耗双重革新

用微信指挥你的 AI 员工：QClaw 给普通人发了一张超级个体的入场券

第1种：使用Ollama

第2种：llama.cpp

第四步：测试

🔥nohup 服务 停止

✅ 方法 1（推荐）

✅ 方法 2（最快）

✅ 方法 3（精确杀端口）

🏆 推荐做法（生产环境）使用： systemctl

Read more

Java LLM开发框架全面解析：从Spring AI到Agents-Flex

AI 模型高效化：推理加速与训练优化的技术原理与理论解析

【前沿解析】2026年3月15日：微软BitNet.cpp突破AI推理硬件枷锁——单CPU运行100B大模型，无损推理与能耗双重革新

用微信指挥你的 AI 员工：QClaw 给普通人发了一张超级个体的入场券

🔥nohup 服务停止