Compiling llama.cpp on Windows and Deploying a Qwen Model Locally

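This guide explains how to compile llama.cpp on Windows and use it to deploy a Qwen model. It covers CPU and GPU build configuration, downloading a GGUF model, starting the API service, and calling the service from a Python client. The examples support multi-turn conversation, context memory, and function (tool) calling, giving you a fully local large-model application.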

灭霸 · Published 2026/4/5 · Updated 2026/4/18

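For putting large models into practice, lightweight local deployment is in high demand among developers and AI enthusiasts because it offers low latency, strong privacy, and independence from cloud compute. This article targets 64-bit Windows 10/11. It walks through building llama.cpp (CPU and GPU modes; GPU acceleration requires NVIDIA CUDA), shows how to download the Qwen-7B-Chat model in GGUF format, and finishes with launching the model locally and exposing it as an API service.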

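1. Clone the code

Open PowerShell or CMD with administrator privileges and run the following commands to clone the repository and create a build directory inside it:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build
cd build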

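2. Build configuration

Choose either the basic CPU-only build or the GPU-accelerated build, which requires the CUDA Toolkit to be installed. Both use the Visual Studio 2022 generator, so Visual Studio 2022 with the C++ workload and CMake must be available.

CPU only: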

cmake .. -G "Visual Studio 17 2022" -A x64 -DLLAMA_CURL=OFF
cmake --build . --config Release

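With GPU support (requires the CUDA Toolkit). Current llama.cpp names this option GGML_CUDA; LLAMA_CUDA is the older, deprecated name:

cmake .. -G "Visual Studio 17 2022" -A x64 -DGGML_CUDA=ON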
cmake --build . --config Release
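
With the Visual Studio generator, the Release executables normally end up in build\bin\Release under the repository root. The following is a minimal sketch for confirming that the two programs used later in this guide were produced. The helper name missing_binaries and the default path are illustrative assumptions, so adjust the path if your build tree differs.

```python
from pathlib import Path

# Executables that the later steps of this guide rely on.
EXPECTED = ("llama-cli.exe", "llama-server.exe")


def missing_binaries(bin_dir, expected=EXPECTED):
    """Return the names in `expected` that are not present in `bin_dir`."""
    bin_dir = Path(bin_dir)
    return [name for name in expected if not (bin_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed output location for a Visual Studio Release build,
    # relative to the repository root.
    missing = missing_binaries(Path("build") / "bin" / "Release")
    print("Missing:", missing if missing else "none")
```

An empty result means both executables exist; anything listed points to a build step that failed or an output directory that differs from the assumed one.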

3. Download the model

Download a Qwen model in GGUF format (the 7B model is used as the example).

pip install modelscope
modelscope download --model Xorbits/Qwen-7B-Chat-GGUF

After the download, the files are usually saved under \modelscope\hub\models\Xorbits.
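
A repository like this usually contains several quantized files, and the -m option of llama.cpp needs the full path to one of them. As a rough sketch, the helper below lists every .gguf file under a directory, largest first. The function name find_gguf_files is hypothetical, and the directory to pass in is whatever location the download command reported on your machine.

```python
from pathlib import Path


def find_gguf_files(root):
    """Return all .gguf files under `root`, sorted largest first."""
    root = Path(root)
    files = [p for p in root.rglob("*.gguf") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


def describe(files):
    """Format each file as '<size in MiB> <path>' for printing."""
    return [f"{p.stat().st_size / 2**20:10.1f} MiB  {p}" for p in files]
```

For example, print("\n".join(describe(find_gguf_files(model_dir)))) shows the candidates, where model_dir is your own download directory.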

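4. Start the API service

Run the model, either interactively in the terminal or as an API service that accepts HTTP calls.

Interactive terminal chat (CPU build). The chcp 65001 command switches the console to UTF-8 so that non-ASCII text displays correctly: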

chcp 65001
llama-cli.exe -m qwen.gguf -i -c 4096

GPU-accelerated API server (listening on port 8080):

llama-server.exe -m qwen-7b-chat.Q4_0.gguf -c 4096 --port 8080 --host 127.0.0.1 --n-gpu-layers -1

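Once started, the service listens on http://localhost:8080 by default, and you can test it with curl. The command below is quoted for CMD, which does not treat single quotes as string delimiters. The native /completion endpoint names its length limit n_predict:

curl http://localhost:8080/completion -H "Content-Type: application/json" -d "{\"prompt\": \"Hello, please introduce Qwen\", \"temperature\": 0.7, \"n_predict\": 512}"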
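
Loading a 7B model can take a while, and requests sent before loading finishes will fail. The sketch below polls the server's /health endpoint until it answers with HTTP 200. The function names, the timeout, and the polling interval are illustrative assumptions; only the /health route comes from llama-server itself.

```python
import time
import urllib.error
import urllib.request


def health_ok(base_url="http://localhost:8080"):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(probe=health_ok, timeout=120.0, interval=1.0):
    """Call `probe` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_until_ready() before the first request lets a script start right after the server is launched. The probe is passed in as a parameter so that the waiting logic can be exercised without a running server.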

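5. Calling from a Python client

Test the model by calling it from code.

Basic non-streaming call (completion endpoint)

import json

import requests

url = "http://localhost:8080/completion"
headers = {"Content-Type": "application/json"}
# The literal values were lost from this listing. Those below are assumptions
# carried over from the curl test above and the chat example later in this article.
data = {
    "prompt": "Hello, please introduce Qwen",
    "temperature": 0.7,
    "n_predict": 512,   # maximum number of tokens to generate
    "stream": False,    # return the whole answer in a single response
    "stop": ["[PAD"],   # stop when the model starts emitting padding tokens
}
try:
    response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
    response.raise_for_status()
    result = response.json()
    print("Model reply:")
    print(result["content"])  # /completion returns the generated text in "content"
except Exception as e:
    print(f"Request failed: {e}")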
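
The completion endpoint can also stream tokens as they are generated when the request sets "stream" to true. The response then arrives as server-sent events, one "data: {...}" line per chunk. The sketch below shows one way to decode such lines. It assumes each chunk carries its text in a "content" field and marks the final chunk with "stop": true, so check this against the output of your llama.cpp version. The helper names are hypothetical.

```python
import json


def parse_sse_line(line):
    """Decode one 'data: {...}' line; return the JSON object or None."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and comments carry no payload
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)


def collect_text(lines):
    """Concatenate the 'content' fields of a sequence of SSE lines."""
    parts = []
    for line in lines:
        event = parse_sse_line(line)
        if event is None:
            continue
        parts.append(event.get("content", ""))
        if event.get("stop"):
            break  # assumed end-of-generation marker
    return "".join(parts)
```

With requests, the lines can come from requests.post(url, json=payload, stream=True).iter_lines(). To display tokens live, print each chunk's "content" inside the loop instead of joining at the end.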

Function tool-calling test

import json
import re
from datetime import datetime

import requests


# Tools available to the model
def get_current_time():
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"The current time is: {current_time}"


def calculate_add(a: float, b: float):
    return f"{a} + {b} = {a + b}"


tool_registry = {
    "get_current_time": {
        "function": get_current_time,
        "description": "Get the current local time; takes no parameters",
        "parameters": {},
    },
    "calculate_add": {
        "function": calculate_add,
        "description": "Add two numbers; requires parameters a and b",
        "parameters": {
            "a": {"type": "float", "required": True, "description": "first addend"},
            "b": {"type": "float", "required": True, "description": "second addend"},
        },
    },
}

# The JSON shape shown in the prompt matches what parse_tool_call() checks for.
chat_history = [{
    "role": "system",
    "content": (
        "You are a helpful assistant that can call the following tools to help answer: "
        "1. get_current_time 2. calculate_add. "
        "If you need to call a tool, reply strictly in JSON format, for example "
        '{"name": "calculate_add", "parameters": {"a": 1, "b": 2}}.'
    ),
}]
url = "http://localhost:8080/chat/completions"
headers = {"Content-Type": "application/json"}


def clean_pad_content(content):
    # Strip [PAD123]-style padding tokens that can leak into the output
    return re.sub(r'\[PAD\d+\]', '', content).strip()


def parse_tool_call(content):
    # Return the tool-call dict found in the reply, or None if there is none
    try:
        json_match = re.search(r'\{[\s\S]*\}', content)
        if not json_match:
            return None
        tool_call = json.loads(json_match.group())
        if "name" in tool_call and "parameters" in tool_call:
            return tool_call
        return None
    except Exception:
        return None


def execute_tool(tool_call):
    tool_name = tool_call["name"]
    parameters = tool_call.get("parameters", {})
    if tool_name not in tool_registry:
        return f"Error: no tool named {tool_name}"
    tool_info = tool_registry[tool_name]
    tool_func = tool_info["function"]
    tool_params = tool_info["parameters"]
    try:
        # Convert arguments to their declared types before calling the tool
        for param_name, param_info in tool_params.items():
            if param_name in parameters:
                param_type = param_info.get("type", "str")
                if param_type == "float":
                    parameters[param_name] = float(parameters[param_name])
                elif param_type == "int":
                    parameters[param_name] = int(parameters[param_name])
        result = tool_func(**parameters)
        return f"Tool call succeeded ({tool_name}): {result}"
    except Exception as e:
        return f"Error: {tool_name} failed - {e}"


def chat_with_model(prompt):
    chat_history.append({"role": "user", "content": prompt})
    data = {
        "model": "qwen.gguf",
        "messages": chat_history,
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": False,
        "stop": ["[PAD"],
    }
    try:
        response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
        response.raise_for_status()
        result = response.json()
        if "choices" in result and len(result["choices"]) > 0:
            raw_answer = result["choices"][0]["message"]["content"]
            clean_answer = clean_pad_content(raw_answer)
            tool_call = parse_tool_call(clean_answer)
            if tool_call:
                tool_result = execute_tool(tool_call)
                chat_history.append({"role": "assistant", "content": f"Tool call result: {tool_result}"})
                # data["messages"] is the same list object as chat_history, so the
                # second request already includes the tool result appended above
                second_response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
                second_response.raise_for_status()
                second_result = second_response.json()
                final_answer = clean_pad_content(second_result["choices"][0]["message"]["content"])
                chat_history.append({"role": "assistant", "content": final_answer})
                return final_answer
            else:
                chat_history.append({"role": "assistant", "content": clean_answer})
                return clean_answer
        return "Unexpected response format"
    except Exception as e:
        return f"Request failed: {e}"


if __name__ == "__main__":
    print("Starting a multi-turn conversation (type 'exit' to quit):")
    while True:
        user_input = input("You: ")
        if user_input.strip() == "exit":
            break
        if not user_input.strip():
            print("Assistant: Please enter some text!\n")
            continue
        answer = chat_with_model(user_input)
        print(f"Assistant: {answer}\n")
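
The pattern \{[\s\S]*\} used in parse_tool_call is greedy: it captures everything from the first opening brace to the last closing brace, so a reply that contains text with braces after the JSON fails to parse. One possible alternative, sketched below, scans with json.JSONDecoder.raw_decode and returns the first object that has both expected keys. The function name extract_tool_call is hypothetical and this is not part of the original script.

```python
import json


def extract_tool_call(text):
    """Return the first JSON object in `text` with 'name' and 'parameters'."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
            return obj
        start = text.find("{", start + 1)
    return None
```

It returns the same kind of dictionary as parse_tool_call, or None when no tool call is found, so it could be swapped in without changing the rest of the flow.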
