llama.cpp: A Simple Tutorial and Hands-On Practice

  1. Environment: an Intel Core i3 CPU at 2.10 GHz (with AVX instruction-set support), 12 GB of RAM, Windows 10, with the C/C++ runtime libraries and Python already installed.
  2. Project dependencies
  3. Project source
  4. Building the project
  5. Project commands
  6. Miscellaneous

2. Project dependencies: CMake. Download it from the official Download CMake page; a standard installation is all you need.

                            Git: see Git - Downloads. Make sure the Git installation folder is added to the Path environment variable so the git command is available from any directory.

3. Project source

        Note: llama.cpp is an extremely fast-moving open-source project; a new release often appears within 1 to 6 hours of the previous one, and there are many differences between versions. At that pace, this tutorial was effectively out of date before it was even finished (it was written against version 4806), so please treat it only as a reference.

Releases · ggml-org/llama.cpp · GitHub

        To get complete, efficient binaries best suited to your machine, I recommend downloading the source code and building it yourself. If you would rather use prebuilt binaries (which do not include the Python scripts that ship with the source), pick the package on the Releases page above that matches your computer (instruction set, architecture, operating system, with or without CUDA), unzip it, and run the executables directly from the command line.

        For a manual build, you can either clone the whole repository or download the source code archive.

4. Building the project

        If, like me, you start from an unzipped source archive, open cmd inside that folder and run git init first so the directory becomes a Git repository; this avoids errors during the build.

        Also be aware that in some cases the build can finish in an abnormal state without reporting any error.

        In the command-line window, run the following in order (CPU-only build):

cmake -B build
cmake --build build --config Release
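If you want to use more CPU cores during compilation, cmake --build also accepts a parallel-jobs flag. A minimal sketch of the whole sequence starting from a freshly unzipped source folder (the folder path here is only an example):

cd /d E:\llama.cpp-master
git init
cmake -B build
cmake --build build --config Release -j 4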

Official build documentation: llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub

The C/C++ compiler may emit many warnings about forced type conversions losing precision; you can ignore them as long as the build commands run to completion.

To check whether the build output contains any actual errors, I suggest copying the log into an AI assistant and having it take a look.

When the build finishes, the build output folder (build\bin\Release inside the project directory) will contain many executables named like llama-xxx.exe; the C/C++ part of the project is now built.

Open the project folder as a PyCharm project, create a new virtual environment, and install the packages listed in requirements.txt; after that you can use the Python scripts that come with the project.
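If you prefer plain cmd over PyCharm, the equivalent steps look roughly like this (run from the project root; the .venv folder name is just a common convention):

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt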

The build is complete!

Add the build\bin\Release folder to the "Path" environment variable, then open cmd in any other folder and type llama-cli; if you get output back, the setup is done.
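For example, a quick check from a new cmd window (the --version flag is listed in the help output shown later and prints version and build info):

llama-cli --version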

5. Command practice

File name convention: I recommend using the full file name together with its path. Example: I:\Restore\Qwen2.5\qwen2.5-72b-instruct-q4_0.gguf

llama-run

I:\Restore\Qwen2.5>llama-run
Error: No arguments provided.
Description:
  Runs a llm
Usage:
  llama-run [options] model [prompt]
Options:
  -c, --context-size <value>    Context size (default: 2048)
  --chat-template-file <path>   Path to the file containing the chat template to use with the model. Only supports jinja templates and implicitly sets the --jinja flag.
  --jinja                       Use jinja templating for the chat template of the model
  -n, -ngl, --ngl <value>       Number of GPU layers (default: 0)
  --temp <value>                Temperature (default: 0.8)
  -v, --verbose, --log-verbose  Set verbosity level to infinity (i.e. log all messages, useful for debugging)
  -h, --help                    Show help message
Commands:
  model
    Model is a string with an optional prefix of huggingface:// (hf://), ollama://, https:// or file://.
    If no protocol is specified and a file exists in the specified path, file:// is assumed, otherwise if a file does not exist in the specified path, ollama:// is assumed.
    Models that are being pulled are downloaded with .partial extension while being downloaded and then renamed as the file without the .partial extension when complete.
Examples:
  llama-run llama3
  llama-run ollama://granite-code
  llama-run ollama://smollm:135m
  llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
  llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
  llama-run https://example.com/some-file1.gguf
  llama-run some-file2.gguf
  llama-run file://some-file3.gguf
  llama-run --ngl 999 some-file4.gguf
  llama-run --ngl 999 some-file5.gguf Hello World
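Following the Examples block above, running a local GGUF file with a short prompt might look like this (the file is the Qwen2.5 model from the path example earlier; on a 12 GB machine you would substitute a much smaller model):

cd /d I:\Restore\Qwen2.5
llama-run qwen2.5-72b-instruct-q4_0.gguf Hello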

llama-cli

llama-cli -m Filename
I:\Restore\Qwen2.5>llama-cli -h
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz)
load_backend: failed to find ggml_backend_init in E:\LLAMA\llama.cpp-master\bin\Debug\ggml-cpu.dll
----- common params -----
  -h, --help, --usage            print usage and exit
  --version                      show version and build info
  --completion-bash              print source-able bash completion script for llama.cpp
  --verbose-prompt               print a verbose prompt before generation (default: false)
  -t, --threads N                number of threads to use during generation (default: -1) (env: LLAMA_ARG_THREADS)
  -tb, --threads-batch N         number of threads to use during batch and prompt processing (default: same as --threads)
  -C, --cpu-mask M               CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: "")
  -Cr, --cpu-range lo-hi         range of CPUs for affinity. Complements --cpu-mask
  --cpu-strict <0|1>             use strict CPU placement (default: 0)
  --prio N                       set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: 0)
  --poll <0...100>               use polling level to wait for work (0 - no polling, default: 50)
  -Cb, --cpu-mask-batch M        CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch (default: same as --cpu-mask)
  -Crb, --cpu-range-batch lo-hi  ranges of CPUs for affinity. Complements --cpu-mask-batch
  --cpu-strict-batch <0|1>       use strict CPU placement (default: same as --cpu-strict)
  --prio-batch N                 set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: 0)
  --poll-batch <0|1>             use polling to wait for work (default: same as --poll)
  -c, --ctx-size N               size of the prompt context (default: 4096, 0 = loaded from model) (env: LLAMA_ARG_CTX_SIZE)
  -n, --predict, --n-predict N   number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled) (env: LLAMA_ARG_N_PREDICT)
  -b, --batch-size N             logical maximum batch size (default: 2048) (env: LLAMA_ARG_BATCH)
  -ub, --ubatch-size N           physical maximum batch size (default: 512) (env: LLAMA_ARG_UBATCH)
  --keep N                       number of tokens to keep from the initial prompt (default: 0, -1 = all)
  -fa, --flash-attn              enable Flash Attention (default: disabled) (env: LLAMA_ARG_FLASH_ATTN)
  -p, --prompt PROMPT            prompt to start generation with; for system message, use -sys
  --no-perf                      disable internal libllama performance timings (default: false) (env: LLAMA_ARG_NO_PERF)
  -f, --file FNAME               a file containing the prompt (default: none)
  -bf, --binary-file FNAME       binary file containing the prompt (default: none)
  -e, --escape                   process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
  --no-escape                    do not process escape sequences
  --rope-scaling {none,linear,yarn}  RoPE frequency scaling method, defaults to linear unless specified by the model (env: LLAMA_ARG_ROPE_SCALING_TYPE)
  --rope-scale N                 RoPE context scaling factor, expands context by a factor of N (env: LLAMA_ARG_ROPE_SCALE)
  --rope-freq-base N             RoPE base frequency, used by NTK-aware scaling (default: loaded from model) (env: LLAMA_ARG_ROPE_FREQ_BASE)
  --rope-freq-scale N            RoPE frequency scaling factor, expands context by a factor of 1/N (env: LLAMA_ARG_ROPE_FREQ_SCALE)
  --yarn-orig-ctx N              YaRN: original context size of model (default: 0 = model training context size) (env: LLAMA_ARG_YARN_ORIG_CTX)
  --yarn-ext-factor N            YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation) (env: LLAMA_ARG_YARN_EXT_FACTOR)
  --yarn-attn-factor N           YaRN: scale sqrt(t) or attention magnitude (default: 1.0) (env: LLAMA_ARG_YARN_ATTN_FACTOR)
  --yarn-beta-slow N             YaRN: high correction dim or alpha (default: 1.0) (env: LLAMA_ARG_YARN_BETA_SLOW)
  --yarn-beta-fast N             YaRN: low correction dim or beta (default: 32.0) (env: LLAMA_ARG_YARN_BETA_FAST)
  -dkvc, --dump-kv-cache         verbose print of the KV cache
  -nkvo, --no-kv-offload         disable KV offload (env: LLAMA_ARG_NO_KV_OFFLOAD)
  -ctk, --cache-type-k TYPE      KV cache data type for K, allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 (default: f16) (env: LLAMA_ARG_CACHE_TYPE_K)
  -ctv, --cache-type-v TYPE      KV cache data type for V, allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 (default: f16) (env: LLAMA_ARG_CACHE_TYPE_V)
  -dt, --defrag-thold N          KV cache defragmentation threshold (default: 0.1, < 0 - disabled) (env: LLAMA_ARG_DEFRAG_THOLD)
  -np, --parallel N              number of parallel sequences to decode (default: 1) (env: LLAMA_ARG_N_PARALLEL)
  --mlock                        force system to keep model in RAM rather than swapping or compressing (env: LLAMA_ARG_MLOCK)
  --no-mmap                      do not memory-map model (slower load but may reduce pageouts if not using mlock) (env: LLAMA_ARG_NO_MMAP)
  --numa TYPE                    attempt optimizations that help on some NUMA systems
                                 - distribute: spread execution evenly over all nodes
                                 - isolate: only spawn threads on CPUs on the node that execution started on
                                 - numactl: use the CPU map provided by numactl
                                 if run without this previously, it is recommended to drop the system page cache before using this
                                 see https://github.com/ggml-org/llama.cpp/issues/1437 (env: LLAMA_ARG_NUMA)
  -dev, --device <dev1,dev2,..>  comma-separated list of devices to use for offloading (none = don't offload)
                                 use --list-devices to see a list of available devices (env: LLAMA_ARG_DEVICE)
  --list-devices                 print list of available devices and exit
  -ngl, --gpu-layers, --n-gpu-layers N  number of layers to store in VRAM (env: LLAMA_ARG_N_GPU_LAYERS)
  -sm, --split-mode {none,layer,row}  how to split the model across multiple GPUs, one of:
                                 - none: use one GPU only
                                 - layer (default): split layers and KV across GPUs
                                 - row: split rows across GPUs (env: LLAMA_ARG_SPLIT_MODE)
  -ts, --tensor-split N0,N1,N2,...  fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1 (env: LLAMA_ARG_TENSOR_SPLIT)
  -mg, --main-gpu INDEX          the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0) (env: LLAMA_ARG_MAIN_GPU)
  --check-tensors                check model tensor data for invalid values (default: false)
  --override-kv KEY=TYPE:VALUE   advanced option to override model metadata by key. may be specified multiple times. types: int, float, bool, str.
                                 example: --override-kv tokenizer.ggml.add_bos_token=bool:false
  --lora FNAME                   path to LoRA adapter (can be repeated to use multiple adapters)
  --lora-scaled FNAME SCALE      path to LoRA adapter with user defined scaling (can be repeated to use multiple adapters)
  --control-vector FNAME         add a control vector
                                 note: this argument can be repeated to add multiple control vectors
  --control-vector-scaled FNAME SCALE  add a control vector with user defined scaling SCALE
                                 note: this argument can be repeated to add multiple scaled control vectors
  --control-vector-layer-range START END  layer range to apply the control vector(s) to, start and end inclusive
  -m, --model FNAME              model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf) (env: LLAMA_ARG_MODEL)
  -mu, --model-url MODEL_URL     model download url (default: unused) (env: LLAMA_ARG_MODEL_URL)
  -hf, -hfr, --hf-repo <user>/<model>[:quant]  Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist. example: unsloth/phi-4-GGUF:q4_k_m (default: unused) (env: LLAMA_ARG_HF_REPO)
  -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]  Same as --hf-repo, but for the draft model (default: unused) (env: LLAMA_ARG_HFD_REPO)
  -hff, --hf-file FILE           Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused) (env: LLAMA_ARG_HF_FILE)
  -hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]  Hugging Face model repository for the vocoder model (default: unused) (env: LLAMA_ARG_HF_REPO_V)
  -hffv, --hf-file-v FILE        Hugging Face model file for the vocoder model (default: unused) (env: LLAMA_ARG_HF_FILE_V)
  -hft, --hf-token TOKEN         Hugging Face access token (default: value from HF_TOKEN environment variable) (env: HF_TOKEN)
  --log-disable                  Log disable
  --log-file FNAME               Log to file
  --log-colors                   Enable colored logging (env: LLAMA_LOG_COLORS)
  -v, --verbose, --log-verbose   Set verbosity level to infinity (i.e. log all messages, useful for debugging)
  -lv, --verbosity, --log-verbosity N  Set the verbosity threshold. Messages with a higher verbosity will be ignored. (env: LLAMA_LOG_VERBOSITY)
  --log-prefix                   Enable prefix in log messages (env: LLAMA_LOG_PREFIX)
  --log-timestamps               Enable timestamps in log messages (env: LLAMA_LOG_TIMESTAMPS)
----- sampling params -----
  --samplers SAMPLERS            samplers that will be used for generation in the order, separated by ';' (default: penalties;dry;top_k;typ_p;top_p;min_p;xtc;temperature)
  -s, --seed SEED                RNG seed (default: -1, use random seed for -1)
  --sampling-seq, --sampler-seq SEQUENCE  simplified sequence for samplers that will be used (default: edkypmxt)
  --ignore-eos                   ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
  --temp N                       temperature (default: 0.8)
  --top-k N                      top-k sampling (default: 40, 0 = disabled)
  --top-p N                      top-p sampling (default: 0.9, 1.0 = disabled)
  --min-p N                      min-p sampling (default: 0.1, 0.0 = disabled)
  --top-nsigma N                 top-n-sigma sampling (default: -1.0, -1.0 = disabled)
  --xtc-probability N            xtc probability (default: 0.0, 0.0 = disabled)
  --xtc-threshold N              xtc threshold (default: 0.1, 1.0 = disabled)
  --typical N                    locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
  --repeat-last-n N              last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
  --repeat-penalty N             penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
  --presence-penalty N           repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
  --frequency-penalty N          repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
  --dry-multiplier N             set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
  --dry-base N                   set DRY sampling base value (default: 1.75)
  --dry-allowed-length N         set allowed length for DRY sampling (default: 2)
  --dry-penalty-last-n N         set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size)
  --dry-sequence-breaker STRING  add sequence breaker for DRY sampling, clearing out default breakers ('\n', ':', '"', '*') in the process; use "none" to not use any sequence breakers
  --dynatemp-range N             dynamic temperature range (default: 0.0, 0.0 = disabled)
  --dynatemp-exp N               dynamic temperature exponent (default: 1.0)
  --mirostat N                   use Mirostat sampling. Top K, Nucleus and Locally Typical samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
  --mirostat-lr N                Mirostat learning rate, parameter eta (default: 0.1)
  --mirostat-ent N               Mirostat target entropy, parameter tau (default: 5.0)
  -l, --logit-bias TOKEN_ID(+/-)BIAS  modifies the likelihood of token appearing in the completion, i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello', or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
  --grammar GRAMMAR              BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '')
  --grammar-file FNAME           file to read grammar from
  -j, --json-schema SCHEMA       JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
                                 For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
----- example-specific params -----
  --no-display-prompt            don't print prompt at generation (default: false)
  -co, --color                   colorise output to distinguish prompt and user input from generations (default: false)
  --no-context-shift             disables context shift on infinite text generation (default: disabled) (env: LLAMA_ARG_NO_CONTEXT_SHIFT)
  -sys, --system-prompt PROMPT   system prompt to use with model (if applicable, depending on chat template)
  -ptc, --print-token-count N    print token count every N tokens (default: -1)
  --prompt-cache FNAME           file to cache prompt state for faster startup (default: none)
  --prompt-cache-all             if specified, saves user input and generations to cache as well
  --prompt-cache-ro              if specified, uses the prompt cache but does not update it
  -r, --reverse-prompt PROMPT    halt generation at PROMPT, return control in interactive mode
  -sp, --special                 special tokens output enabled (default: false)
  -cnv, --conversation           run in conversation mode:
                                 - does not print special tokens and suffix/prefix
                                 - interactive mode is also enabled
                                 (default: auto enabled if chat template is available)
  -no-cnv, --no-conversation     force disable conversation mode (default: false)
  -i, --interactive              run in interactive mode (default: false)
  -if, --interactive-first       run in interactive mode and wait for input right away (default: false)
  -mli, --multiline-input        allows you to write or paste multiple lines without ending each in '\'
  --in-prefix-bos                prefix BOS to user inputs, preceding the `--in-prefix` string
  --in-prefix STRING             string to prefix user inputs with (default: empty)
  --in-suffix STRING             string to suffix after user inputs with (default: empty)
  --no-warmup                    skip warming up the model with an empty run
  -gan, --grp-attn-n N           group-attention factor (default: 1) (env: LLAMA_ARG_GRP_ATTN_N)
  -gaw, --grp-attn-w N           group-attention width (default: 512) (env: LLAMA_ARG_GRP_ATTN_W)
  --jinja                        use jinja template for chat (default: disabled) (env: LLAMA_ARG_JINJA)
  --reasoning-format FORMAT      reasoning format (default: deepseek; allowed values: deepseek, none)
                                 controls whether thought tags are extracted from the response, and in which format they're returned. 'none' leaves thoughts unparsed in `message.content`, 'deepseek' puts them in `message.reasoning_content` (for DeepSeek R1 & Command R7B only).
                                 only supported for non-streamed responses (env: LLAMA_ARG_THINK)
  --chat-template JINJA_TEMPLATE  set custom jinja chat template (default: template taken from model's metadata)
                                 if suffix/prefix are specified, template will be disabled
                                 only commonly used templates are accepted (unless --jinja is set before this flag):
                                 list of built-in templates: chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, phi4, rwkv-world, vicuna, vicuna-orca, zephyr (env: LLAMA_ARG_CHAT_TEMPLATE)
  --chat-template-file JINJA_TEMPLATE_FILE  set custom jinja chat template file (default: template taken from model's metadata)
                                 if suffix/prefix are specified, template will be disabled
                                 only commonly used templates are accepted (unless --jinja is set before this flag):
                                 list of built-in templates: chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, phi4, rwkv-world, vicuna, vicuna-orca, zephyr (env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
  --simple-io                    use basic IO for better compatibility in subprocesses and limited consoles

example usage:

  text generation:     llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

  chat (conversation): llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
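Combining the example usage at the end of the help with the path convention from earlier, an interactive chat session on this CPU-only build could be started roughly like this (the -c and -t values are only illustrative; adjust them to your machine and model):

llama-cli -m I:\Restore\Qwen2.5\qwen2.5-72b-instruct-q4_0.gguf -cnv -c 2048 -t 4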

llama-quantize: used for quantizing and requantizing models

In short, for a GGUF file that has already been quantized you must use --allow-requantize. The command-line format is:

llama-quantize --allow-requantize File1.gguf File2.gguf Q3_K

File2.gguf is the file name you choose for the quantized output. The quantization type parameters and the full command-line usage are:

I:\Restore\Qwen2.5>llama-quantize
usage: llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]
  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quantized model in the same shards as input
  --override-kv KEY=TYPE:VALUE Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together
Allowed quantization types:
   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5 bpw quantization
  29  or  IQ2_M   :  2.7 bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  36  or  TQ1_0   :  1.69 bpw ternarization
  37  or  TQ2_0   :  2.06 bpw ternarization
  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    :  alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    :  alias for Q4_K_M
  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K    :  alias for Q5_K_M
  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G @ 7B
          COPY    : only copy tensors, no quantizing
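For example, following the usage line above, quantizing an unquantized F16 GGUF down to Q4_K_M with four threads might look like this (both file names are placeholders):

llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M 4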

llama-gguf-split: used for splitting and merging GGUF files (extremely handy for uploading and downloading)

Note: when merging, you only need to pass the first file of the series as GGUF_IN; llama.cpp will look in the current folder for the other files of the same (identically named) series and merge them all. The --merge flag is required when merging.

I:\Restore\Qwen2.5>llama-gguf-split
error: bad arguments
usage: llama-gguf-split [options] GGUF_IN GGUF_OUT
Apply a GGUF operation on IN to OUT.
options:
  -h, --help               show this help message and exit
  --version                show version and build info
  --split                  split GGUF to multiple GGUF (enabled by default)
  --merge                  merge multiple GGUF to a single GGUF
  --split-max-tensors      max tensors in each split (default: 128)
  --split-max-size N(M|G)  max size per split
  --no-tensor-first-split  do not add tensors to the first split (disabled by default)
  --dry-run                only print out a split plan and exit, without writing any new files
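As an illustration (the file names are placeholders, and the shard naming shown here follows the usual <name>-00001-of-0000N.gguf pattern, so check what is actually produced on your machine), splitting a model into roughly 4 GB shards and later merging them back might look like:

llama-gguf-split --split --split-max-size 4G model.gguf model-split
llama-gguf-split --merge model-split-00001-of-00003.gguf model-merged.gguf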

To convert safetensors (Hugging Face format) models to GGUF, use convert_hf_to_gguf.py.
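A rough sketch of that conversion, run inside the Python environment set up earlier (the model directory and output file name are placeholders, and the exact flags can differ between llama.cpp versions, so check python convert_hf_to_gguf.py --help first):

python convert_hf_to_gguf.py I:\Models\Qwen2.5-7B-Instruct --outfile qwen2.5-7b-instruct-f16.gguf --outtype f16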

6. Miscellaneous:

Download speeds from ModelScope are quite good, although they depend heavily on your network environment.

Most model files are very large, so leave yourself plenty of storage space.

If a command returns no output at all (not even an error message), rebuild the project, or download a different version and build that instead.

Memory usage is fairly high.
