Whisper-large-v3: Speech Recognition Field Test and Engineering Deployment

# 1. Install dependencies (Ubuntu 24.04)
apt-get update && apt-get install -y ffmpeg
pip install -r requirements.txt
# 2. Launch (the first run automatically downloads large-v3.pt)
python3 app.py
# 3. Open a browser and go to http://localhost:7860
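
Before launching, it can help to confirm that the pieces the steps above rely on are actually present. The following is a minimal sketch, not part of the original project: the function name `check_environment` is hypothetical, and the assumption that the Python package is importable as `whisper` comes from the `large-v3.pt` checkpoint name rather than from `requirements.txt`, which is not shown here.

```python
import importlib.util
import shutil


def check_environment(modules=("whisper",), binaries=("ffmpeg",)):
    """Report which required binaries and Python modules can be found.

    The default names are assumptions; adjust them to match requirements.txt.
    """
    report = {}
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[f"bin:{name}"] = shutil.which(name) is not None
    for name in modules:
        # find_spec returns None for a top-level module that is not installed
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, found in check_environment().items():
        print(f"{item}: {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING should be installed before running `python3 app.py`.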

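import os
import subprocess
import tempfile

import numpy as np
from scipy.io import wavfile


def robust_audio_preprocess(input_path: str, output_path: str = None) -> str:
    """
    Audio preprocessing tuned for Whisper:
    - resample to 16 kHz mono
    - tame pops and loud peaks with dynamic range compression
      (no explicit clip detection is performed)
    - optional stationary noise reduction
    - save as lossless 16-bit PCM WAV
    """
    if output_path is None:
        # mkstemp creates the file safely; tempfile.mktemp is deprecated and race-prone
        fd, output_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
    # Step 1: FFmpeg normalization (resample + compress)
    cmd = [
        "ffmpeg", "-y", "-i", input_path,
        "-ar", "16000", "-ac", "1",
        "-af", "acompressor=threshold=-20dB:ratio=4:attack=5:release=50",
        "-acodec", "pcm_s16le", output_path
    ]
    # check=True raises CalledProcessError instead of silently leaving a broken file
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # Step 2: optional second-pass denoising (noisereduce)
    try:
        import noisereduce as nr
    except ImportError:
        return output_path  # noisereduce not installed: keep the FFmpeg output
    sample_rate, audio = wavfile.read(output_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    audio_clean = nr.reduce_noise(
        y=audio.astype(np.float32), sr=sample_rate,
        stationary=True, prop_decrease=0.75
    )
    # Clip before casting so out-of-range floats cannot wrap around in int16
    audio_clean = np.clip(audio_clean, -32768, 32767).astype(np.int16)
    wavfile.write(output_path, sample_rate, audio_clean)
    return output_path


# Usage example
clean_wav = robust_audio_preprocess("noisy_meeting.mp3")
print(f"Preprocessing finished, output path: {clean_wav}")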
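
The preprocessing step mentions clip detection, but the FFmpeg filter chain only compresses the signal. One way to measure clipping explicitly is to count how many samples sit at or near 16-bit full scale. The sketch below uses only the standard library; the function name `clipping_ratio` and the `margin` default are illustrative assumptions, not values from the original article.

```python
import array
import sys
import wave


def clipping_ratio(wav_path: str, margin: int = 16) -> float:
    """Return the fraction of samples within `margin` of 16-bit full scale."""
    with wave.open(wav_path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM WAV")
        frames = wf.readframes(wf.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0
    limit = 32767 - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)
```

A file whose ratio is well above zero was probably recorded too hot; what threshold counts as "too much" is a judgment call that depends on the material.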

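import re

# Placeholder for protected periods; NUL should not occur in transcripts
_PLACEHOLDER = "\x00"


def add_punctuation_and_capitalize(text: str) -> str:
    """
    Rule-based punctuation and capitalization repair (zero dependencies, pure Python)
    """
    text = text.strip()
    # 1. Protect abbreviation periods so they are not treated as sentence ends
    text = re.sub(r'\b(Mr|Mrs|Ms|Dr|Prof|St|Ave|USA|UK|EU)\.',
                  lambda m: m.group(1) + _PLACEHOLDER, text)
    # 2. Capitalize the first letter of the text
    text = re.sub(r'^([a-z])', lambda m: m.group(1).upper(), text)
    # 3. After a period, question mark, or exclamation mark: one space + capital letter
    text = re.sub(r'([.!?])\s+([a-z])', lambda m: m.group(1) + ' ' + m.group(2).upper(), text)
    # 4. Restore the protected abbreviation periods
    text = text.replace(_PLACEHOLDER, '.')
    # 5. Append a period when the text has no terminal punctuation
    if text and not re.search(r'[.!?]$', text):
        text += '.'
    # 6. Smart Chinese-style quotes: turn the first pair of straight quotes into curly ones
    if '“' not in text and '"' in text:
        text = text.replace('"', '“', 1).replace('"', '”', 1)
    return text


# Usage example
raw = "hello how are you i am fine thank you"
fixed = add_punctuation_and_capitalize(raw)
print(fixed)  # Hello how are you i am fine thank you.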
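
The rules above can only fix the start and end of the text; they cannot place punctuation inside a run of unpunctuated words. One possible complement is to use the pauses between transcript segments as sentence boundaries. This is a rough sketch under stated assumptions: it expects a list of dicts with `start`, `end`, and `text` keys (the shape of the `segments` list returned by openai-whisper's `transcribe`), and the function name `punctuate_by_pauses` and the 0.6 second threshold are illustrative, not taken from the original article.

```python
def punctuate_by_pauses(segments, pause_threshold=0.6):
    """Join segments, ending a sentence wherever the following pause is long.

    A gap of at least `pause_threshold` seconds before the next segment
    yields a period, a shorter gap yields a comma.
    """
    pieces = []
    for i, seg in enumerate(segments):
        text = seg["text"].strip()
        if not text:
            continue
        is_last = i == len(segments) - 1
        # Silence between the end of this segment and the start of the next one
        gap = None if is_last else segments[i + 1]["start"] - seg["end"]
        if text[-1] not in ".!?,;:":
            text += "." if is_last or gap >= pause_threshold else ","
        pieces.append(text)
    return " ".join(pieces)


segments = [
    {"start": 0.0, "end": 1.2, "text": " hello how are you"},
    {"start": 2.1, "end": 3.0, "text": " i am fine"},
    {"start": 3.1, "end": 3.8, "text": " thank you"},
]
print(punctuate_by_pauses(segments))  # hello how are you. i am fine, thank you.
```

The joined string can then be passed through `add_punctuation_and_capitalize` to fix casing. Pause length is only a heuristic; fast speakers and mid-sentence hesitations will produce wrong boundaries.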

import re


def normalize_numbers_in_text(text: str, target_lang: str = "zh") -> str:
    """
    Regex-based number normalization covering a few French, German,
    Japanese, and Chinese patterns. `target_lang` is currently unused.
    """
    # French multipliers after digits: "2 mille" -> "2000", "3 millions" -> "3000000"
    # (\g<1> is required: a bare \1000 would be parsed as an octal escape)
    text = re.sub(r'(\d+)\s+millions?\b', r'\g<1>000000', text, flags=re.IGNORECASE)
    text = re.sub(r'(\d+)\s+mille\b', r'\g<1>000', text, flags=re.IGNORECASE)
    # German number words
    num_map_de = {"eins": "1", "zwei": "2", "drei": "3", "vier": "4", "fünf": "5"}
    for de, num in num_map_de.items():
        text = re.sub(rf'\b{de}\b', num, text, flags=re.IGNORECASE)
    # Japanese and Chinese units: 万 (10^4), 億/亿 (10^8)
    # Compound forms such as 3億5万 are converted piecewise, not summed
    text = re.sub(r'(\d+)\s*万', lambda m: str(int(m.group(1)) * 10000), text)
    text = re.sub(r'(\d+)\s*[億亿]', lambda m: str(int(m.group(1)) * 100000000), text)
    return text


# Usage example
# Spelled-out number words are not covered by the patterns above,
# so this sentence is returned unchanged
fr_text = "Le montant est de deux mille vingt-trois euros"
print(normalize_numbers_in_text(fr_text))
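
The French sentence in the usage example contains only spelled-out number words, which digit-based regexes cannot match. A small word-level parser is one way to close that gap. The sketch below is an illustration under stated assumptions, not the article's method: the names `french_words_to_int` and `replace_french_numbers` are hypothetical, the vocabulary stops below one million, a standalone "et" ("vingt et un") is not merged, and words with attached punctuation are left alone.

```python
# Assumed, deliberately small vocabulary: enough for values below one million
_SMALL = {
    "zéro": 0, "un": 1, "une": 1, "deux": 2, "trois": 3, "quatre": 4,
    "cinq": 5, "six": 6, "sept": 7, "huit": 8, "neuf": 9, "dix": 10,
    "onze": 11, "douze": 12, "treize": 13, "quatorze": 14, "quinze": 15,
    "seize": 16, "vingt": 20, "vingts": 20, "trente": 30, "quarante": 40,
    "cinquante": 50, "soixante": 60,
}
_VOCAB = set(_SMALL) | {"cent", "cents", "mille", "et"}


def french_words_to_int(tokens):
    """Convert lowercase French number tokens, e.g. ['vingt', 'trois'], to an int."""
    total, current, prev = 0, 0, None
    for tok in tokens:
        if tok in ("vingt", "vingts") and prev == "quatre":
            current += 76  # quatre-vingt(s) = 80; the 4 was already added
        elif tok in _SMALL:
            current += _SMALL[tok]
        elif tok in ("cent", "cents"):
            current = max(current, 1) * 100
        elif tok == "mille":
            total += max(current, 1) * 1000
            current = 0
        prev = tok  # "et" falls through: it carries no value
    return total + current


def replace_french_numbers(text):
    """Replace runs of spelled-out French numbers with digits."""
    out, run = [], []

    def flush():
        if run:
            parts = [p for word in run for p in word.lower().split("-")]
            if parts in (["un"], ["une"]):
                out.extend(run)  # a lone "un"/"une" is most likely an article
            else:
                out.append(str(french_words_to_int(parts)))
            run.clear()

    for word in text.split(" "):
        parts = word.lower().split("-")
        if word.lower() != "et" and all(p in _VOCAB for p in parts):
            run.append(word)
        else:
            flush()
            out.append(word)
    flush()
    return " ".join(out)


print(replace_french_numbers("Le montant est de deux mille vingt-trois euros"))
# → Le montant est de 2023 euros
```

Running this before a digit-oriented normalizer lets the later regex rules see digits instead of words. Regional forms such as "septante" or "nonante" would need extra vocabulary entries.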

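Whisper-large-v3 Speech Recognition Field Test

Test overview

Test environment and baseline capabilities

Test configuration

Core capability verification

Quick deployment workflow

Measured accuracy analysis

Multinational meeting recording (English, Japanese, and Chinese mixed)

Cantonese-Mandarin mixed interview (Cantonese + Mandarin)

Podcast with background music (Chinese, BGM level about -12 dB)

Fast-paced Japanese news (220 characters per minute)

Indian English customer service call (heavy accent, with Hindi loanwords)

AI voice assistant trilingual test (German → Spanish → Chinese, switching within 5 seconds)

Potential issues and boundary analysis

Dialect continuum: Hokkien (Minnan) vs. Teochew

Extreme noise: subway station announcements

Specialized terminology: "TSH" in medical reports

Number reading: French "vingt-trois"

Homophone ambiguity: "期中考试" (midterm exam) vs. "其中考试" (homophonous mis-transcription)

Long silence: sudden speech after 40 seconds of blank audio

Engineering solutions for deployment

Audio preprocessing: fixing high noise, low sample rate, and pops

Punctuation and capitalization restoration

Multilingual number normalization

Summary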
