Miwa — Real-Time Discord Voice Translation
AMD MI300X · 192 GB HBM3 · Llama 3.3 70B · FP16

Miwa 美話

Language barriers cost gaming communities millions of shared moments every day. Miwa closes the gap — per-speaker Discord voice translation with romaji, a three-agent CrewAI suggestion pipeline, and style-matched LLM refinement, targeting under 800ms on AMD MI300X.

Francis Daniel Genese (Mizu) AMD Developer Hackathon 2026 · lablab.ai · Track 1: AI Agents & Agentic Workflows · May 4–11, 2026
<800ms · End-to-end latency
70B · Model parameters
3 · AI reply suggestions
FP16 · Precision on MI300X
200M+ · Monthly Discord users
Discord has no native real-time translation. Miwa is a transparent overlay that requires nothing from the other person.
<800ms · End-to-end latency target. Fast enough to follow the conversation as it happens — no awkward lag, no interrupting the flow of the call.
$0 · New hardware required. Pure software overlay — runs on any PC already running Discord, no special headsets or upgrades needed.
Demo

What you see during a call

The Miwa overlay sits transparently above your game or browser. Per-speaker cards appear as each person talks, fading out when they go silent.

Word-by-word karaoke highlight
Each word lights up in sync with the speaker's voice as openai-whisper emits word-level timestamps.
Two-pass translation
A fast translation appears the moment speech ends. The LLM then refines it to your chosen style — the card updates in place with no flicker.
Reply with one key
Press 1, 2, or 3 to instantly send that suggestion to Discord chat. Each card also has individual buttons — Bot Speaks in VC, Bot Sends in text, or I’ll Speak (opens fullscreen romaji).
AI agents generate each suggestion Track 1
Three CrewAI agents — Analyst → Strategist → Writer — run on AMD MI300X to produce context-aware Japanese replies. Per-speaker Qdrant vector memory means suggestions improve as the conversation continues.
Miwa
connected
Formal Neutral Casual Gaming
Opacity
100%
?
Type English — translates as Casual…
🍁 Valo Tomodachi 🌵 · #🔥 | ざつだん 🍁_ · 1 in call ▾
M
Mizu
lol kusa 草w
💬 🔊
Ehh?! ee~ えー!
💬 🔊
Seriously? maji de? マジで?
💬 🔊
Insane! yabai! やばい!
💬 🔊
Nice one! ii ne! いいね!
💬 🔊
Speakers
M
Mizu 🎤
AI
それは面白いですね
soreha omoshiroi desune
That’s interesting.
Reply With
1
そうですね
Yes, it is
📌
🔊 Bot Speaks 💬 Bot Sends 👤 I’ll Speak
2
どうして面白いと思ったんですか
Why do you think it’s interesting?
📌
🔊 Bot Speaks 💬 Bot Sends 👤 I’ll Speak
3
私も興味があります
I’m interested too
📌
🔊 Bot Speaks 💬 Bot Sends 👤 I’ll Speak
LLM ~500ms · WS open
Pipeline

Voice to translation targeting under 800ms

01 · Audio Capture · <5ms
02 · openai-whisper STT · ~150ms
03 · Google Translate · <100ms
04 · Llama 3.3 70B FP16 · ~500ms
05 · CrewAI Agents · ~8–15s†
06 · Tauri Overlay · <20ms
miwa — live pipeline simulation
01 — Audio Capture <5ms
IN Discord Opus stream (per-speaker, 48 kHz)
Latency breakdown target <800ms end-to-end
01 Capture (<5ms)
02 Transcription (~150ms)
03 Fast Translation (<100ms)
04 LLM refinement (~500ms)
05 Overlay (<20ms) — 06 Suggestions deferred (~8–15s, agentic pipeline)
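The stage targets above sum comfortably under the 800 ms budget, leaving headroom for network jitter on the WebSocket hop. A quick sanity check, treating each `<`/`~` figure as its worst case:

```python
# Per-stage latency targets (ms), taken from the breakdown above
stages_ms = {
    "capture": 5,             # 01, <5 ms
    "transcription": 150,     # 02, ~150 ms
    "fast_translation": 100,  # 03, <100 ms
    "llm_refinement": 500,    # 04, ~500 ms
    "overlay": 20,            # 05, <20 ms
}

# 06 (suggestions) is deferred and runs off the critical path, so it is excluded
total = sum(stages_ms.values())  # 775 ms
```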
Video

Watch it in a live call

Full end-to-end walkthrough recorded live on AMD MI300X hardware.

Infrastructure

Why AMD MI300X matters for this

Used in Miwa — AMD Developer Cloud
AMD MI300X
ROCm 7.2 · vLLM 0.17.1 · PyTorch 2.6.0 · 20 vCPU · 240 GB RAM
192GB · HBM3 VRAM
5.3TB/s · Memory bandwidth
FP16 · Full precision
RTX 5090 (consumer) · NVIDIA H100 (data center) · AMD MI300X (Miwa) ✓
VRAM: 32 GB GDDR7 · 80 GB HBM2e · 192 GB HBM3
Llama 3.3 70B: INT4 only — quality loss · FP8/INT8 — barely fits · full FP16, single GPU
Bandwidth: 1.79 TB/s · 3.35 TB/s · 5.3 TB/s
Multi-GPU: required for 70B · often needed · not needed
Ecosystem: CUDA only · CUDA only · ROCm — open source
Zero quantization loss
192GB unified HBM3 memory fits Llama 3.3 70B entirely in FP16. No INT4/INT8 rounding errors that degrade Japanese nuance and translation style.
5.3 TB/s memory bandwidth
Token generation is memory-bound, not compute-bound. The MI300X moves model weights to compute units 3× faster than an RTX 5090 — every transformer layer stays fed without stalling, which is why full FP16 inference completes under 500ms.
ROCm open ecosystem
openai-whisper, vLLM 0.17.1, and PyTorch 2.6.0 all run on ROCm 7.2 — the same PyTorch code that runs on CUDA runs here with a single env flag.
Multi-speaker concurrency
Discord voice channels can have 10+ speakers. The MI300X handles concurrent openai-whisper transcription and vLLM inference requests without GPU contention.
Features

Everything you need to stay in the conversation

Two-pass translation
Instant display, then refined
Google Translate fires in under 100ms so you read immediately. Llama 3.3 70B follows with a style-aware refinement — Formal, Neutral, Casual, or Gaming — updating the card in place.
Google
<100ms
LLM
~500ms
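One way the two-pass update could be wired: the server emits two WebSocket packets that share a card id, and the client replaces the text in place when the refined packet lands. The packet field names here are assumptions for illustration, not Miwa's actual protocol:

```python
# Pass 1: Google Translate result, sent the moment speech ends
fast = {"type": "fast", "card_id": 42, "en": "That is interesting."}

# Pass 2: style-refined Llama 3.3 70B result for the same card
refined = {
    "type": "refined",
    "card_id": 42,
    "en": "Oh nice, that's pretty interesting.",
    "style": "casual",
}

cards: dict[int, str] = {}
for packet in (fast, refined):
    # Same key both times, so the card updates in place with no flicker
    cards[packet["card_id"]] = packet["en"]
```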
Always-on-top overlay
Tauri v2 window, zero distraction
Transparent, frameless, always-on-top Tauri window sits over your game or browser. Drag the header to reposition. Double-click to snap to corners. Resize vertically from the bottom edge.
Quick Reply
Type English, see Japanese live
Debounced auto-translation as you type. Preview your message in Japanese and romaji before sending — the bot delivers it in your chosen style.
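The debounce pattern here can be sketched with asyncio: each keystroke cancels the pending translation and restarts the delay, so only the final text triggers a call. This is an illustrative stand-in (the delay and the `Debouncer` helper are assumptions, not Miwa's actual code, which runs in the TypeScript UI):

```python
import asyncio

class Debouncer:
    """Collapse rapid keystrokes into a single translation call."""

    def __init__(self, delay: float):
        self.delay = delay
        self._task = None

    def submit(self, text, translate):
        # Cancel the in-flight wait; only the last keystroke survives
        if self._task and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._fire(text, translate))

    async def _fire(self, text, translate):
        await asyncio.sleep(self.delay)
        await translate(text)

async def demo():
    calls = []
    async def fake_translate(text):
        calls.append(text)

    d = Debouncer(0.05)
    for chunk in ["h", "he", "hel", "hello"]:
        d.submit(chunk, fake_translate)
        await asyncio.sleep(0.01)  # keystrokes arrive faster than the delay
    await asyncio.sleep(0.1)       # let the last timer fire
    return calls                   # only "hello" was translated
```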
Shortcuts
Keyboard-first workflow
1 / 2 / 3 · Send suggestion to chat
Ctrl+1–9 · Phrasebook slots
? · Show all shortcuts
Esc · Close romaji popup
Quick Reactions
Mode-adaptive reactions library
80 pre-written reactions (20 per mode) that automatically swap when you change style — Casual, Gaming, Formal, Neutral. Live search filters by JP text, romaji, or English. One click sends via Bot Sends (chat) or Bot Speaks (TTS in voice channel).
草w えー! マジで? やばい! いいね! +75
Memory
Per-speaker Qdrant vector store
Each utterance is embedded and stored against the speaker’s Discord user ID. When the CrewAI Analyst agent runs, it retrieves the most relevant past exchanges via vector similarity — giving the suggestion pipeline actual conversation context, not just the last sentence. Suggestions get more accurate as the call progresses.
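The retrieval pattern, sketched without the Qdrant dependency — a pure-Python stand-in where Qdrant would perform the same top-K cosine search filtered by a speaker-id payload field. Embeddings here are toy 2-D vectors; the real store uses full sentence embeddings:

```python
import math

# speaker_id -> [(embedding, utterance)]
memory: dict[str, list] = {}

def remember(speaker_id: str, embedding: list, text: str) -> None:
    memory.setdefault(speaker_id, []).append((embedding, text))

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def recall(speaker_id: str, query: list, k: int = 2) -> list:
    """Top-k most similar past utterances from this speaker only."""
    ranked = sorted(memory.get(speaker_id, []),
                    key=lambda e: cosine(e[0], query), reverse=True)
    return [text for _, text in ranked[:k]]

remember("mizu#1234", [1.0, 0.0], "ランクやばい")    # gaming talk
remember("mizu#1234", [0.0, 1.0], "明日は仕事です")  # work talk
remember("other#5678", [1.0, 0.1], "別の人の発言")   # different speaker, never returned
```

Keying the store by Discord user ID is what keeps one speaker's history from bleeding into another's suggestions.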
Romaji
Fullscreen popup
Hit 3 to open a large romaji pronunciation overlay and speak the reply yourself.
Stack

Built on AMD MI300X and open-source AI

Primary GPU
AMD MI300X
192GB HBM3 unified memory — runs Llama 3.3 70B in full FP16 with zero quantization loss
192GB VRAM FP16 ROCm 7.2 5.3 TB/s vLLM 0.17.1
Inference
Serving
vLLM + ROCm
STT
openai-whisper
AI & Agents
LLM
Llama 3.3 70B
Alt. LLM
Qwen2.5 72B
Agents
CrewAI
Vector DB
Qdrant
TTS
edge-tts
Translation
Google Translate
Romaji
pykakasi
Application
Server
FastAPI
Cache
SQLite WAL
Desktop
Tauri v2
Runtime
Rust
UI
React 19
Animation
Framer Motion
Discord
discord.js v14
Engineering

Built to production standard

Solo build · 7 days · AMD Developer Hackathon 2026
Full-stack, from GPU to glass.
One engineer. Five runtimes. Real-time AI inference on AMD MI300X.
TypeScript — React 19 overlay UI
Python — FastAPI AI pipeline server
Rust — Tauri desktop backend
JavaScript — Discord bot & voice capture
GLSL — Hyperspeed WebGL shaders
System architecture — data flow
01
A friend speaks Japanese in Discord
Each person’s voice is captured as a separate stream — not mixed with others. Miwa listens only to the Discord call and forwards the audio over an encrypted connection to AMD’s AI cloud for processing.
Node.js 18 · discord.js v14 · @discordjs/voice · per-speaker Opus stream · SSH tunnel
Encrypted SSH tunnel → AMD AI cloud
02
The AI cloud converts speech and translates it
Speech recognition converts the audio to Japanese text. Google Translate returns an English result in under 100ms — shown to you immediately. A large language model then refines the translation to match your chosen style (formal, casual, or gaming slang). Three context-aware reply suggestions are generated based on the conversation.
AMD MI300X · 192 GB HBM3 · openai-whisper STT · Llama 3.3 70B FP16 · vLLM 0.17.1 · Google Translate · CrewAI agents · Qdrant memory
WebSocket JSON → local machine · total <800ms
03
You see the translation floating above Discord
A transparent window stays on top of your screen without covering Discord. You see the Japanese text, a romaji pronunciation guide so you can read it aloud, and the English translation. Select a suggested reply and the Discord bot delivers it — or tap “I’ll Speak” to say it yourself using the phonetic guide.
Tauri v2 · Rust · React 19 · TypeScript strict · Jotai · Framer Motion · transparent always-on-top window
Step 02 in detail
Inside the AI cloud — AMD MI300X · 192 GB HBM3
Python · ROCm 7.2 · vLLM 0.17.1 · PyTorch 2.6.0
STT
openai-whisper
PCM → Japanese text + word-level timestamps for karaoke
<150ms
TRANSLATE · pass 1
Google Translate
Fast EN result → shown immediately as fast packet
<100ms
TRANSLATE · pass 2
vLLM — Llama 3.3 70B FP16
Style-refined EN → updates card as refined packet
<700ms
ROMAJI
pykakasi
JP text → phonetic reading for pronunciation guide
<5ms
AGENTS — deferred, does not block translation
CrewAI 3-agent pipeline · vLLM + Qdrant
Analyst — retrieves top-K past utterances from Qdrant (per-speaker vector memory), assesses conversation context and emotional register
Strategist — receives Analyst brief, decides reply type (agreement / question / reaction / game callout), produces structured handoff
Writer — generates 3 style-matched Japanese reply suggestions from Strategist’s brief · each suggestion pre-synthesized via edge-tts
~8–15s (deferred)
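The Analyst → Strategist → Writer chain can be pictured as structured briefs passed down the line. A schematic sketch — the field names and stub decision logic are assumptions for illustration; the real pipeline runs these as CrewAI agents backed by vLLM:

```python
from dataclasses import dataclass

@dataclass
class AnalystBrief:
    context: list          # top-K utterances retrieved from per-speaker memory
    register: str          # e.g. "casual", "excited"

@dataclass
class StrategistBrief:
    reply_type: str        # agreement / question / reaction / game callout
    analyst: AnalystBrief

def strategist(brief: AnalystBrief) -> StrategistBrief:
    # Stub decision rule; the real agent is an LLM call
    reply_type = "reaction" if brief.register == "excited" else "question"
    return StrategistBrief(reply_type, brief)

def writer(brief: StrategistBrief, n: int = 3) -> list:
    # Stub: the real Writer generates n style-matched Japanese replies via vLLM
    return [f"[{brief.reply_type} suggestion {i + 1}]" for i in range(n)]

suggestions = writer(strategist(AnalystBrief(["ランクやばい"], "excited")))
```

The structured handoff is the point: each agent consumes the previous brief rather than re-reading the raw transcript.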
JavaScript — Discord bot
Python — AI server
Rust — Tauri backend
TypeScript — React UI
Python — FastAPI AI server (AMD MI300X)
Key engineering decisions
01
openai-whisper, not WhisperX
WhisperX uses CTranslate2 for inference acceleration — CTranslate2 has no ROCm backend. Running WhisperX on AMD MI300X silently falls back to CPU: 10–20× slower. openai-whisper runs natively on PyTorch + ROCm with full GPU utilization. Not a preference — the alternative would have broken the latency target entirely.
02
Tauri, not Electron
50 MB binary vs Electron’s 200+ MB. The Rust backend gives native always-on-top transparent windows with no chrome, direct OS API access, and a real security sandbox — non-negotiable for a persistent screen overlay.
03
SSH tunnel, not public port
All traffic between the local bot and AMD cloud routes through SSH port 22. Zero firewall rules, zero exposed WebSocket endpoints, zero attack surface. The production path is identical to the development path.
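The tunnel amounts to a single local port forward; hostnames and port numbers below are placeholders, not the project's actual values:

```shell
# Forward the bot's local WebSocket port to the FastAPI server on the
# AMD cloud instance, over SSH — nothing else is exposed to the network.
ssh -N -L 8765:localhost:8765 user@amd-cloud-instance
```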
04
Per-speaker, not mixed audio
Discord’s VoiceReceiver provides separate Opus streams per user ID. Mixing destroys speaker identity before transcription. Each stream gets its own openai-whisper task, enabling per-speaker karaoke cards and context-aware reply suggestions.
05
FP16 + 80% VRAM cap
192 GB HBM3 fits Llama 3.3 70B in full FP16 — no INT4/INT8. Japanese honorifics, register, and verb endings encode in the tail of the probability distribution; INT4 truncates that tail. vLLM is capped at --gpu-memory-utilization 0.80, reserving ~38 GB for Whisper. vLLM's default of 0.90 leaves too little headroom for STT and fails silently at runtime.
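The VRAM split behind the 0.80 cap, as arithmetic (FP16 weight size is params × 2 bytes; vLLM's share also has to hold the KV cache, which is why the weights fit with room to spare):

```python
hbm_gb = 192
vllm_share = hbm_gb * 0.80            # 153.6 GB for vLLM (weights + KV cache)
whisper_share = hbm_gb - vllm_share   # 38.4 GB left for openai-whisper
llama_weights_gb = 70e9 * 2 / 1e9     # 140 GB of FP16 weights
```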
06
pykakasi, not MeCab
MeCab requires a native shared library compiled against a specific OS and dictionary version — inside a ROCm Docker container with non-standard Python, that is a multi-hour dependency tangle. pykakasi is pure Python: one pip install, zero native deps, zero build step. The romaji quality difference is negligible for pronunciation guidance.