Voice assistants have become ubiquitous, but most rely entirely on cloud services for speech processing. What if you could run the Speech to Text (STT) and Text to Speech (TTS) models locally and lean on a single cloud service for the chat model? This article walks through the hardware and services needed to build your very own (mostly) local voice assistant.
To interface with our models over voice, we can use a very approachable piece of hardware from the Home Assistant team: the Home Assistant Voice Preview Edition (Voice PE). It retails for about $60 and includes everything you need to talk to your system.
The other thing we need is a place to run Home Assistant. Since we are offloading the heavier Conversation Agent/LLM to a cloud service (Groq), the hardware requirements are pretty minimal. Something like an old Raspberry Pi 4 with 4GB+ RAM works fine. For sub-second response times, you'll want an Intel-based system like an old laptop or NUC with a Core i5 or better.
That being said, you can also run the Conversation Agent locally, but the requirements are more intensive. You could run something like qwen2.5:7b on modest hardware (a 3060 with 12GB of VRAM would work), but inference on consumer GPUs is much slower than on cloud providers. According to recent benchmarks, an RTX 3060 gets ~38 tokens/second on 8B-class models, while even an RTX 4090 tops out around 90-130 tokens/second—compared to ~500 tokens/second on Groq. For a voice assistant where response time matters, that's the difference between a snappy 1-2 second reply and an awkward 5-10 second pause. Luckily, if you don't have a gaming machine lying around, we can lean on Groq, which has a very approachable pricing tier and fast inference speeds.
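If you do go the local route, one approach is to add an Ollama container to the Docker stack shown later and point the conversation agent at its OpenAI-compatible endpoint (http://YOUR_SERVER_IP:11434/v1) instead of Groq. This is only a sketch; the image tag, model choice, and any GPU passthrough are assumptions to adapt to your hardware:

# Sketch: local alternative to the Groq conversation agent
ollama:
  image: ollama/ollama:latest
  container_name: voice-assistant-ollama
  ports:
    - "11434:11434"
  volumes:
    - ./data/ollama:/root/.ollama   # model storage
  restart: unless-stopped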
The system orchestrates multiple Docker containers, each handling a specific responsibility:
┌─────────────────────────────────────────────────────────────────────────┐
│                          Voice Assistant Stack                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────┐                                                   │
│   │ Voice Assistant │                                                   │
│   │       PE        │                                                   │
│   └────────┬────────┘                                                   │
│            │                                                            │
│            ▼                                                            │
│   ┌─────────────────────────────────────────────────────────────┐       │
│   │                       Home Assistant                        │       │
│   │                                                             │       │
│   │  ┌─────────┐    ┌───────────────────────┐    ┌───────────┐  │       │
│   │  │ Whisper │───▶│    Extended OpenAI    │───▶│   Piper   │  │       │
│   │  │  (STT)  │    │     Conversation      │    │   (TTS)   │  │       │
│   │  └─────────┘    └───────────┬───────────┘    └───────────┘  │       │
│   │                             │                               │       │
│   └─────────────────────────────┼───────────────────────────────┘       │
│                                 │                                       │
│                ┌────────────────┴────────────────┐                      │
│                │                                 │                      │
│                ▼                                 ▼                      │
│        ┌──────────────┐            ┌──────────────────────────┐         │
│        │  Groq Cloud  │            │       Web Service        │         │
│        │    (Convo    │            │   (handles tool calls)   │         │
│        │    Agent)    │            │  ┌────────────────────┐  │         │
│        └──────────────┘            │  │ SearXNG (queries)  │  │         │
│                                    │  └────────────────────┘  │         │
│                                    └──────────────────────────┘         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Groq offers several advantages for this use case:
- A generous free tier on llama-3.1-8b-instant — more than enough for personal use (20-50 queries/day won't come close to the limits)
- Very fast inference (~500 tokens/second), so responses feel conversational rather than laggy

The llama-3.1-8b-instant model provides an excellent quality-to-speed ratio for voice assistant responses.
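Since Groq exposes an OpenAI-compatible API (the same endpoint and model used throughout this post), you can sanity-check your API key with a quick curl before wiring anything into Home Assistant:

curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-8b-instant", "messages": [{"role": "user", "content": "Reply with one short sentence."}]}'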
When you speak to the assistant:

1. The Voice PE hears the wake word and streams your audio to Home Assistant.
2. Whisper transcribes the audio to text.
3. Extended OpenAI Conversation sends the text to Groq, which either answers directly or decides to call a tool.
4. Tool calls (like the web search defined below) hit the local web service, which queries SearXNG and summarizes the results.
5. Piper converts the reply back to speech and the Voice PE plays it.
Everything for this stack runs through three services (the boxes in the diagram above): Home Assistant, which hosts the Whisper → Extended OpenAI Conversation → Piper pipeline; the Groq cloud conversation agent; and a web service backed by SearXNG that handles tool calls.

Home Assistant handles interfacing with the Voice Assistant, but for tool calling we are going to use the web service together with a Home Assistant custom component called Extended OpenAI Conversation. That lets us define tools for the Conversation Agent to call as it processes the intent parsed from our voice input. For example, a web search tool can be defined like this:
- spec:
    name: search_web
    description: >-
      Search the internet for real-time information (prices, news, sports, stocks).
      Return ONLY the exact text from this tool. Do not add notes, disclaimers, or explanations.
    parameters:
      type: object
      properties:
        query:
          type: string
          description: The search query to find information
      required:
        - query
  function:
    type: rest
    method: POST
    resource: 'http://YOUR_SERVER_IP:8765/search'
    headers:
      Content-Type: application/json
    timeout: 60
    payload_template: '{{ {"query": query} | to_json }}'
    value_template: '{{ value_json.message }}'
This gives us an extensible setup: tool definitions live inside Extended OpenAI Conversation, while their actual functionality is consolidated behind a simple web service API.
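As a sketch of that extensibility, here's what a second tool could look like. The /reminders endpoint below is hypothetical (it isn't part of this post's web service); it just shows that each new capability is another spec entry plus another route:

# Hypothetical second tool: same pattern, different endpoint on the web service
- spec:
    name: set_reminder
    description: Create a reminder with a short message and a time.
    parameters:
      type: object
      properties:
        message:
          type: string
          description: What to be reminded about
        time:
          type: string
          description: When to fire the reminder, e.g. "in 20 minutes"
      required:
        - message
        - time
  function:
    type: rest
    method: POST
    resource: 'http://YOUR_SERVER_IP:8765/reminders'
    headers:
      Content-Type: application/json
    timeout: 30
    payload_template: '{{ {"message": message, "time": time} | to_json }}'
    value_template: '{{ value_json.message }}'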
The basic docker-compose configuration would look like:
services:
  searxng:
    image: searxng/searxng:latest
    container_name: voice-assistant-searxng
    ports:
      - "8080:8080"
    volumes:
      - ./data/searxng:/etc/searxng:rw
    restart: unless-stopped

  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    container_name: voice-assistant-homeassistant
    network_mode: host
    volumes:
      - ./data/home-assistant:/config
      - ./home-assistant.log:/config/home-assistant.log
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=America/Chicago
    restart: unless-stopped
    privileged: true

  whisper:
    image: rhasspy/wyoming-whisper:latest
    container_name: whisper
    ports:
      - "10300:10300"
    volumes:
      - ./data/whisper:/data
    command: --model tiny
    restart: unless-stopped

  piper:
    image: rhasspy/wyoming-piper:latest
    container_name: piper
    ports:
      - "10200:10200"
    volumes:
      - ./data/piper:/data
    command: --voice en_US-lessac-medium
    restart: unless-stopped
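One SearXNG detail: the web service shown later calls SearXNG with format=json, and SearXNG rejects JSON requests unless the json output format is enabled in its settings. A minimal settings file in the mounted ./data/searxng volume could look like this (the secret key value is a placeholder):

# ./data/searxng/settings.yml
use_default_settings: true
server:
  secret_key: "change-me-to-something-random"
search:
  formats:
    - html
    - json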
Once the containers are running, you'll need to configure Home Assistant to wire everything together.
Go to Settings → Devices & Services → Add Integration and add the Wyoming Protocol integration twice, pointing at the host running your containers:

- Whisper (STT): port 10300
- Piper (TTS): port 10200

The Extended OpenAI Conversation custom component isn't in the default HACS repository, so you'll need to add it manually:
Add https://github.com/jekalmin/extended_openai_conversation as an Integration through HACS → Custom repositories, install it, and restart Home Assistant. Then go to Settings → Devices & Services → Add Integration → Extended OpenAI Conversation and configure:
- API Key: your Groq API key
- Base URL: https://api.groq.com/openai/v1
- Model: llama-3.1-8b-instant

After adding, click Configure on the integration to add your tool definitions (like the search example above).
Go to Settings → Voice assistants → Add Assistant and configure:

- Conversation agent: Extended OpenAI Conversation
- Speech-to-text: the Wyoming Whisper engine
- Text-to-speech: the Wyoming Piper engine
Power on your Voice PE and it should appear in Home Assistant automatically via the ESPHome integration. Once discovered, open the device's settings and select the assistant you just created as its pipeline.
You should now be able to speak to the Voice PE and have it route through Whisper → Groq → Piper.
You can set up whatever HTTP service you like for handling the tool calls. I've set up a simple Node service using Fastify.
import Fastify from 'fastify'

const GROQ_API_KEY = process.env.GROQ_API_KEY
const SEARXNG_URL = process.env.SEARXNG_URL || 'http://localhost:8080'

interface SearchResult {
  title: string
  url: string
  content: string
}

interface SearxngResponse {
  results: SearchResult[]
}

interface GroqResponse {
  choices: Array<{
    message: { content: string }
  }>
}

async function search(query: string): Promise<string> {
  // Step 1: Query SearXNG metasearch engine
  const searchResponse = await fetch(
    `${SEARXNG_URL}/search?q=${encodeURIComponent(query)}&format=json`
  )
  const searchData = (await searchResponse.json()) as SearxngResponse
  const results = searchData.results.slice(0, 5)

  if (results.length === 0) {
    return "I couldn't find any relevant information for that query."
  }

  // Step 2: Build context from search results
  const context = results
    .map((r, i) => `[${i + 1}] ${r.title}: ${r.content}`)
    .join('\n\n')

  // Step 3: Summarize with Groq LLM
  const groqResponse = await fetch(
    'https://api.groq.com/openai/v1/chat/completions',
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${GROQ_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: 'llama-3.1-8b-instant',
        messages: [
          {
            role: 'system',
            content: 'Answer in EXACTLY 1-2 short sentences. Be concise and conversational, suitable for voice output.',
          },
          {
            role: 'user',
            content: `Based on these search results, answer: "${query}"\n\n${context}`,
          },
        ],
        temperature: 0.3,
        max_tokens: 150,
      }),
    }
  )
  const completion = (await groqResponse.json()) as GroqResponse
  return completion.choices[0].message.content
}

const fastify = Fastify({ logger: true })

fastify.get('/health', async () => ({ status: 'ok' }))

fastify.post<{ Body: { query: string } }>('/search', async (request, reply) => {
  const { query } = request.body
  if (!query) {
    return reply.status(400).send({ error: 'query is required' })
  }
  try {
    const message = await search(query)
    return { status: 'success', message }
  } catch (error) {
    fastify.log.error(error)
    return reply.status(500).send({ status: 'error', message: 'Search failed' })
  }
})

fastify.listen({ port: 8765, host: '0.0.0.0' })
The service runs alongside the other containers. Here's the Dockerfile:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 8765
CMD ["npm", "start"]
And add it to your docker-compose:
http-service:
  build: ./http-service
  container_name: voice-assistant-http-service
  ports:
    - "8765:8765"
  environment:
    - GROQ_API_KEY=${GROQ_API_KEY}
    - SEARXNG_URL=http://searxng:8080
  depends_on:
    - searxng
  restart: unless-stopped
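Then rebuild and start it alongside the rest of the stack:

docker compose up -d --build http-service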
You can test the endpoint directly:
curl -X POST http://localhost:8765/search \
  -H "Content-Type: application/json" \
  -d '{"query": "what is the weather in New York"}'
The full HTTP service code is available in the example repository.
Building a voice assistant on a Raspberry Pi demonstrates that practical AI applications don't require expensive cloud infrastructure for every component. By running speech processing locally and strategically using cloud APIs for LLM inference, you can create a responsive, privacy-respecting system.
The Home Assistant and Extended OpenAI Conversation integration is the key that unlocks this flexibility - it lets you define arbitrary tools that the conversation agent can invoke, turning simple REST APIs into voice-controlled capabilities. The modular architecture makes it straightforward to add new features, and the tool specification format provides a clean contract between the LLM and your services.
Whether you're automating your home, learning about AI integration, or just want a voice assistant you control, this stack provides a solid foundation.
For a complete, runnable example with all the code from this post, check out the voice-assistant-example repository.