Building Your Very Own Voice Assistant

Voice assistants have become ubiquitous, but most rely entirely on cloud services for speech processing. What if you could run the Speech to Text (STT) and Text to Speech (TTS) models locally and leverage a single cloud service for just the chat model? This article walks through the hardware and services needed to build your very own (mostly) local voice assistant.

Hardware

To interface with our models over voice, we can use the Home Assistant Voice PE, a very approachable piece of hardware from the Home Assistant team. It retails for about $60 and includes everything you need to talk to your system.

The other thing we need is a place to run Home Assistant. Since we are offloading the heavier Conversation Agent/LLM to a cloud service (Groq), the hardware requirements are pretty minimal. Something like an old Raspberry Pi 4 with 4GB+ RAM works fine. For sub-second response times, you'll want an Intel-based system like an old laptop or NUC with a Core i5 or better.

That being said, you can also run the Conversation Agent locally, but the requirements are more intensive. You could run something like llama3.1:8b on more modest hardware (a 3060 with 12GB of VRAM would work), but inference speed on consumer GPUs is much slower than on cloud providers. According to recent benchmarks, an RTX 3060 gets ~38 tokens/second on 8B models, while even an RTX 4090 tops out around 90-130 tokens/second, compared to ~500 tokens/second on Groq. For a voice assistant where response time matters, that's the difference between a snappy 1-2 second reply and an awkward 5-10 second pause. Luckily, if you don't have a gaming machine lying around, we can lean on Groq, which has a very approachable pricing tier and fast inference speeds.
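
If you do want to go fully local, the usual route is an OpenAI-compatible local server. Here's a minimal sketch, assuming Ollama (not covered further in this post) and an NVIDIA GPU with the container toolkit installed:

ollama:
  image: ollama/ollama:latest
  container_name: voice-assistant-ollama
  ports:
    - "11434:11434"
  volumes:
    - ./data/ollama:/root/.ollama
  deploy:
    resources:
      reservations:
        devices:
          # Requires the NVIDIA container toolkit on the host
          - driver: nvidia
            count: all
            capabilities: [gpu]
  restart: unless-stopped

Pull a model with docker exec -it voice-assistant-ollama ollama pull llama3.1:8b, then point the Base URL in the Extended OpenAI Conversation setup below at http://YOUR_SERVER_IP:11434/v1 instead of Groq.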

Architecture Overview

The system orchestrates multiple Docker containers, each handling a specific responsibility:

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Voice Assistant Stack                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐                                                        │
│  │ Voice Assistant │                                                        │
│  │       PE        │                                                        │
│  └────────┬────────┘                                                        │
│           │                                                                 │
│           ▼                                                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Home Assistant                              │    │
│  │                                                                     │    │
│  │   ┌─────────┐    ┌───────────────────────┐    ┌───────────┐         │    │
│  │   │ Whisper │───▶│ Extended OpenAI       │───▶│   Piper   │         │    │
│  │   │  (STT)  │    │ Conversation          │    │   (TTS)   │         │    │
│  │   └─────────┘    └───────────┬───────────┘    └───────────┘         │    │
│  │                              │                                      │    │
│  └──────────────────────────────┼──────────────────────────────────────┘    │
│                                 │                                           │
│            ┌────────────────────┴────────────────────┐                      │
│            │                                         │                      │
│            ▼                                         ▼                      │
│    ┌──────────────┐                    ┌──────────────────────────┐         │
│    │  Groq Cloud  │                    │  Web Service             │         │
│    │  (Convo      │                    │  (handles tool calls)    │         │
│    │   Agent)     │                    │  ┌────────────────────┐  │         │
│    └──────────────┘                    │  │ SearXNG (queries)  │  │         │
│                                        │  └────────────────────┘  │         │
│                                        └──────────────────────────┘         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Why Groq?

Groq offers several advantages for this use case:

  1. Speed: Their LPU (Language Processing Unit) architecture delivers inference times under 500ms for short responses
  2. Free Tier: 30 requests/minute, 14,400 requests/day, and 500K tokens/day for llama-3.1-8b-instant — more than enough for personal use (20-50 queries/day won't come close)
  3. OpenAI-Compatible API: Easy integration with Extended OpenAI Conversation
  4. Model Selection: Access to Llama 3.1 models optimized for instruction following

The llama-3.1-8b-instant model provides an excellent quality-to-speed ratio for voice assistant responses.
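
Since the API is OpenAI-compatible, you can sanity-check your key with a plain curl request before wiring anything into Home Assistant (this assumes GROQ_API_KEY is exported in your shell):

curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'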

How It Works

When you speak to the assistant:

  1. Whisper converts your speech to text
  2. Home Assistant/Extended OpenAI Conversation sends the text to Groq's LLM
  3. The LLM decides which tool (if any) to call based on your request (see the example after this list)
  4. The tool executes on the HTTP Service and returns a response
  5. The response flows back through the LLM for natural language formatting
  6. Piper converts the response to speech
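
Concretely, the decision in step 3 arrives as an OpenAI-style function call. The shape below follows the standard function-calling format; the ID and query are made up for illustration:

{
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "search_web",
        "arguments": "{\"query\": \"weather in New York today\"}"
      }
    }
  ]
}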

Software

The core voice pipeline runs through three services:

  • Home Assistant
  • Whisper
  • Piper

These handle interfacing with the Voice Assistant, but for tool calling we are going to use a web service along with a Home Assistant custom component called Extended OpenAI Conversation. The component lets us define tools for the Conversation Agent to call as it processes the intent parsed from our voice input. A web search tool, for example, looks like this:

- spec:
    name: search_web
    description: >-
      Search the internet for real-time information (prices, news, sports, stocks).
      Return ONLY the exact text from this tool. Do not add notes, disclaimers, or explanations.
    parameters:
      type: object
      properties:
        query:
          type: string
          description: The search query to find information
      required:
        - query
  function:
    type: rest
    method: POST
    resource: 'http://YOUR_SERVER_IP:8765/search'
    headers:
      Content-Type: application/json
    timeout: 60
    payload_template: '{{ {"query": query} | to_json }}'
    value_template: '{{ value_json.message }}'

This gives us an extensible pattern: tool definitions live inside Extended OpenAI Conversation, while their functionality lives behind a simple web service API.
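
To make the contract concrete, here is what one round trip looks like with this spec. The payload_template renders the model's query argument into the POST body, and the value_template pulls the message field out of whatever the service returns (the response shape matches the Fastify service shown later; the query is illustrative):

Request rendered by payload_template:
POST http://YOUR_SERVER_IP:8765/search
{"query": "what is the bitcoin price"}

Response; value_template extracts the message field:
{"status": "success", "message": "<short spoken-style answer>"}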

Docker Compose

The basic docker-compose configuration looks like this:

services:
  searxng:
    image: searxng/searxng:latest
    container_name: voice-assistant-searxng
    ports:
      - "8080:8080"
    volumes:
      - ./data/searxng:/etc/searxng:rw
    restart: unless-stopped

  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    container_name: voice-assistant-homeassistant
    network_mode: host
    volumes:
      - ./data/home-assistant:/config
      - ./home-assistant.log:/config/home-assistant.log
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=America/Chicago
    restart: unless-stopped
    privileged: true

  whisper:
    image: rhasspy/wyoming-whisper:latest
    container_name: whisper
    ports:
      - "10300:10300"
    volumes:
      - ./data/whisper:/data
    command: --model tiny
    restart: unless-stopped

  piper:
    image: rhasspy/wyoming-piper:latest
    container_name: piper
    ports:
      - "10200:10200"
    volumes:
      - ./data/piper:/data
    command: --voice en_US-lessac-medium
    restart: unless-stopped
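
One gotcha worth calling out: SearXNG serves only HTML results by default, while the HTTP service below queries it with format=json. If your searches come back as 403s, enable the JSON format in ./data/searxng/settings.yml:

search:
  formats:
    - html
    - json

With that in place, bring the whole stack up with docker compose up -d from the directory containing the compose file.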

Configuring Home Assistant

Once the containers are running, you'll need to configure Home Assistant to wire everything together.

1. Add Whisper and Piper Integrations

Go to Settings → Devices & Services → Add Integration and add the Wyoming Protocol integration twice, once per service:

  • Whisper: Enter your server IP and port 10300
  • Piper: Enter your server IP and port 10200

2. Install Extended OpenAI Conversation

This custom component isn't in the default HACS repository, so you'll need to add it manually:

  1. Install HACS if you haven't already
  2. Go to HACS → Integrations → Three dots menu → Custom repositories
  3. Add https://github.com/jekalmin/extended_openai_conversation as an Integration
  4. Search for "Extended OpenAI Conversation" and install it
  5. Restart Home Assistant

3. Configure Extended OpenAI Conversation

Go to Settings → Devices & Services → Add Integration → Extended OpenAI Conversation and configure:

  • API Key: Your Groq API key (get one at console.groq.com)
  • Base URL: https://api.groq.com/openai/v1
  • Model: llama-3.1-8b-instant

After adding, click Configure on the integration to add your tool definitions (like the search example above).

4. Create a Voice Assistant

Go to Settings → Voice assistants → Add Assistant and configure:

  • Name: Whatever you want to call it
  • Conversation agent: Select your Extended OpenAI Conversation integration
  • Speech-to-text: Select Whisper
  • Text-to-speech: Select Piper

5. Add the Voice PE Device

Power on your Voice PE and it should appear in Home Assistant automatically via the ESPHome integration. Once discovered:

  1. Go to Settings → Devices & Services and adopt the Voice PE device
  2. Go to the device page and select your voice assistant pipeline under Configuration

You should now be able to speak to the Voice PE and have it route through Whisper → Groq → Piper.

HTTP Service

You can set up whatever HTTP service you like for handling the tool calls. I've set up a simple Node service using Fastify.

import Fastify from 'fastify'

const GROQ_API_KEY = process.env.GROQ_API_KEY
const SEARXNG_URL = process.env.SEARXNG_URL || 'http://localhost:8080'

interface SearchResult {
  title: string
  url: string
  content: string
}

interface SearxngResponse {
  results: SearchResult[]
}

interface GroqResponse {
  choices: Array<{
    message: { content: string }
  }>
}

async function search(query: string): Promise<string> {
  // Step 1: Query SearXNG metasearch engine
  const searchResponse = await fetch(
    `${SEARXNG_URL}/search?q=${encodeURIComponent(query)}&format=json`
  )
  const searchData = (await searchResponse.json()) as SearxngResponse
  // Guard against a missing results array (e.g. JSON format disabled in SearXNG)
  const results = (searchData.results ?? []).slice(0, 5)

  if (results.length === 0) {
    return "I couldn't find any relevant information for that query."
  }

  // Step 2: Build context from search results
  const context = results
    .map((r, i) => `[${i + 1}] ${r.title}: ${r.content}`)
    .join('\n\n')

  // Step 3: Summarize with Groq LLM
  const groqResponse = await fetch(
    'https://api.groq.com/openai/v1/chat/completions',
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${GROQ_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: 'llama-3.1-8b-instant',
        messages: [
          {
            role: 'system',
            content: 'Answer in EXACTLY 1-2 short sentences. Be concise and conversational, suitable for voice output.',
          },
          {
            role: 'user',
            content: `Based on these search results, answer: "${query}"\n\n${context}`,
          },
        ],
        temperature: 0.3,
        max_tokens: 150,
      }),
    }
  )

  const completion = (await groqResponse.json()) as GroqResponse
  return completion.choices[0].message.content
}

const fastify = Fastify({ logger: true })

fastify.get('/health', async () => ({ status: 'ok' }))

fastify.post<{ Body: { query: string } }>('/search', async (request, reply) => {
  const { query } = request.body

  if (!query) {
    return reply.status(400).send({ error: 'query is required' })
  }

  try {
    const message = await search(query)
    return { status: 'success', message }
  } catch (error) {
    fastify.log.error(error)
    return reply.status(500).send({ status: 'error', message: 'Search failed' })
  }
})

fastify.listen({ port: 8765, host: '0.0.0.0' })

The service runs alongside the other containers. Here's the Dockerfile:

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 8765
CMD ["npm", "start"]
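
The Dockerfile assumes package.json defines build and start scripts. A minimal sketch, assuming the service compiles from src/ to dist/ with tsc:

{
  "name": "http-service",
  "private": true,
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js"
  },
  "dependencies": {
    "fastify": "^4.28.0"
  },
  "devDependencies": {
    "typescript": "^5.4.0",
    "@types/node": "^20.11.0"
  }
}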

And add it to your docker-compose:

http-service:
  build: ./http-service
  container_name: voice-assistant-http-service
  ports:
    - "8765:8765"
  environment:
    - GROQ_API_KEY=${GROQ_API_KEY}
    - SEARXNG_URL=http://searxng:8080
  depends_on:
    - searxng
  restart: unless-stopped

You can test the endpoint directly:

curl -X POST http://localhost:8765/search \
  -H "Content-Type: application/json" \
  -d '{"query": "what is the weather in New York"}'
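
If everything is wired up correctly, you should get back a JSON payload shaped like {"status": "success", "message": "..."}, where message is the short spoken-style answer that value_template hands back to the conversation agent.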

The full HTTP service code is available in the example repository.

Conclusion

Building a voice assistant on modest hardware demonstrates that practical AI applications don't require expensive cloud infrastructure for every component. By running speech processing locally and strategically using cloud APIs for LLM inference, you can create a responsive, privacy-respecting system.

The Home Assistant and Extended OpenAI Conversation integration is the key that unlocks this flexibility: it lets you define arbitrary tools that the conversation agent can invoke, turning simple REST APIs into voice-controlled capabilities. The modular architecture makes it straightforward to add new features, and the tool specification format provides a clean contract between the LLM and your services.

Whether you're automating your home, learning about AI integration, or just want a voice assistant you control, this stack provides a solid foundation.

For a complete, runnable example with all the code from this post, check out the voice-assistant-example repository.

Resources

  • Extended OpenAI Conversation
  • Wyoming Protocol Documentation
  • Home Assistant Voice Documentation
  • Groq API Documentation
  • SearXNG Documentation