Voice Dictation15 min read

What Is Voice Activity Detection (VAD) and Why Does It Matter? — Rota AI

Karthik Krishnan
Karthik KrishnanFounder
February 2, 202615 min read

What Is Voice Activity Detection (VAD) and Why Does It Matter?

TL;DR: Voice activity detection (VAD) is the part of a voice app that decides whether you are actually talking or not. It sits between your microphone and the transcription model. Without it, your dictation app would try to transcribe your fan, your dog, and every breath you take. Rota AI uses Silero VAD, an open source model that runs locally. The threshold is configurable. Lower catches more speech but also more noise. Higher is picky but cleaner. Here is everything you need to know.


The Problem VAD Solves

Here is something most people never think about. When you open a voice dictation app, your microphone is always listening. Always. Every sound in your room is getting captured and digitized and processed.

Your laptop fan spinning. Your roommate watching YouTube. You shifting in your chair. You breathing.

If the app tried to transcribe all of that, the output would be absolute garbage. Pages of random words from your fan noise. Your dog barking becoming "the door is talking." It would be unusable.

VAD exists to solve exactly this problem. It is the gatekeeper. It decides which chunks of audio are actual human speech and which ones are just noise. Only the speech chunks get sent to the transcription model. Everything else gets discarded.

Think of it like a bouncer at a club. Your voice gets in. Your fan does not.

I did not fully appreciate how important VAD was until I started building Rota AI. My first prototype did not have VAD at all. It just sent everything to Whisper. The results were hilarious and also completely useless. My air conditioner was "speaking" more than I was.


What Is Voice Activity Detection (VAD), Exactly?

Voice activity detection is a signal processing technique that determines whether a segment of audio contains human speech. That is it. That is the whole job.

The input is raw audio. The output is a simple yes or no. Speech or not speech. Some systems output a probability instead of a binary decision, which is more useful in practice because you can tune how aggressive the detection should be.

VAD has been around for decades. It was originally developed for telephony and VoIP systems. When you are on a phone call, VAD helps the system stop transmitting audio when nobody is talking, which saves bandwidth. The same idea applies to voice dictation, except instead of saving bandwidth, you are saving transcription compute and avoiding garbage output.

There are different approaches to VAD:

Energy based VAD is the simplest. It measures the volume of the audio. If the volume goes above a threshold, it assumes someone is talking. This works ok in quiet environments but falls apart fast when there is background noise. A loud fan could trigger it. A whisper might not.

Frequency based VAD looks at the spectral characteristics of the audio. Human speech has specific frequency patterns that are different from most background noise. This is more robust than energy alone but still not great in noisy environments.

Machine learning based VAD is the modern approach. A neural network is trained on thousands of hours of audio to distinguish speech from non speech. This is what most modern voice apps use, including Rota AI. It handles background noise, different accents, quiet speech, and all the edge cases that trip up simpler methods.


How Rota AI Uses Silero VAD

Rota AI uses Silero VAD. It is an open source voice activity detection model that runs entirely on your machine. No cloud. No API calls. No latency from network round trips.

Silero VAD is fast. Like really fast. It can process audio faster than real time on a CPU. That means it does not add any noticeable delay to the dictation pipeline. You talk, VAD processes the audio in milliseconds, and the speech segments get queued for transcription.

Here is how it fits into the Rota AI pipeline:

  1. Audio capture continuously buffers microphone input
  2. Each buffer chunk gets passed to Silero VAD
  3. Silero returns a probability (0 to 1) that the chunk contains speech
  4. If the probability is above the threshold, the chunk is kept
  5. If it is below, the chunk is discarded
  6. When speech ends, the kept chunks are concatenated and sent to Whisper for transcription

The default threshold in Rota AI is 0.5. That means if Silero thinks there is a 50% or higher chance the audio contains speech, it counts. This works well for most people in most environments.

But here is the thing. The right threshold depends on your specific setup. Your microphone. Your room. Your speaking volume. There is no universal perfect value.


How to Tune the VAD Threshold

This is where it gets practical. If you are having issues with Rota AI, the VAD threshold is the first thing I would check.

Lowering the threshold (try 0.3 or 0.4):

Use this if Rota AI is cutting off the beginning of your sentences. Or if it is missing words when you speak softly. A lower threshold means the system is more permissive. It will let more audio through, including quieter speech.

The tradeoff is it will also let more background noise through. If you are in a coffee shop, a low threshold might pick up the espresso machine and try to transcribe it. your mileage may vary depending on how noisy your environment is.

Raising the threshold (try 0.6 or 0.7):

Use this if Rota AI is picking up background noise and transcribing things you did not say. Or if it is not detecting when you stop talking, so there is a long pause before the text appears.

A higher threshold means the system is stricter. It only counts audio as speech if it is pretty sure. The tradeoff is it might miss quiet words or the beginning of sentences where your voice is still ramping up.

My recommendation: Start at 0.5. If words get cut off, lower it by 0.1. If you get noise transcription, raise it by 0.1. Small adjustments make a big difference. I spent an entire weekend testing different thresholds when I was building Rota AI and the difference between 0.4 and 0.5 was surprisingly significant.

You can change the threshold in Rota AI settings. It takes effect immediately. No restart needed. Just toggle dictation off and on and test it out.


Common VAD Issues and How to Fix Them

Issue 1: First Word Gets Cut Off

This is the number one complaint I get. You start speaking and the first word or two is missing from the transcription. "Send a message to Karthik" becomes "a message to Karthik."

What is happening: VAD detected your speech a few hundred milliseconds after you actually started talking. By the time it flipped to "yes, this is speech," you were already on the second or third word.

The fix: Rota AI has a pre buffer setting. It keeps the last 200 milliseconds of audio in memory and prepends it to the start of each speech segment. This catches the beginning of your sentence that VAD might have missed. Make sure this is enabled in settings.

Also try lowering the VAD threshold slightly. If you start sentences softly (like most people do), a high threshold will miss the onset.

I dealt with this issue for weeks. It was driving me crazy. I would say "Hey Rota" and it would transcribe "Rota." Just "Rota." The pre buffer fixed it for the most part. It is not perfect. If you start with a very quiet breathy sound, even the pre buffer might not catch it. But for normal speech, it works well.

Issue 2: Background Noise Gets Transcribed

Your fan, your air conditioner, traffic outside, your mechanical keyboard. VAD thinks these are speech and sends them to Whisper. Whisper, being the overachiever it is, tries to transcribe them. You get random words that you never said.

What is happening: The noise is loud enough or speech like enough that VAD classifies it as speech. This is especially common with consistent background noise that has some frequency overlap with human voice.

The fix: Raise the VAD threshold. This is the most direct solution. Also, if your noise is consistent (like a fan), some VAD models can actually adapt to it over time. Silero is pretty good at this but it is not magic.

A better microphone helps a lot here. A directional mic that picks up sound primarily from your mouth and rejects sound from other directions will give VAD a much cleaner signal to work with. I switched from my laptop mic to a cheap Fifine USB condenser and the background noise issue dropped by like 70%.

Issue 3: VAD Does Not Detect When You Stop Talking

You finish speaking and nothing happens. The app just sits there waiting. After a few seconds, the text finally appears. It feels sluggish.

What is happening: VAD is not confident that you have stopped talking. Maybe there is background noise that it is interpreting as continued speech. Or your trailing words were too quiet.

The fix: There is usually a silence timeout setting. This is the amount of silence Rota AI waits for before deciding you are done speaking. If it is set too high, you will feel that lag. Try reducing it. The default is usually around 1 to 2 seconds of silence.

Also check your VAD threshold. If it is too low, background noise might be keeping the "speech detected" flag active even after you stop talking.

Issue 4: Choppy Transcription in Noisy Environments

You are in a busy office or a coffee shop. The transcription keeps starting and stopping. You get fragments instead of full sentences.

What is happening: VAD is rapidly toggling between speech and not speech because the background noise is right around your threshold level. Every time someone walks by or a cup clinks, VAD thinks you started talking again.

The fix: Raise the threshold to be above the noise floor. Use a headset mic if possible. Headset mics are the best for noisy environments because they are right next to your mouth and pick up way less ambient sound.

TBH if you are in a really loud environment, no VAD is going to save you completely. The signal to noise ratio is just too low. I have tried using Rota AI on a busy train and it was rough. Not the VAD's fault. Physics is physics.


Why VAD Matters for Dictation Accuracy

People think dictation accuracy is all about the transcription model. Whisper is good so the dictation must be good. But that is only half the story.

VAD accuracy directly impacts transcription accuracy. Here is why:

If VAD cuts off the beginning of a sentence, Whisper gets incomplete audio. It might guess what the missing words were but it is just guessing. The transcription will be wrong.

If VAD lets background noise through, Whisper tries to transcribe it. It will output words that were never spoken. Now your dictation has hallucinations that you have to manually delete.

If VAD does not detect silence correctly, sentences might get merged together or split at weird points. This messes up the cleanup step too, because the AI cleanup model expects properly segmented speech.

I would argue that VAD is the most underappreciated part of the voice dictation pipeline. It is not glamorous. Nobody writes blog posts about it. Well, until now I guess. But getting VAD right makes a bigger difference in daily use than upgrading from Whisper small to Whisper medium.


A Quick Story About VAD Gone Wrong

When I was first testing Rota AI at my desk, I had a small desk fan running. Nothing crazy. Just a gentle breeze.

Every time I stopped talking, Rota AI would transcribe about two seconds of "the sound continues" or "background present" or some other nonsense. The fan noise was right at the edge of the VAD threshold. Sometimes it counted as speech, sometimes it did not.

I kept raising the threshold but then it started cutting off my actual words. I was stuck. Too low and the fan talks. Too high and I lose words.

The solution was embarrassingly simple. I moved the fan behind me instead of next to me. The mic picked up less fan noise and suddenly the VAD threshold worked perfectly. No code changes. No settings tweaks. Just moved a fan.

Sometimes the best engineering solution is the dumbest one. I should have thought of that before spending three hours tweaking threshold values.


VAD vs Other Approaches

You might be wondering: do you even need VAD? Can you just transcribe everything and filter the results after?

Technically yes. But it is a bad idea for several reasons:

Cost. Transcription costs money if you are using a cloud API. Sending silence and noise to the API means you are paying for nothing. With local models, you are wasting CPU cycles. Either way, it is wasteful.

Speed. Transcribing silence takes time. If half your audio is silence, you are doubling the transcription time. VAD lets you skip the silence entirely.

Accuracy. Whisper is good but it is not perfect. When you feed it silence, it sometimes hallucinates words. Better to not give it the chance.

Latency. For real time dictation, every millisecond counts. VAD is nearly instant. Transcription is not. Skipping transcription on non speech audio keeps the system responsive.

Some apps use a push to talk button instead of VAD. You hold a key while you speak. That works but it defeats the purpose of hands free dictation. The whole point is you should not have to touch your keyboard. VAD enables true hands free operation.


Frequently Asked Questions

What is voice activity detection in simple terms? VAD is a system that listens to your microphone and decides whether you are talking or not. It filters out silence and noise so only your actual speech gets transcribed.

Does Rota AI use VAD? Yes. Rota AI uses Silero VAD, an open source machine learning model that runs locally on your computer. It is fast, accurate, and does not require an internet connection.

What is the best VAD threshold for Rota AI? The default is 0.5 and it works for most people. If words get cut off, try 0.4. If you get background noise in your transcription, try 0.6. Small adjustments make a big difference.

Why does Rota AI cut off the first word I say? This is a VAD timing issue. Enable the pre buffer in settings. It keeps a small chunk of recent audio and prepends it to catch the beginning of your speech. Also try lowering the VAD threshold slightly.

Can I disable VAD in Rota AI? You can effectively disable it by setting the threshold to 0.0, which means everything counts as speech. I do not recommend this. Your transcription will be full of garbage from background noise.

Does VAD work with accents? Yes. Silero VAD is trained on diverse audio data and handles different accents well. It is detecting speech patterns, not specific words or pronunciations.

Does VAD use a lot of CPU? No. Silero VAD is extremely lightweight. It uses a tiny fraction of your CPU. You will not notice any performance impact.

What microphone works best with VAD? Any microphone works, but a directional mic or headset mic gives VAD the best signal to work with. The cleaner the input, the better VAD can distinguish speech from noise.

Is VAD the same as noise cancellation? No. Noise cancellation tries to remove background noise from the audio signal. VAD decides whether the audio contains speech at all. They are complementary technologies. Rota AI uses VAD. Some microphones or audio drivers handle noise cancellation separately.


Final Thoughts

Voice activity detection is one of those things you do not notice until it goes wrong. When it works, you just talk and text appears. When it does not work, you get cut off words, background noise transcriptions, and a lot of frustration.

The key takeaway is this: VAD is not set and forget. Your environment changes. Your microphone placement changes. Your speaking volume changes. The threshold that worked yesterday might not work today. Knowing that VAD exists and knowing how to tune it will make your voice dictation experience significantly better.

If you are using Rota AI and something feels off with your dictation, check the VAD settings first. It is the most common culprit and the easiest to fix.

And if all else fails, move the fan.


This post is part of the Rota AI blog. VAD is one of those topics that does not get enough attention, so I wanted to give it the deep dive it deserves. If you found this useful, check out my post on how AI voice dictation works, which covers the full pipeline from microphone to text on screen.

About the Author
Karthik Krishnan
Karthik KrishnanFounder

Founder & Developer

I built Rota because I didn't have $15 to pay for a dictation tool per month, so I built my own.

Related articles