I have been interested in the voice control of computers for a long time. My first attempt was around 10 years ago, and I had some success with it. In the right environment, I was able to say commands to my computer and it would respond based on what I said. The problem was that I didn’t have a practical use for it yet. It was clear in this early testing that using a keyboard and mouse was far more convenient, reliable and quick than using voice. It will remain that way for many of the standard interactions (e.g. email, Facebook) we have with computers, at least in the short term.
The day Microsoft Kinect was launched in Australia, I saw the promotional video showing people waving their arms around to navigate through their media centre. It seemed to me that this would be a fairly unreliable and exhausting way to control anything, apart from games specifically designed for the technology. I was way too lazy to consider using this technology into the future.
I concluded that voice is the simplest way to control anything, and that it always will be. This led me to start playing around with voice control again. I ran through the voice tutorials and was able to get the computer to understand my voice some of the time. It did stuff up on me a whole lot, but it was clearly much more reliable than software I had used in the past.
Now around 6 months on, I have written an AutoHotkey script and a WSR macro that interact with Windows Media Center and Windows Speech Recognition software, allowing my media centre to be controlled completely by voice. This is a practical use for voice control. I can navigate faster with my voice than I can with a remote control. Instead of needing to know which button to press on my remote (or remotes), I simply speak my mind. I no longer use a remote at all. This is something I have wanted for a long time and I am excited about this outcome.
This system far exceeds any other voice control setup on the market today in terms of reliability and practicality. Most systems have failed in the past not because the software was inadequate for the task (the software has worked fine for many years), but because of the environment they were used in. My solution tackles these environmental issues. Rather than trying to make technology that works in our environment, my solution changes the environment to enable the technology to work. I believe it is inevitable that all future voice control systems will need to take this approach to work.
This article will give you all the information you need to control your Windows Media Center home theatre PC with your voice. I will provide the easy-to-edit scripts and show you how to install them on your PC. I will also explain what works and what doesn’t, as well as why previous attempts have not been successful. The more I explain how it all works, the easier it will be for you to set it up and get it working reliably. This will not be as easy as installing the software and having the results you want right away. You will need to train it to recognise your voice, and you will need to learn the correct commands to say. A solution that can understand the whole English language is a long way off. It is much more difficult to synthesize human understanding than it is for a computer to understand dictation. That is why we need to have set commands.
There is a video of my home theatre PC running this system after the jump.
What you need: The costs
- A Windows 7 PC running Windows Media Center 7
- A decent microphone. I have been using this one from Logitech which retails for $50, though it is not hard to find cheaper. This microphone picks me up from anywhere in the room and has noise cancelling features. I plan on putting in boundary microphones in the future.
How it works:
This system uses Windows Speech Recognition (WSR) to understand the commands we say. This is free software that is built into every Windows 7 PC. WSR Macros runs in parallel with WSR to perform actions based on what we say. I used this to create voice tags which link in with keyboard shortcuts. For example, saying “Stop TV” will send the keystroke Ctrl+s. Because keyboard shortcuts are being used, the xml files could easily be altered to work with XBMC or Media Portal.
The second component is what makes my system work. The AutoHotkey script waits for a predefined button press (0 or [). When that button is pressed, the volume of the media drops to a level low enough that WSR will not mistake the remaining sound for a command. It then opens up communication to WSR, welcoming our commands as we talk at a natural volume.
When we have finished saying our commands, we can either say “Play TV” or press the button again. This will turn off WSR, put the volume back to a suitable listening level and then resume playback of our media.
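The trigger behaviour described above can be sketched as a small state machine. This is a Python model for illustration only, not the AutoHotkey script itself: the class name, the volume numbers and the choice to restore the previous volume (rather than a fixed listening level) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class VoiceTrigger:
    """Toy model of the trigger button: it toggles between playback and listening."""

    volume: int = 80          # current media volume in percent (hypothetical value)
    ducked_volume: int = 5    # hypothetical level quiet enough for WSR to ignore
    listening: bool = False   # True while WSR is accepting commands
    _saved_volume: int = 0    # volume to restore when listening ends

    def press_trigger(self) -> None:
        """Model one press of the trigger key (0 or [ in the article)."""
        if self.listening:
            self.stop_listening()
        else:
            self.start_listening()

    def start_listening(self) -> None:
        # Drop the media volume first, so WSR hears the person and not the speakers.
        self._saved_volume = self.volume
        self.volume = self.ducked_volume
        self.listening = True

    def stop_listening(self) -> None:
        # Reached by a second trigger press or by the "Play TV" command.
        self.listening = False
        self.volume = self._saved_volume

    def hear(self, phrase: str) -> bool:
        """Return True if the phrase was accepted as a command."""
        if not self.listening:
            return False  # WSR is off, so speech and media audio are both ignored
        if phrase.lower() == "play tv":
            self.stop_listening()
        return True


if __name__ == "__main__":
    trigger = VoiceTrigger(volume=100)   # stereo on full blast
    trigger.press_trigger()              # volume drops, WSR starts listening
    print(trigger.listening, trigger.volume)
    trigger.hear("Play TV")              # WSR off, volume restored
    print(trigger.listening, trigger.volume)
```

The point of the model is that the playback volume never matters: whatever it was, it is lowered before listening starts and put back afterwards.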
Using this method, we don't care how loud the volume is when the media is playing. It makes no difference if the stereo is on full blast because the trigger button will drop the level to a volume at which we can talk comfortably to the system. I have been unable to find any products or patents that do this, and it is a necessary step for voice control to work. The alternative is to yell over the sound coming out of the speakers while hoping that WSR doesn't pick up music as commands.
Windows Media Center has voice control built in, but it suffers the same problems as all the other products on the market. The volume needs to be impractically low for it to work. These built in commands do not clash with the commands built into my system. They work in conjunction with each other, and they only work when a trigger has been pressed.
All of the software is free.
Please don't be put off by the list of programs below. Each of them is needed for the system to work unless noted otherwise, and they are all fairly straightforward to set up. The format they are in allows you to easily make any changes to the voice tags. The system is very open so you can use it as you want.
- Windows Speech Recognition: This software is available from the Speech Recognition Applet within your Control Panel. I feel it has never been given the props it deserves. I can't ask for much more than 100% accuracy when using this system with a headset. It does have a requirement that you use the English (US) language, so if your system is currently running a different language, you will need to change it in the Regional and Language Applet of the Control Panel.
- WSR Macros: This program works with WSR. It lets us install xml scripts which define the commands the computer will listen for. WSR Macros can be downloaded for free from Microsoft.
- AutoHotkey: This is not required, but it is what I wrote the script in. If you want to make changes to the script, you will need this program. It is available from Autohotkey.com.
- The following programs and scripts are included in the Voice Control.zip file.
- Nircmd.exe: It is used by the script to change the system volume. It is truly a fantastic piece of freeware.
- Speech Recognition Control.exe & Speech Recognition Control - No Eventghost.exe: These are compiled versions of Speech Recognition Control.ahk. They manage the trigger function and volume changes. Use No Eventghost if you do not yet have EventGhost installed.
- Media Center.xml: This script contains the majority of the commands I have set up for WSR Macros.
- Sydney TV Channels.xml: This is a sample script which looks after TV channel control.
To launch Windows Speech Recognition, go into your Control Panel and launch the Speech Recognition Applet. There is not much to set up in this software. Within the options, you will probably want to tick the setting to load it at startup. I also turned off audible feedback and the dictation scratchpad.
Download and install WSR Macros. You will need to have this loading at startup. The easiest way to do this is to make a shortcut to it in the Startup folder of your start menu. At the same time, unzip the zip file to the Speech Macros folder within your Documents folder. This folder will automatically be created when WSR Macros is installed.
The options for both WSR and WSR Macros can be found in your task tray near your clock.
Copy and paste the entire Media Center.xml into the text box. Click Next.
You can call the file anything you want, but I call mine Media Center.WSRMac.
Ensure you digitally sign the file. This will ensure the script is allowed to run on your computer. This has advantages later on also. The alternative to digitally signing your files is to drop the security level down. I would recommend against reducing your security.
Repeat the steps above for the Sydney TV Channels.xml file. You will probably want to change the xml file to suit your needs. The code is as follows.
The <listenFor> command determines what phrase you say to go to the particular channel. It is best to make it 3 or more syllables, e.g. instead of saying “GO”, say “channel go”. You can have as many phrases for each channel as you want.
The <sendKeys> command determines the keyboard button to press. This example presses 2 on the keyboard, which will go to channel 2.
The <sendKeys>\</sendKeys> is necessary as it turns off speech recognition and turns the volume back up to listening volume.
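As a rough illustration of how these pieces fit together, the Python sketch below parses a small macro and checks that every command ends with the `\` keystroke. The macro text is a hypothetical stand-in, not the contents of Sydney TV Channels.xml: the `<speechMacros>` and `<command>` wrapper elements, the phrases and the helper names are assumptions, so compare it against the real file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical macro: the layout is assumed and the phrases are made up.
MACRO_XML = r"""
<speechMacros>
  <command>
    <listenFor>channel two</listenFor>
    <listenFor>a b c two</listenFor>
    <sendKeys>2</sendKeys>
    <sendKeys>\</sendKeys>
  </command>
</speechMacros>
"""


def load_commands(xml_text):
    """Return one (phrases, keystrokes) pair for each <command> element."""
    root = ET.fromstring(xml_text.strip())
    commands = []
    for command in root.iter("command"):
        phrases = [node.text.strip() for node in command.findall("listenFor")]
        keys = [node.text for node in command.findall("sendKeys")]
        commands.append((phrases, keys))
    return commands


def missing_reset(commands):
    """List the phrases of commands that do not finish with the '\\' keystroke."""
    return [phrases for phrases, keys in commands if not keys or keys[-1] != "\\"]


if __name__ == "__main__":
    for phrases, keys in load_commands(MACRO_XML):
        print(phrases, "->", keys)
    print("commands without the closing keystroke:", missing_reset(load_commands(MACRO_XML)))
```

A check like `missing_reset` is handy after editing channel files by hand, since a command without the final `\` would leave speech recognition on and the volume down.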
If you are willing, please send me your altered TV Channels.xml files so I can share them among readers and save them from having to go through this step.
Finally, create a shortcut in the Startup folder of your Start menu to Speech Recognition Control.exe.
If you have installed the system correctly, after you reboot, you should see a green H icon, WSR icon and WSR Macro icon in your system tray. You are now ready to test it out. The trigger button is either “[” or “0” as determined by Speech Recognition Control.exe. In effect, you should be able to press the 0 button on your keyboard and the system will be triggered. In most cases the 0 button of your remote control will also work.
Training is an essential part of getting this to work. Without going through the training process, this is unlikely to work for you. The process is simple, but it will take each member of your family a bit of time. Each training session takes around 10 minutes.
I needed to make the system understand my Australian accent so I had to be very patient with it. I ran through the training session 3 times in the first week. I then spent a lot of time using the system to have it understand me well. It took around 3 weeks before it really got a hold of my voice. Many of the problems were related to me saying commands that didn't exist yet, but you should not have the same problem if you use the “What can I say” sheet. I am now 6 months into this project and it understands me around 95% of the time (given the right environment). I hope the rest will be solved by using better equipment.
Open up the Speech Recognition applet from the Control Panel. Click on “Train your computer to better understand you”.
Follow the prompts. You will see the screen to your right. Speak the text in the box. Not only will the tutorial learn your voice very effectively, you will learn the many features of Windows Speech Recognition. It is really quite amazing what the system is capable of.
Once the computer understands the line you have spoken, it will progress automatically. I found that it did not progress each time. I would wait 2 seconds and then repeat the line. This felt like the system wanted me to say it again to get another sample of my voice. Some pages I needed to repeat up to 4 times.
It is worth doing this training session a few times over the following days or weeks. Putting in the time early on will save frustration later on.
Once training has taken place, my understanding is that the system will continue to better understand you over time. That is how it has been for me. When the computer does understand you, you are able to transfer your voice profile from computer to computer, so you will only need to go through it once.
Troubleshooting and Tips:
- Many of the problems are likely to come from inadequate training. If you haven't trained it, you may have trouble seeing the system work effectively.
- When editing the xml files, alter the xml file rather than the macro. Ensure you delete unnecessary macros so that any two macros don't conflict.
- Ensure the correct microphone is selected. Using the “Setup your microphone” option within WSR is a good place to start.
- This system uses keyboard shortcuts, so if you press any of [, ], \, 0, * or -, the function will occur rather than the character appearing on the screen. These buttons are rarely used in a media centre setup. A dedicated machine is recommended to avoid this problem. The keys can be changed in Speech Recognition Control.ahk if necessary.
- A reboot is often a sure-fire way of getting it going again. I do this weekly, but I needed to do it more frequently when I was going through the training.
When it will work:
- It will work where there is a quiet environment without much background noise.
- It will work if you train it adequately, and continue to improve the more you use it.
- If you live alone, this is likely to work fantastically well for you.
- If you are a quiet and patient person.
- When everyone in the household uses it.
- It will work exceptionally well in a sound proofed room.
When it will not work:
- If there is any background noise i.e. all the noise we have always taken for granted.
- If the shopping is being put away
- If a neighbour is mowing their lawn
- If kitchen appliances are running near your TV room, e.g. a dishwasher
- If the rain is loud on the roof
- If someone is eating chips in the TV room
- If you have young kids, the technology is unlikely to work for them or around them.
- When inadequate training has taken place.
- If you yell at it – a natural tone works best. If it isn’t working, there is likely another problem.
- In a corporate environment. Each person would need their own office which would be a good outcome, but also an unrealistic one.
- If your system is busy with other tasks and resources are drained. For example, my system struggles a little when recording 4 channels at once.
Resolving the problems:
Sound proofing the room is a solid way of fixing some of the problems, with the added benefit that sound won’t escape from the room. Saying this, it won’t be a practical option for many people.
Some of the problems can be controlled somewhat by isolating the rooms which are going to use this technology. We can put noise gates and effects on the intercepted voice, but this is only going to go so far. Sound can only be altered so much, and the human voice uses a huge frequency range, so to cut out any frequencies will affect the incoming signal detrimentally. No matter how long we wait into the future, technology will have very little impact in resolving the majority of these issues.
As I said earlier, we need to create an environment that allows the system to operate. This is likely to impact on the ways that we want to use it. Some considerations will need to be made by people in your household for voice control to work effectively in your home. The majority of the problems are cultural and can be resolved by changing how and when we do things. It will be up to us individually to adapt to the technology to give it every opportunity to work. I believe that using the technology (and making minor alterations in the way we do things and in our homes to accommodate it) will result in very positive outcomes for everyone who grabs it with two hands and runs with it. Time will tell…
This system is much more complicated than simply installing software and watching it do its thing. If someone talks to a person who is using the system while the microphone is open, the system will fail. Naturally a little anger may go towards that person. The same will happen to you when you talk to them while they are using the system. The result could be that we give up on using the system altogether, or we all take a bit of extra time to think before we talk so as to be sure that the system isn’t in use. The only technology I can think of that may help is a light that goes on when the microphone is open (On Air). But as far as I can think, that is as close as technology can get to fixing such problems.
Why use voice?:
Keyboards and mice have been necessary tools to help us communicate with computers over the last 50 years. They were created because there was no way to control computers by voice at the time. Now that computers are fast enough for voice control to work, we can start commanding a computer using our natural form of communication, speech. In time, we will be able to control all electrical devices in our home in much simpler ways.
Each voice tag can perform any task or tasks we want. The result is that we can say one word to set off a series of tasks automatically. A good example is changing volume. If we want to change the volume from 40% to 20%, using a traditional remote, we would need to press the volume down button 20 times. With voice, we can go directly to the volume we want.
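To make the volume example concrete, here is a small Python sketch that compares the number of remote presses with a single absolute volume command. It assumes each remote press moves the volume by 1%, and it builds a command line for Nircmd's `setsysvolume` action, which takes a value from 0 to 65535. The function names are hypothetical, and the article's own script may set the volume differently.

```python
NIRCMD_MAX = 65535  # nircmd's setsysvolume scale runs from 0 to 65535


def presses_needed(current, target, step=1):
    """Remote presses needed when each press moves the volume by `step` percent."""
    distance = abs(current - target)
    return -(-distance // step)  # ceiling division


def volume_command(percent):
    """Build the nircmd command line that jumps straight to `percent` volume."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    level = round(percent * NIRCMD_MAX / 100)
    return ["nircmd.exe", "setsysvolume", str(level)]


if __name__ == "__main__":
    print("remote presses from 40% to 20%:", presses_needed(40, 20))
    # One voice tag can run a single command instead, for example via
    # subprocess.run(volume_command(20)) on a PC where nircmd.exe is on the PATH.
    print("voice command runs:", " ".join(volume_command(20)))
```

However far apart the two levels are, the voice version stays a single step, which is where the time saving comes from.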
Anything a computer can control can be triggered by a voice tag. This has great potential in home automation, and controlling anything electronic around the house. This is better than the current home automation where the switches are relocated to a single panel on a wall. Other tasks can be automated without any human input. This is the start of a fully encompassing system which will make our lives easier.
Speech Recognition vs Voice Control:
Voice control and speech recognition are very different things. Speech recognition is when we expect a computer to recognise anything we say, and then have the computer convert it to text. This is very complicated because our language has many thousands of words and even more combinations of words. Everyone’s voice is different and accents and lisps have a huge impact on what the computer hears. I am yet to see a free program that does this reliably.
Voice control is based upon a set of predefined commands, where the computer decides which command it should perform based on what it hears us say. This narrows what the computer expects to hear from us down to a few hundred commands. Because of this, voice control is much simpler to get working, and the results are much more reliable.
How and why my system is different to others:
The main difference between my system and others is the volume drop that occurs before we give commands. This is what makes it all possible and practical. Without this functionality, it will never work in rooms we listen to our media in. It is possible in quieter rooms to have the computer respond to a keyword to open up listening for commands, but as this system will already be in place, I will probably retain it throughout the house as this system develops.
This same volume drop allows for the use of a microphone that is free standing. We don’t need to use a headset for the system to work. This keeps our hands free, which is pretty much the point of using voice in the first place.
It would be very easy to have the computer tell me what it is doing each time I say a command, as most other commercial packages do. But I don’t want a relationship with my computer. I don’t want it to take 5 seconds to tell me it is going to do something I don’t want. When I give a command, I want the computer to perform that command instantly without whingeing. That is why I have not added this feature.
It would be very easy to add in commands for playing genres or artists. The code already exists here (look for Windows Media Player), but yet again it complicates the list of words by adding in thousands of extra commands the computer needs to choose between. The script would need to be altered slightly so it does not clash with my commands. For these reasons I chose to take it out of this script. It will be easy to add this functionality later, but the way I use the system, I don’t see much benefit to it apart from it being a neat trick. (Update: The included Advanced commands script will add in this functionality. I have removed some of the ambiguity requests. It seems to work quite well for me now.)
I have made the commands a little longer than what some other systems use. This helps the software better understand what it is meant to do. For example, there is not a big difference between the sound of the words play and replay. I have extended them to be Instant Replay and Play TV, which are both very distinct-sounding commands. This makes the system a whole lot more reliable. Once I have installed adequate sound equipment in my house, it may be possible for the computer to differentiate between the shorter versions, in which case I will change the script to respond to those words. In the meantime our voice profile is learning our voices to give much more reliable responses into the future. (Update: The new script includes both the longer and shorter commands. The longer commands are more reliable, but the shorter commands do work when the computer understands them.)
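A rough way to see why the longer commands are easier to tell apart is to compare how much of their spelling they share. The Python sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in. This is only a text comparison and an assumption on my part: WSR works on sound, not spelling, so treat the scores as an analogy rather than a measurement of what the recogniser hears.

```python
from difflib import SequenceMatcher


def similarity(first, second):
    """Return a 0..1 score of how much two command phrases overlap as text."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


if __name__ == "__main__":
    short_pair = similarity("play", "replay")
    long_pair = similarity("Play TV", "Instant Replay")
    print(f"play vs replay:            {short_pair:.2f}")
    print(f"Play TV vs Instant Replay: {long_pair:.2f}")
    # The short pair overlaps far more than the long pair, which is the
    # property the longer commands are chosen for.
```

The same helper could be run over every pair of phrases in a macro file to flag commands that look too alike before they cause trouble.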
This system works for me exactly as I have explained in this article. Hundreds and hundreds of hours have gone into these scripts and macros through testing and tweaking. I am hopeful this will work really nicely for you. The format it is in opens doors for easy expandability for future projects. There are many commands which stretch well outside of Windows Media Center’s normal capabilities. Please keep coming back to Inspect My Gadget to see where this system leads me. There are plenty more projects in the pipeline, and voice control allows what was once my imagination to become a reality.
Remember, most of the problems will be able to be resolved by changing your environment or training the system more. There is a good chance I will delete comments complaining that the software doesn’t work for the first few weeks as it is not helpful to anyone. It took months to learn how to type, and this requires the same effort. If you believe it doesn’t work, I think it is likely you will never have voice control in your home.