How to Increase the Accuracy of Automatic Captions

Closed captioning is the display of text on a video player to visually communicate spoken dialogue. The audience for closed captioning is typically individuals who are hearing impaired. Captions can also be used to display translations for non-native speakers. Many video producers and instructional technology professionals are tasked with adding closed captions to all their video content, and they run into two major challenges: captioning is expensive and it is time-consuming. Luckily, Ensemble Video offers an automatic caption service that allows quick, easy and affordable creation of closed captions. The challenge with automatic captions is that the accuracy rate varies with the audio quality, so how do you increase the accuracy of automatic captions? The answer is simple: record high-quality audio. This article provides several tips for improving the audio quality of your videos so you can increase the accuracy of automatic captions.

Audio is the Most Important Component of Video


Is audio the most important component of video? We can argue that point all day, but we do know that audio is one of the most important components of video. Because the most common types of videos uploaded to the Ensemble Video platform are produced to support teaching, learning and communication, good-quality audio is critical to the success of the video. Additionally, many of the videos in Ensemble are captioned by our automatic captions service, and good-quality audio leads to a higher accuracy rate for automatic captions. Lower caption accuracy is not the only issue that comes with poor audio quality. Viewers may struggle to hear the video, or they might not understand the speech at all. In both situations, the viewer will give up or won’t learn anything from the video. If you are a media producer or you create your own content, you know all the work that goes into making a great video, so you cannot leave your students or audience hanging with poor-quality audio.

Why You Should Caption Your Video

It is also important to mention that many schools, colleges and organizations have hearing-impaired students and are required to provide accessible, closed-captioned video for their viewers. Additionally, when captions are added to a video, viewers can search through the captions and jump directly to points of interest inside your videos. These are both excellent reasons to caption content, but captioning isn’t cheap.


Closed Captioning Example

Many of the closed caption services charge $60–$175 per hour. If you are producing a lot of video content throughout the year, that can add up. For every 100 hours of video it would cost $17,500 to caption the content at $175 per hour (or $6,000 at $60 per hour). One of the most common solutions to the cost problem is using speech-to-text technology that automatically creates captions for your videos. These automatic captions are generated by machine learning algorithms. Ensemble Video’s automatic captions solution starts at $3 per hour (or $300 for every 100 hours of content).
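As a quick check of the arithmetic above, here is a minimal Python sketch. The function name captioning_cost and the labels in RATES are illustrative only and are not part of any Ensemble Video tool; the rates are the ones quoted above ($60 to $175 per hour for caption services, $3 per hour for automatic captions).

```python
def captioning_cost(hours: float, rate_per_hour: float) -> float:
    """Return the total captioning cost in dollars for a number of video hours."""
    return hours * rate_per_hour


# Rates quoted in the article, in dollars per hour of video.
RATES = {
    "caption service (low end)": 60.0,
    "caption service (high end)": 175.0,
    "automatic captions": 3.0,
}

# Compare the cost of captioning 100 hours of video at each rate.
for label, rate in RATES.items():
    print(f"{label}: ${captioning_cost(100, rate):,.0f} per 100 hours")
```

At $3 per hour, the automatic rate works out to $0.05 per minute of video.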

Many experienced video professionals have tried automatic captioning services in the past, and they know that poor audio quality is one of the biggest obstacles to generating highly accurate closed captions. For automatic closed captioning technologies to accurately transcribe speech, the audio must be crisp, clear and easy to understand. The problem is that many videos don’t contain great-quality audio because of common issues, which range from improper mic placement to problems with the surrounding recording environment. The most common audio quality problems in eLearning and communications video include:

  • Picking up background noises and unnecessary chatter
  • Equipment sounds (fans, hums, buzzing, air conditioners, etc.)
  • Being too far from the microphone
  • Being too close to the microphone
  • Poor quality microphone
  • Rooms with an echo
  • Audio feedback
  • Wind noise

How To Produce Better Quality Audio?

So, the obvious question is, how do you produce better-quality audio to ensure quality automatic captions? Without going too far into the weeds of audio production, there are three key components to producing good audio: use a quality microphone, place the microphone in the appropriate location and optimize the recording environment.

Quality Microphone

The first step in improving audio quality to ensure accurate automatic captions is using the right microphone for the job. The latest and greatest video cameras and webcams boast the highest image quality and visual capabilities, but that is only half the battle. Many times, the built-in microphone of your camera, computer or mobile device is not good enough. If you are using the on-board microphone and find that the accuracy of your automatic captions is lower than you would like, you might want to invest in a quality microphone. Keep in mind that a quality microphone does not have to be an expensive microphone. There are several microphones that can be purchased for less than $100. Here are a few types of microphones that you may want to consider based on the type of video you are creating.


If you are creating screencasts and eLearning videos, a decent USB or Bluetooth headset microphone is a valuable tool. You don’t need a studio-level mic to produce great audio quality for your eLearning experience, just something better than the built-in microphone on your computer. Headsets make a huge difference; they are super easy to use and generally inexpensive. Moreover, they can serve double duty by also providing better quality when you’re using Skype or Google Hangouts day to day.

Shotgun Mic

If you are producing video and you need to focus on a human voice from a distance while rejecting other sounds from the side and rear, a shotgun microphone is a good option. A shotgun mic is a long, cylindrical microphone designed with a narrow range of focus, making it a great option for picking up voices in a recording location. You can connect a shotgun microphone to an HD video camera, an iPad or iPhone, or an Android device.

Omnidirectional Mic

If you need to record all of the voices (or sounds) in a room, an omnidirectional microphone is an option to consider. Omnidirectional microphones pick up sound equally from all sides or directions of the microphone. This can be very useful in applications where sound needs to be recorded from multiple directions. Some great use cases for an omnidirectional microphone are round table discussions and multi-speaker productions where sound will be coming from the front, back, left or right side of the microphone. With that said, picking up sound from all directions may be undesirable. An example of this is recording a professor’s lecture in a classroom. If only the professor’s lecture needs to be recorded, without any noise coming from the students behind, an omnidirectional microphone should not be used.

Unidirectional Mic

If only the voice of the professor at the front of a lecture hall needs to be recorded, without any noise coming from the students behind, a unidirectional microphone is a perfect fit. Unidirectional microphones are designed to focus their sound pick-up on a specific side or direction of the microphone. The only issue with this type of microphone is that the sound must be coming from the correct side of the microphone, typically referred to as the “voice side” of the microphone, to ensure quality sound for your recording.

Lapel or Lavalier Mic

A lavalier microphone is attached to a subject’s clothing to record audio from a single speaker. Lavalier mics are less noticeable than larger microphones, and a wireless lapel mic can give a person hands-free freedom of movement while maintaining consistent audio quality. Lavalier microphones (also known as lav, lapel or lap microphones) are specialized microphones; they are ideal for recording a single speaker, but they are not as versatile as shotgun microphones and directional microphones.

Placement of Microphone

The golden rule of microphone placement is to get the distance right based on the type of microphone that you are using. In general, if you are trying to produce great-quality audio to increase the accuracy of automatic captions, place the microphone as close as practical to the sound source without getting so close that you introduce unwanted effects. A mic placed too close to a speaker will produce unwanted pops and booms, and might be a nuisance for the speaker. A mic placed too far away will produce audio that is undesirably quiet.

Try to achieve a good balance between the subject’s voice and the ambient noise. For example, if you are recording an interview with a lavalier microphone, you will need to place the microphone about 6 inches from the subject’s mouth and ensure there is very little ambient noise. If you are recording a lecture in a classroom using a unidirectional microphone, you will want to make sure the voice side of the microphone is close to the instructor to get a quality audio recording without picking up the surrounding noise from students, air conditioners, projectors, etc. What is close? Having the unidirectional microphone within a few feet of the lecturer would be good; you certainly don’t want to place it outside the range of the microphone (for example, 15+ feet).


Finally, please don’t be misled—closer is not always better; it is possible to get too close. Here are some examples:

  • If a lavalier microphone is too close to the speaker’s mouth, the audio may be unnaturally boomy, and you are also likely to experience popping and other unpleasant noises.
  • A unidirectional microphone that is too close to loud sound sources is likely to create distorted and/or “hot” audio. Have you ever seen a guest speaker “eat the mic” and heard the effects of that? You’ll likely get loud feedback in an amplified environment, and the viewers will be covering their ears or turning down the volume on the player.
  • Placing a mic too close to moving parts, rustling papers or mechanical sounds (projectors, computer fans, or air conditioners) may produce negative results. For example, if you place an omnidirectional microphone next to an air conditioner you run the risk of getting too much air and not enough voice.
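One way to sanity-check microphone distance before uploading a video is to measure the level of a short test recording. The sketch below is only an illustration, not part of the Ensemble Video service: it assumes the test recording is a 16-bit PCM WAV file, the function names are hypothetical, and the thresholds (an average level below -35 dBFS counts as "too quiet", and more than 0.1% of samples at full scale counts as "too hot") are rough assumptions that you should tune for your own equipment.

```python
import math
import sys
import wave
from array import array

FULL_SCALE = 32768  # magnitude of the largest 16-bit PCM sample


def read_wav_samples(path):
    """Read a 16-bit PCM WAV file into an array of signed samples."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples


def rms_dbfs(samples):
    """Average (RMS) level in dB relative to full scale; -inf for silence."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square / FULL_SCALE ** 2)


def classify_level(samples, quiet_dbfs=-35.0, clip_fraction=0.001):
    """Label a recording "too hot", "too quiet", "ok" or "empty".

    Both thresholds are assumed starting points, not standards.
    """
    if not samples:
        return "empty"
    # Samples at (or one step below) full scale suggest clipping.
    clipped = sum(1 for s in samples if abs(s) >= FULL_SCALE - 1)
    if clipped / len(samples) > clip_fraction:
        return "too hot"
    if rms_dbfs(samples) < quiet_dbfs:
        return "too quiet"
    return "ok"


# Demonstration on a synthetic one-second 440 Hz tone instead of a real file.
tone = [int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000)]
print(classify_level(tone), round(rms_dbfs(tone), 1), "dBFS")
```

For a real recording, you would call classify_level(read_wav_samples("test.wav")), where test.wav is a placeholder file name. A result of "too quiet" suggests moving the microphone closer, and "too hot" suggests backing it away or lowering the input gain.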

The Recording Environment

In an ideal world, you have a recording studio where you can control all the sound, which translates into quality automatic captions. But since you likely don’t work for a movie production company or have access to a quality recording studio, you should find a way to optimize the recording environment. Optimizing the environment means eliminating as much background noise as possible (and potentially making some modifications to soften the room). For example, avoid places that have loud HVAC systems, are near dinging elevators or microwaves, have barking dogs or audible street noises, or are very reverberant. Rooms that don’t have carpeting or anything on the walls are likely to have significant echoing. If you want to soften a reverberant room, bring in things like couch cushions, moving blankets, or sound absorption panels.

Control the Equipment and Environment to Increase the Accuracy of Automatic Captions

The more you control the equipment and recording environment, the better the audio you can produce, which will increase the quality of your automatic captions. The recommendations above should help you develop a strategy for recording your eLearning videos and communications content. You do not have to be a millionaire or an expert: just record in an appropriate location, choose the right microphone for the recording and place the microphone properly. You will be able to produce great audio and increase the accuracy of automatic captions. Don’t forget, after the automatic captions are created, you can edit them in the integrated Amara Caption Editor to make all your content accessible and searchable with Ensemble Video’s in-video search feature!

Start Captioning for $0.05/min or Less!