By Craig Anderton
As far as I’m concerned, the vocal is the most important part of a song: It’s the conversation that forms a bond between performer and listener, the teller of the song’s story, and the focus to which other instruments give support.
And that’s why you must handle vocals with kid gloves. Too much pitch correction removes the humanity from a vocal, and getting overly aggressive with composite recording (the art of piecing together a cohesive part from multiple takes) can destroy the continuity that tells a good story. Even too much reverb or EQ can mean more than bad sonic decisions, as these can affect the vocal’s emotional dynamics.
But you also want to apply enough processing to make sure you have the finest, cleanest vocal foundation possible—without degrading what makes a vocal really work. And that’s why we’re here.
Vocals are inherently noisy: You have mic preamps, low-level signals, and significant amounts of amplification. Furthermore, you want the vocalist to feel comfortable, and that too can lead to problems. For example, I prefer not to sing into a mic on a stand unless I’m playing guitar at the same time; I want to hold the mic, which opens up the potential for mic handling noise. Pop filters are another issue: some engineers don’t like using them, but they may be necessary to cut out low-frequency plosives. In general, I think you’re better off placing fewer restrictions on the vocalist and fixing things in the mix rather than having the vocalist think too hard about, say, mic handling. A great vocal performance with a small pop or tick trumps a boring, but perfect, vocal.
Okay, now let’s prep that vocal for the mix.
The first thing I do with a vocal is turn it into one long track that lasts from the start of the song to the end, then export it to disk for bringing into a digital audio editing program. Despite the sophistication of host software, with a few exceptions (Adobe Audition and Samplitude come to mind), we’re not quite at the point where the average multitrack host can replace a dedicated digital audio editor.
Once the track is in the editor, the first stop is generally noise reduction. Sound Forge, Adobe Audition, and Wavelab have excellent built-in noise reduction algorithms, but you can also use stand-alone programs like iZotope’s outstanding RX 2. The general procedure is to capture a “noiseprint” of the noise, then the noise reduction algorithm subtracts that from the signal. This requires finding a portion of the vocal that consists only of hiss, saving that as a reference sample, then instructing the program to subtract anything with the sample’s characteristics from the vocal (Fig. 1).
Fig. 1: A good noise reduction algorithm will not only reduce mic preamp hiss, but can help create a more “transparent” overall sound. This shot from iZotope RX (the precursor to RX 2) shows the waveform in the background that's about to be de-noised, and in the front window, a graph that shows the noise profile, input, and output.
There are two cautions, though. First, make sure you sample the hiss only; you’ll need only a hundred milliseconds or so. Second, don’t apply too much noise reduction; 6-10dB should be enough, for reasons that will become clear in the next section. Otherwise, you may remove parts of the vocal itself or add artifacts, both of which contribute to artificiality.
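To make the “noiseprint” idea concrete, here’s a minimal spectral-subtraction sketch in Python with NumPy. It’s illustrative only: dedicated tools like RX 2 use far more sophisticated algorithms, and the frame size, the 10dB reduction cap, and the function names here are my own assumptions, not anything from a shipping product.

```python
import numpy as np

def denoise(signal, noise_sample, frame=1024, max_reduction_db=10.0):
    """Subtract a learned noise 'print' from each frame of the signal.
    The reduction is capped (10dB here) because over-aggressive noise
    reduction removes parts of the vocal and adds artifacts."""
    # Noiseprint: average magnitude spectrum of the noise-only excerpt
    n_frames = len(noise_sample) // frame
    profile = np.mean([np.abs(np.fft.rfft(noise_sample[i*frame:(i+1)*frame]))
                       for i in range(n_frames)], axis=0)
    floor = 10 ** (-max_reduction_db / 20)   # minimum per-bin gain
    out = np.zeros(len(signal))
    for i in range(len(signal) // frame):
        chunk = signal[i*frame:(i+1)*frame]
        spec = np.fft.rfft(chunk)
        mag = np.abs(spec)
        # Attenuate each bin by however much it resembles the noiseprint
        gain = np.maximum((mag - profile) / np.maximum(mag, 1e-12), floor)
        out[i*frame:(i+1)*frame] = np.fft.irfft(spec * gain, n=frame)
    return out
```

Note how the per-bin gain never drops below the floor, which mirrors the “6-10dB is enough” advice above.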
Removing the hiss makes for a much more open vocal sound that also prevents “clouding” the other instruments.
Now that we’ve reduced the overall hiss level, it’s time to delete all the silent sections (which are seldom truly silent) between vocal passages. If we do this, the voice will mask hiss when it’s present, and when there’s no voice, there will be no hiss at all.
Some programs offer an option to essentially gate the vocal, and use that as a basis to remove sections below a particular level. While this semi-automated process saves time, sometimes it’s better (albeit more tedious) to remove the space between words manually. This involves defining the region you want to remove; from there, different programs handle creating silence differently. Some will have a “silence” command that reduces the level of the selected region to zero. Others will require you to alter the level, such as reducing the volume to “-Infinity” (Fig. 2).
Fig. 2: Cutting out all sound between vocal passages will help clean up the vocal track. Note that with Sound Forge, an optional automatic crossfade can help reduce any abrupt transition between the processed and unprocessed sections.
Furthermore, the program may introduce a crossfade between the processed and unprocessed section, thus creating a less abrupt transition; if it doesn’t, you’ll probably need to add a fade-in from the silent section to the next section, and a fade-out when going from the vocal into a silent section.
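The semi-automated gate described above can be sketched in a few lines of NumPy. This is a toy version under assumptions of my own (the -60dB threshold, 10ms analysis windows, and 10ms fades are illustrative); smoothing the on/off mask into short ramps stands in for the automatic crossfade a program like Sound Forge can apply.

```python
import numpy as np

def gate_silence(audio, sr, thresh_db=-60.0, fade_ms=10.0):
    """Zero the 'silent' (but seldom truly silent) gaps between phrases.
    Short ramps at each boundary avoid abrupt transitions."""
    hop = max(int(sr * 0.01), 1)             # 10ms analysis windows
    thresh = 10 ** (thresh_db / 20)
    mask = np.zeros(len(audio))
    for i in range(0, len(audio), hop):
        w = audio[i:i+hop]
        if len(w) and np.sqrt(np.mean(w**2)) >= thresh:
            mask[i:i+hop] = 1.0              # window contains vocal
    # Turn hard mask edges into short fade-ins/fade-outs
    fade = max(int(sr * fade_ms / 1000), 1)
    env = np.convolve(mask, np.ones(fade) / fade, mode='same')
    return audio * env
```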
I feel that breath inhales are a natural part of the vocal process, and it’s a mistake to get rid of these entirely. For example, an obvious inhale cues the listener that the subsequent vocal section is going to “take some work.”
That said, applying any compression later on will bring up the level of any vocal artifacts, possibly to the point of being objectionable. I use one of two processes to reduce their level.
The first option is to simply define the region with the artifact, and reduce the gain by 3-6dB (Fig. 3). This will be enough to retain the essential character of an artifact, but make it less obvious compared to the vocal.
Fig. 3: The highlighted section is an inhale, which is about to be reduced by about 7dB.
The second option is to again define the region, but this time, apply a fade-in (Fig. 4). This also may provide the benefit of fading up from silence if silence precedes the artifact.
Fig. 4: Imposing a fade-in over an artifact is another way to control a sound without killing it entirely.
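Both artifact-taming options boil down to simple gain manipulation over a defined region. Here’s a hedged sketch (the function names, region indices, and default -6dB are my own; any DAW or editor does this interactively):

```python
import numpy as np

def duck_region(audio, start, end, gain_db=-6.0):
    """Option 1: reduce an inhale or other artifact by a few dB,
    keeping its character while tucking it behind the vocal."""
    out = audio.copy()
    out[start:end] *= 10 ** (gain_db / 20)
    return out

def fade_in_region(audio, start, end):
    """Option 2: ramp the artifact up from zero, which also gives you a
    fade-up from any silence that precedes it."""
    out = audio.copy()
    out[start:end] *= np.linspace(0.0, 1.0, end - start)
    return out
```

The same `fade_in_region` idea is what minimizes a p-pop: the length of the ramp controls how much of the “p” gets through.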
Speaking of fade-ins, they're also useful for reducing the severity of "p-pops" (Fig. 5). This is something that can be fixed within your DAW as well as in a digital audio editing program.
Fig. 5: Splitting a clip just before a p-pop, then fading in, can minimize the p-pop. The length of the fade can even control how much of the "p" sound you want to let through.
Mouth noises can be problematic, as these are sometimes short, “clicky” transients. In this case, sometimes you can just cut the transient and paste some of the adjoining signal on top of it (choose an option that mixes the signal with the area you removed; overwriting might produce a discontinuity at the start or end of the pasted region).
A lot of people rely on compression to even out a vocal’s peaks. That certainly has its place, but there’s something else you can try first: Phrase-by-phrase normalization.
Unless you have the mic technique of a k.d. lang, the odds are excellent that some phrases will be softer than others—not intentionally, as part of the song’s natural dynamics, but as a result of poor mic technique, running out of breath, etc. If you apply compression, the lower-level passages might not be affected very much, whereas the high-level ones will sound “squashed.” It’s better to edit the vocal to a consistent level before applying any compression, as this will retain more overall dynamics. If you later need to add an element of expressiveness that wasn’t in the original vocal (e.g., the song gets softer in a particular place, so you need to make the vocal softer), you can do so with judicious use of automation.
Unpopular opinion alert: Whenever I mention this technique, self-appointed “audio professionals” complain in forums that I don’t know what I’m talking about, because no real engineer ever uses normalization. However, no law says you have to normalize to zero—you can normalize to any level. For example, if a vocal is too soft but part of that is due to natural dynamics, you can normalize to, say, -6dB compared to the rest of the vocal’s peaks. (On the other hand, with narration, I often do normalize everything to as consistent a level as possible, as most dynamics in narration occur within phrases.)
Referring to Fig. 6, the upper waveform is the unprocessed vocal; the lower waveform shows the results of phrase-by-phrase normalization. Note how the level is far more consistent in the lower waveform.
Fig. 6: In the lower waveform, the sections in lighter blue have been normalized. Note that these sections have a higher peak level than the equivalent sections in the upper waveform.
However, be very careful to normalize entire phrases. You don’t want to get so involved in this process that you start normalizing, say, individual words. Any given phrase has its own internal dynamics, and you definitely want to retain them.
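Phrase-by-phrase normalization is just per-region peak scaling to a chosen target, which the sketch below shows in NumPy. The -6dB target and region indices are stand-ins; in practice you’d pick the target by ear relative to the vocal’s other peaks.

```python
import numpy as np

def normalize_phrase(audio, start, end, target_db=-6.0):
    """Scale one whole phrase so its peak hits target_db below full scale.
    Working phrase by phrase (never word by word) preserves each
    phrase's internal dynamics, since every sample in the phrase is
    scaled by the same factor."""
    out = audio.copy()
    peak = np.max(np.abs(out[start:end]))
    if peak > 0:
        out[start:end] *= (10 ** (target_db / 20)) / peak
    return out
```

Because the whole phrase gets one gain factor, the soft and loud moments inside it keep their relationship—exactly what word-by-word normalization would destroy.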
DSP is a beautiful thing: Now our vocal is cleaner, of a more consistent level, and has any annoying artifacts tamed — all without reducing any natural qualities the vocal may have. At this point, you can start doing more elaborate processes like pitch correction (but please, apply it sparingly and rarely!), EQ, dynamics control, and reverb. But as you add these, you’ll be doing so on a firmer foundation.
Craig Anderton is Editor in Chief of Harmony Central and Executive Editor of Electronic Musician magazine. He has played on, mixed, or produced over 20 major label releases (as well as mastered over a hundred tracks for various musicians), and written over a thousand articles for magazines like Guitar Player, Keyboard, Sound on Sound (UK), and Sound + Recording (Germany). He has also lectured on technology and the arts in 38 states, 10 countries, and three languages.