Saturday, April 19, 2014

Attacking Audio "reCaptcha" using Google's Web Speech API

I had a fun project months back, Where I had to deal with digital signal processing and low level audio processing. I was never interested in DSP and all other control system stuffs, But when question arises about breaking things, every thing becomes interesting :) . In this post i'm going to share one technique to fully/ partially bypass reCaptcha test. This is not actually a vulnerability but its better if we call it "Abuse of functionality".

Disclaimer : Please remember this information is for Educational Purpose only and should not be used for malicious purpose. I will not assume any liability or responsibility to any person or entity with respect to loss or damages incurred from information contained in this article.

1. What is Captcha

A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot. The term CAPTCHA (for Completely Automated Public Turing Test To Tell Computers and Humans Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University.

2. What is Re-captcha

reCAPTCHA is a free CAPTCHA service by Google, that helps to digitize books, newspapers and old time radio shows. More details can be found here.

3. Audio reCaptcha

reCAPTCHA also comes with an audio test to ensure that blind users can freely navigate.

4. Main Idea: Attacking Audio reCaptcha using Google's Web Speech API Service

5. Google Web Speech API

Chrome has a really interesting new feature for HTML5 speech input API. Using this user can talk to computer using microphone and Chrome will interpret it. This feature is also available for Android devices. If you are not aware of this feature you can find a live demo here.

I was always very curious about the Speech recognition API of chrome. I tried sniff the api/voice traffic using Wireshirk but this API uses SSL. :(.

So finally I started browsing the Chromium source code repo. Finally I found exactly what I wanted.

It pretty simple, First the audio is collected from the mic, and then it posts it to Google web service, which responds with a JSON object with the results.  The URL which handles the request is :

Another important thing is this api only accepts flac audio format.

6. Programatically Accessing Google Web Speech API(Python)

Below python script was written to send a flac audio file to Google Web Speech API and print out the JSON response.

./ hello.flac

Accessing Google Web Speech API using Pyhon
Author : Debasish Mandal


import httplib
import sys

print '[+] Sending clean file to Google voice API'
f = open(sys.argv[1])
data =
google_speech = httplib.HTTPConnection('')
google_speech.request('POST','/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US',data,{'Content-type': 'audio/x-flac; rate=16000'})
print google_speech.getresponse().read()

7. Thoughts on complexity of reCaptcha Audio Challenges

While dealing with audio reCaptcha, you may know , it basically gives two types of audio challenges. One is pretty clean and simple (Example : . percentage of noise is very less in this type of challenges. 

Another one is very very noisy and its very difficult for even human to guess (Example : Constant hissss noise and overlapping voice makes it really difficult to crack human. You may wanna read this discussion on complexity of audio reCapctha.

In this post I will mainly cover the technique / tricks to solve the easier one using Google Speech API. Although I've tried several approaches to solve the complex one, but as I've already said, its very very had to guess digits even for human :( .

8. Cracking the Easy Captcha Manually Using Audacity and Google Speech API

Google Re-captcha allows user to download audio challenges in mp3 format. And Google web speech API accepts audio in flac format. So if we just normally convert the mp3 audio challenge to flac format of frame rate 16000 its does not work :( .  Google Chrome Speech to text api does not respond to this sound.

But after some experiment and head scratching, it was found that we can actually make Google web speech api convert the easy captcha challenge to text for us, if we can process the audio challenge little bit. In this section i will show how this audio manipulation can be done using Audacity.

To manually verify that first I'm going to use a tool called Audacity to do necessary changes to the downloaded mp3 file. 

Step 1: Download the challenge as mp3 file.
Step 2: Open the challenge audio in Audacity.

Step 3: Copy the first digit speaking sound from main window and paste it in a new window. So here we will only have a one digit speaking sound.

Step 4: From effect option make it repetitive once. (Now It should speak the same digit twice).

Lets say for example if the main challenge is  7 6 2 4 6, Now we have only first digit challenge in wav format which having the digit 7 twice.

Step 5: Export the updated audio in WAV format.
Step 6: Now convert the wav file to flac format using sox tool and send it to Google speech server using the python script posted in section 6. And we will see something like this.

Note: In some cases little bit amplification might be required if voice strength is too low.

debasish@debasish ~/Desktop/audio/heart attack/final $ sox cut_0.wav -r 16000 -b 16 -c 1 cut_0.flac lowpass -2 2500
debasish@debasish ~/Desktop/audio/heart attack/final $ python cut_0.flac 

Great! As you can see first digit of the audio challenge has been resolved by Google Speech. :) :) :) Now in same manner we can solve the entire challenge. In next section we will automate the same thing using python and it's wave module. 

9. Automation using Python and it's WAVE Module

Before we jump into processing of raw WAV audio using low level python API, its important to have some idea of how digital audio actually works. In above process we've extracted the most louder voices using audacity but to do it automatically using python, we must have some understanding of how digital audio is actually represented in numbers.

9.1. How is audio represented with numbers

There is an excellent stackoverflow post which explains the same. In short ,we can say audio is nothing but a vibration. Typically, when we're talking about vibrations of air between approximately 20Hz and 20,000Hz. Which means the air is moving back and forth 20 to 20,000 times per second. If somehow we can measure that vibration and convert it to an electrical signal using a microphone, we'll get an electrical signal with the voltage varying in the same waveform as the sound. In our pure-tone hypothetical, that waveform will match that of the sine function.

Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.  Arbitrarily, we'll change the scale on our volt meter. We'll multiple the volts by 32767. It now calls -1V -32767 and +1V 32767. Oh, and it'll round to the nearest integer.

Now after having a set of signed integers we can easily draw an waveform using the data sets.

X axis -> Time
Y axis -> Amplitude (signed integers)

Now, if we attach our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD. This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.

9.2. WAVE File Format walk through using Python

As an example lets open up a very small wav file with a hex editor.


9.3. Parsing the same WAV file using Python

The wave module provides a convenient interface to the WAV sound format. It does not support compression/decompression, but it does support mono/stereo. Now we are going to parse the same wav file using python wave module and try to relate what we have just seen in hex editor.

Let's write a python script:

import wave 
f ='sample.wav', 'r') 
print '[+] WAV parameters ',f.getparams() 
print '[+] No. of Frames ',f.getnframes() 
for i in range(f.getnframes()): 
    single_frame = f.readframes(1) 
    print single_frame.encode('hex') 

Line 1 imports python wav module.
Line 2: Opens up the sample.wav file.
Line 3: getparams() routine returns a tuple (nchannels, sampwidth, framerate, nframes, comptype, compname), equivalent to output of the get*() methods.
Line 4: getnframes() returns number of audio frames.
Line 5,6,7: Now we are iterating through all the frames present in the sample.wav file and printing them one by one.
Line 8: Closes the opened file

Now if we run the script we will find something like this:

[+] WAV parameters (1, 2, 44100, 937, 'NONE', 'not compressed')
[+] No. of Frames 937
[+] Sample 0 = 62fe    <- Sample 1
[+] Sample 1 = 99fe   <- Sample 2
[+] Sample 2 = c1ff    <- Sample 3
[+] Sample 3 = 9000
[+] Sample 4 = 8700
[+] Sample 5 = b9ff
[+] Sample 6 = 5cfe
[+] Sample 7 = 35fd
[+] Sample 8 = b1fc
[+] Sample 9 = f5fc
[+] Sample 10 = 9afd
[+] Sample 11 = 3cfe
[+] Sample 12 = 83fe
[+] ....
and so on,

It should make sense now. In first line we get number of channels, sample width , frame/sample rate,total number of frames etc etc. Which is exact same what we saw in the hex editor (Section 9.2). From second line it stars printing the frames/sample which is also same as what we have seen in hex editor. Each channel is 2 bytes long because the audio is 16 bit. Each channel will only be one byte. We can use the getsampwidth() method to determine this. Also, getchannels() will determine if its mono or stereo.

Now its time to decode each and every frames of that file. They're actually little-endian. So we will now modify the python script little bit so that we can get the exact value of each frame. We can use python struct module to decode the frame values to signed integers.

import wave 
import struct 

f ='sample.wav', 'r') 
print '[+] WAV parameters ',f.getparams() 
print '[+] No. of Frames ',f.getnframes() 
for i in range(f.getnframes()): 
    single_frame = f.readframes(1) 
    sint = struct.unpack('<h', single_frame) [0]
    print "[+] Sample ",i," = ",single_frame.encode('hex')," -> ",sint[0] 

This script will print something like this:

[+] WAV parameters (1, 2, 44100, 937, 'NONE', 'not compressed')
[+] No. of Frames 937
[+] Sample 0 = 62fe -> -414
[+] Sample 1 = 99fe -> -359
[+] Sample 2 = c1ff -> -63
[+] Sample 3 = 9000 -> 144
[+] Sample 4 = 8700 -> 135
[+] Sample 5 = b9ff -> -71
[+] Sample 6 = 5cfe -> -420
[+] Sample 7 = 35fd -> -715
[+] Sample 8 = b1fc -> -847
[+] Sample 9 = f5fc -> -779
[+] Sample 10 = 9afd -> -614
[+] Sample 11 = 3cfe -> -452
[+] Sample 12 = 83fe -> -381
[+] Sample 13 = 52fe -> -430
[+] Sample 14 = e2fd -> -542

Now what we can see we have a set of positive and negative integers. Now you should be able to connect the dots. What I have explained in section 9.1. 

So now if we plot the same positive and negative values in a graph will find complete wave form. Lets do it using python matlab module.

import wave 
import struct 
import matplotlib.pyplot as plt 

data_set = [] 
f ='sample.wav', 'r') 
print '[+] WAV parameters ',f.getparams() 
print '[+] No. of Frames ',f.getnframes() 
for i in range(f.getnframes()): 
    single_frame = f.readframes(1)
    sint = struct.unpack('<h', single_frame)[0]

This should form following graph

Now you must be familiar with this type of graph. This is what you see in SoundCloud, But definitely more complex one.

So now we have clear understanding of how audio represented in numbers. Now it will be easier for readers to understand how the python script ( shared in section 9.3 ) actually works.

9.3. Python Script

In this section we will develop a script which automate the steps we did using Audacity in Section 8. Below python script will try extract loud voices from input wav file and generate separate wav files.

Once the main challenge is broken into parts we can easily convert it to flac format and send each parts of the challenge to Google speech API using the Python script shared in section 6.

9.4. Demo:

10. Attempt to Crack the Difficult(noisy) audio challenge

So we have successfully broken down the easy challenge.Now its time to give the difficult one a try. So I started with one noisy captcha challenge. You can see the matlab plot of the same noisy audio challenge below.

In above figure we can understand presence of a constant hisss noise. One of the standard ways to analyze sound is to look at the frequencies that are present in a sample. The standard way of doing that is with a discrete Fourier transform using the fast Fourier transform or FFT algorithm. What these basically in this case is to take a sound signal isolate the frequencies of sine waves that make up that sound.

10.1. Signal Filtering using Fourier Transform

Lets get started with a  simple example. Consider a signal consisting of a single sine wave, s(t)=sin(w∗t). Let the signal be subject to white noise which is added in during measurement, Smeasured(t)=s(t)+n. Let F be the Fourier transform of S. Now by setting the value of F to zero for frequencies above and below w, the noise can be reduced. Let Ffiltered be the filtered Fourier transform. Taking the inverse Fourier transform of Ffiltered yields Sfiltered(t). 

The way to filter that sound is to set the amplitudes of the fft values around X Hz to 0. In addition to filtering this peak, It's better to remove the frequencies below the human hearing range and above the normal human voice range. Then we recreate the original signal via an inverse FFT.

I have written couple of scripts which successfully removes the constant hiss noise from the audio file but main challenge is the overlapping voice. Over lapping voice makes it very very difficult even for human to guess digits. Although I was not able to successfully crack any of given difficult challenges using Google Speech API still I've shared few noise removal scrips (using Fourier Transform). 

These scripts can be found in the GitHub project page. There is tons of room for improvement of all this scripts.

11. Code Download

Every code I've written during this project is hosted here:  

12. Conclusion

When I reported this issue to Google security team, they've confirmed that, this mechanism is working as intended. The more difficult audio patterns are only triggered only when abuse/non-human interaction is suspected. So as per the email communication noting is going to be changed to stop this.

Thanks for reading. I hope you have enjoyed. Please drop me an email/comment in case of any doubt and confusion.

13. References