Wednesday, January 25, 2012

Bypass Captcha using Python and Tesseract OCR engine

A CAPTCHA is a type of challenge-response test used in computing as an attempt to ensure that the response is generated by a person. The process usually involves one computer (a server) asking a user to complete a simple test which the computer is able to generate and grade. The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford (all of Carnegie Mellon University). It is an acronym based on the word "capture" and stands for "Completely Automated Public Turing test to tell Computers and Humans Apart".

In this post I am going to show you how to crack weak captchas using Python and the Tesseract OCR engine. A few days back I was playing around with a web application that used a captcha as an anti-automation technique on its user feedback form.

First, let me give you a brief idea of how the captcha worked in that web application.
Inspecting the captcha image, I found that the form loads it in this way:
<img src="http://www.site.com/captcha.php"> 
From this you can easily see that the “captcha.php” file returns an image.
Each time we access the URL http://www.site.com/captcha.php, it generates an image with a new random number.
To make this clearer, let me give you an example.
Suppose that after opening the feedback form you get a few text fields and a captcha, and that the captcha loads with a number, e.g. "4567".
If you enter the code "4567", the form will be submitted successfully.

Now the most interesting thing: if you copy the captcha image URL (http://www.site.com/captcha.php in this case) and open it in a new tab of the same browser, the captcha will load with a different number, as I mentioned earlier. Suppose you get "9090" this time. If you now try to submit the feedback form with the number that was loaded with the form earlier ("4567"), the application will not accept it. If you enter “9090” instead, the application will accept the form.
For a clearer idea, I have created this simple figure.


My strategy to bypass this anti-automation technique was:
1) Download only the image from
http://www.site.com/captcha.php
2) Feed that image to the OCR engine
3) Craft an HTTP POST request with all required parameters and the decoded captcha code, and POST it.
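
As a rough sketch of step 3, the request could be built with Python 3's standard library as shown here. The feedback URL, the form field names (name, message, captcha) and the helpers make_opener and build_feedback_request are all hypothetical; the real ones have to be read from the target form. The sketch also assumes that the server ties the stored captcha value to a session cookie, so the same cookie-aware opener would be used for both the image download and the POST.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical form URL; the post itself only names captcha.php.
FEEDBACK_URL = "http://www.site.com/feedback.php"


def make_opener():
    """Return an opener that keeps cookies between requests (assumed to be
    how the server links the captcha image to the later form POST)."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_feedback_request(captcha_code, name, message, url=FEEDBACK_URL):
    """Build (but do not send) a POST request carrying the decoded code.

    The field names below are placeholders, not taken from the real form.
    """
    fields = {"name": name, "message": message, "captcha": captcha_code}
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


request = build_feedback_request("4567", "tester", "hello")
print(request.get_method(), request.data.decode("ascii"))
```

Sending it would then be a matter of calling open(request) on the opener that was also used to fetch captcha.php.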

So what is happening here?

When you request the image file, the server performs steps 1 to 5 as shown in the figure.
When we then post the HTTP request, the server compares the received captcha code with the value it stored temporarily. Because that value comes from the image we just downloaded and decoded, the code will match (provided the OCR read it correctly) and the server will accept the form.
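
The behaviour described above can be modelled with a small simulation. This is only a guess at what captcha.php does on the server, written in Python rather than PHP, and the class and method names are made up for illustration. The point it shows is that each image request overwrites the stored code, so only the most recently served code is valid.

```python
import random


class CaptchaSession:
    """Toy model of one visitor's server-side session (hypothetical)."""

    def __init__(self, rng=None):
        self._rng = rng or random.Random()
        self._stored_code = None

    def request_image(self):
        """Simulate a hit on captcha.php: make a new code and store it.

        A real server would return an image of the code; this sketch
        returns the digits directly.
        """
        self._stored_code = "%04d" % self._rng.randint(0, 9999)
        return self._stored_code

    def submit_form(self, code):
        """Simulate the form POST: accept only the currently stored code."""
        return self._stored_code is not None and code == self._stored_code


session = CaptchaSession()
first = session.request_image()    # loaded together with the form
second = session.request_image()   # image URL opened again in a new tab
print("latest code accepted:", session.submit_form(second))
```

Unless the two random codes happen to coincide, submit_form(first) returns False here, which matches the "4567" versus "9090" example.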

I used this Python script to automate the entire process.


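# Python 2 script (print statement, xrange, urllib.urlretrieve).
from PIL import Image, ImageEnhance
from pytesser import *
from urllib import urlretrieve

def get(link):
    # Download the captcha image to a local file.
    urlretrieve(link, 'temp.png')

get('http://www.site.com/captcha.php')

# Enlarge the image 5x so the digits are easier to recognise.
im = Image.open("temp.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("temp2.png")

# Preview only: show the original with 30% more contrast.
# The enhanced image is not saved or passed to the OCR engine.
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")

# Remove background noise: every pixel that is not pure black becomes white.
imgx = Image.open('temp2.png')
imgx = imgx.convert("RGBA")
pix = imgx.load()
for y in xrange(imgx.size[1]):
    for x in xrange(imgx.size[0]):
        if pix[x, y] != (0, 0, 0, 255):
            pix[x, y] = (255, 255, 255, 255)
imgx.save("bw.gif", "GIF")

# Resize to a fixed size and save as TIFF, a format Tesseract accepts.
original = Image.open('bw.gif')
bg = original.resize((116, 56), Image.NEAREST)
ext = ".tif"
bg.save("input-NEAREST" + ext)

# Run the OCR engine on the cleaned image and print the decoded text.
image = Image.open('input-NEAREST.tif')
print image_to_string(image)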

Here I am only posting the OCR part of the code. If you are a Python lover like me, you can use the "httplib" Python module to do the rest. This script is not independent: the pytesser Python module is required to run it. PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script.

You can get this package at http://code.google.com/p/pytesser/

The script works in this way:
1) First the script downloads the captcha image using the "urlretrieve" function from Python's urllib module.
2) After that it enlarges the image so the OCR engine can read it better, and cleans up the background noise.
3) Finally, it feeds the processed image to the OCR engine.
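
The noise-cleaning step is the least obvious part, so here is a minimal pure-Python sketch of the same idea on a tiny grid of RGBA tuples instead of a PIL image. The function names and the sample grid are made up for illustration; the rule itself (keep pure black, turn everything else white) is the one used in the script above.

```python
BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def clean_noise(pixels):
    """Return a copy of the grid where every non-black pixel is white.

    `pixels` is a list of rows, each row a list of (R, G, B, A) tuples.
    """
    return [[px if px == BLACK else WHITE for px in row] for row in pixels]


def to_ascii(pixels):
    """Render the grid as text: '#' for black, '.' for anything else."""
    return "\n".join(
        "".join("#" if px == BLACK else "." for px in row) for row in pixels
    )


GREY = (120, 120, 120, 255)     # background speckle
DARK = (10, 10, 10, 255)        # nearly black, still treated as noise
sample = [
    [GREY, BLACK, GREY],
    [DARK, BLACK, WHITE],
    [GREY, BLACK, GREY],
]
cleaned = clean_noise(sample)
print(to_ascii(cleaned))
```

Such a strict test for pure black only suits captchas whose digits are drawn in solid black; for other images a brightness threshold would probably be needed instead.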
Here is another Python script which is very useful while testing captchas. You can add these lines to your script if the target captcha image is too small. This script changes the resolution of any image.


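from PIL import Image, ImageEnhance

# Enlarge the image 5x with bicubic interpolation and save the result.
im = Image.open("test.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("final_pic.png")

# Preview only: show the original with 30% more contrast (nothing is saved).
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")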
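
To give a feel for what a contrast factor of 1.3 means, the sketch below applies the same kind of adjustment to a short list of greyscale values in pure Python. It assumes the enhancer pushes each pixel away from the image's mean brightness by the given factor, which as far as I know is how PIL's ImageEnhance.Contrast behaves. The function name is made up, and the rounding may differ slightly from PIL's.

```python
def enhance_contrast(values, factor):
    """Scale each 0-255 grey value's distance from the mean by `factor`."""
    mean = int(sum(values) / len(values) + 0.5)
    result = []
    for v in values:
        stretched = mean + factor * (v - mean)
        # Clamp to the valid 8-bit range.
        result.append(max(0, min(255, int(round(stretched)))))
    return result


greys = [100, 150, 200]
print(enhance_contrast(greys, 1.3))
```

With a factor of 1.0 the values come back unchanged, and with larger factors dark pixels get darker and light pixels lighter, which helps separate digits from a faint background.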

Thanks for reading. I hope it was helpful. Feel free to share and drop comments.