Wednesday, January 25, 2012

Bypass Captcha using Python and Tesseract OCR engine

A CAPTCHA is a type of challenge-response test used in computing as an attempt to ensure that the response is generated by a person. The process usually involves one computer (a server) asking a user to complete a simple test which the computer is able to generate and grade. The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford (all of Carnegie Mellon University). It is an acronym based on the word "capture" and stands for "Completely Automated Public Turing test to tell Computers and Humans Apart".

In this post I am going to show you guys how to crack weak captchas using Python and the Tesseract OCR engine. A few days back I was playing around with a web application. The application was using a captcha as an anti-automation technique when taking user feedback.

First let me give you guys a brief idea of how the captcha worked in that web application.
Inspecting the captcha image, I found that the form loads it in this way:
<img src="http://www.site.com/captcha.php">
From this you can easily see that the "captcha.php" file returns an image.
Each and every time we access the URL http://www.site.com/captcha.php, it generates an image with a new random number.
To make this clearer, let me give you an example.
Suppose that after opening the feedback form you get a few text fields and a captcha. Suppose that at a certain time the captcha loads with a number, e.g. "4567".
If you enter the code "4567", the form will be submitted successfully.

Now the most interesting thing was that if you copy the captcha image URL (http://www.site.com/captcha.php in this case) and open the image in a new tab of the same browser, the captcha will load with a different number, as I told you earlier. Suppose you get "9090" this time. Now if you try to submit the feedback form with the number that was loaded earlier with the form ("4567"), the application will not accept it. If you enter "9090", the application will accept the form.
For a clearer idea I have created this simple figure.
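
The behaviour described above can also be modelled in a few lines of Python. This is only a sketch of what the server appears to do, not the application's real code: the class `CaptchaSession` and its methods are hypothetical, and it assumes the server keeps just one expected code per visitor and overwrites it every time captcha.php is requested.

```python
import random


class CaptchaSession:
    """Hypothetical model of the per-visitor state kept by the server."""

    def __init__(self):
        self.expected = None  # code generated by the most recent image request

    def request_image(self):
        # Every hit on captcha.php creates a fresh code and overwrites the old one.
        self.expected = "%04d" % random.randint(0, 9999)
        return self.expected  # stands in for the digits drawn on the image

    def submit_form(self, code):
        # The form is accepted only if the code matches the latest stored value.
        return self.expected is not None and code == self.expected


session = CaptchaSession()
first = session.request_image()     # captcha shown with the feedback form
second = session.request_image()    # same URL opened again in a new tab
print(session.submit_form(second))  # True: the latest code is accepted
```

The code shown with the form ("4567" in the example) is rejected simply because it is no longer the stored value once the image has been requested a second time.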


Now my strategy to bypass this anti-automation technique was:
1) Download only the image from
http://www.site.com/captcha.php
2) Feed that image to the OCR engine
3) Craft an HTTP POST request with all required parameters and the decoded captcha code, and POST it.

Now what is happening here?

When you request the image file, the server performs steps 1 to 5 as shown in the figure.
Now when we post the HTTP request, the server matches the received captcha code against the value that was temporarily stored. Because the stored value is the one generated for the image we have just downloaded, the code will match (as long as the OCR read it correctly) and the server will accept the form.
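
Steps 1 and 3 could be sketched roughly like this with the Python 3 standard library (the script in this post is Python 2). The form URL, the field names `name`, `message` and `captcha`, and the helper names are all hypothetical; read the real ones from the form's HTML. The sketch also assumes that the server ties the stored code to a session cookie, so the image download and the POST have to go through the same opener.

```python
import http.cookiejar
import urllib.parse
import urllib.request

CAPTCHA_URL = "http://www.site.com/captcha.php"
FORM_URL = "http://www.site.com/feedback.php"  # hypothetical form target


def make_opener():
    # One opener with one cookie jar, so both requests share a session.
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def download_captcha(opener, path="temp.png"):
    # Step 1: fetch the image; the server stores the matching code now.
    with opener.open(CAPTCHA_URL) as resp, open(path, "wb") as out:
        out.write(resp.read())
    return path


def build_post(fields, captcha_code):
    # Step 3: form-encode the fields plus the decoded captcha code.
    data = dict(fields, captcha=captcha_code)  # "captcha" is an assumed field name
    body = urllib.parse.urlencode(data).encode("ascii")
    return urllib.request.Request(FORM_URL, data=body, method="POST")


# Usage (needs network access and an OCR step in between):
# opener = make_opener()
# image_path = download_captcha(opener)
# code = ...  # step 2: OCR result for image_path
# opener.open(build_post({"name": "test", "message": "hello"}, code))
```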

I used this Python script to automate the entire process.


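# Written for Python 2 (print statement, xrange, urllib.urlretrieve)
from PIL import Image
from PIL import ImageEnhance
from pytesser import *
from urllib import urlretrieve

def get(link):
    # Download the captcha image to a local file
    urlretrieve(link, 'temp.png')

get('http://www.site.com/captcha.php')

# Enlarge the image five times so the digits are easier to recognise
im = Image.open("temp.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("temp2.png")

# Preview the original with 30% more contrast (for inspection only;
# the enhanced image is not used in the steps below)
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")

# Remove background noise: every pixel that is not pure black becomes white
imgx = Image.open('temp2.png')
imgx = imgx.convert("RGBA")
pix = imgx.load()
for y in xrange(imgx.size[1]):
    for x in xrange(imgx.size[0]):
        if pix[x, y] != (0, 0, 0, 255):
            pix[x, y] = (255, 255, 255, 255)
imgx.save("bw.gif", "GIF")

# Resize to a fixed size and save as TIFF for the OCR engine
original = Image.open('bw.gif')
bg = original.resize((116, 56), Image.NEAREST)
ext = ".tif"
bg.save("input-NEAREST" + ext)

# Run OCR on the cleaned image and print the recognised text
image = Image.open('input-NEAREST.tif')
print image_to_string(image)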

Here I am only posting the code for the OCR part. If you are a Python lover like me, you can use the "httplib" Python module to do the rest. This script is not independent: the pytesser Python module is required to run it. PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script.

You can get this package at http://code.google.com/p/pytesser/
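
To illustrate what "calling the Tesseract executable as an external script" means, here is a rough Python 3 sketch of the same idea without PyTesser. It is not PyTesser's actual code. It assumes a `tesseract` binary on the PATH that follows the usual `tesseract <image> <output base>` command form, where the recognised text is written to `<output base>.txt`; the function names are made up for this example.

```python
import os
import subprocess
import tempfile


def tesseract_command(image_path, output_base):
    # Tesseract appends ".txt" to the output base by itself.
    return ["tesseract", image_path, output_base]


def image_file_to_string(image_path):
    # Run the external program, then read back the text file it produced.
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "ocr_result")
        subprocess.run(tesseract_command(image_path, base), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        with open(base + ".txt", encoding="utf-8") as handle:
            return handle.read().strip()


# Usage, once Tesseract is installed:
# print(image_file_to_string("input-NEAREST.tif"))
```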

The script works in this way:
1) First the script downloads the captcha image using the "urlretrieve" function of Python's urllib module.
2) Then it enlarges the image and cleans up the background noise so that the digits are easier to recognise.
3) At last it feeds the processed image to the OCR engine.
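
The clean-up step in the script keeps only pixels that are exactly black. After bicubic enlargement the edges of the digits tend to be dark grey rather than pure black, so a brightness threshold can be more forgiving. Below is a small sketch of that variation, written for plain lists of RGB tuples so that it runs without PIL; the function name and the threshold of 100 are arbitrary choices of mine, not values from the script above.

```python
def binarize(rows, threshold=100):
    """Turn dark pixels black and everything else white.

    rows is a list of rows, each a list of (r, g, b) tuples.
    """
    black, white = (0, 0, 0), (255, 255, 255)
    cleaned = []
    for row in rows:
        new_row = []
        for r, g, b in row:
            brightness = (r + g + b) / 3  # simple average of the channels
            new_row.append(black if brightness < threshold else white)
        cleaned.append(new_row)
    return cleaned


sample = [[(0, 0, 0), (40, 40, 40), (200, 180, 190)]]
print(binarize(sample))  # [[(0, 0, 0), (0, 0, 0), (255, 255, 255)]]
```

With PIL, the same brightness test could replace the exact comparison against (0, 0, 0, 255) inside the pixel loop.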
Here is another Python script which is very useful while testing captchas. You can add these lines to your script if the target captcha image is too small. This script lets you change the resolution of any image.


from PIL import Image
from PIL import ImageEnhance

# Enlarge the image five times with bicubic interpolation
im = Image.open("test.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("final_pic.png")

# Preview the original image with 30% more contrast
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")

Thanks for reading. I hope it was helpful. Feel free to share and drop comments.

19 comments:

  1. Really nice! I was looking for that!

    I will surely test it out!

  2. Nice work mate! Trying out the same this weekend!

  3. Great research and nice way to tell

  4. Replies
    1. I have tested this with very easy one! similar to this one

      https://lh4.ggpht.com/ZAAXYW2mlL0L0Ys7bbBSMyCGJwcUL1urk59a9Dy3fchDb__W-igiIW4ua-Y2bSbuyNfuag=s71

      and it was almost 100% accurate!

      I tried it on these and got 0% ))
      https://dl.dropbox.com/u/59666091/1.png
      https://dl.dropbox.com/u/59666091/2.png

    3. Maybe you can help me with getting the symbols more in line (not following a sine wave) and also with doing something about the background? Thank you. I will wait for your answer.

  5. with
    https://lh4.ggpht.com/ZAAXYW2mlL0L0Ys7bbBSMyCGJwcUL1urk59a9Dy3fchDb__W-igiIW4ua-Y2bSbuyNfuag=s71
    it gives me result = I bra

  7. If somebody needs only digit recognition in pytesser, feel free to see my solution: http://ppiotrow.blogspot.com/2013/01/pytesser-only-digits-recognition.html

  9. Hey!

    I used your results in order to break (not very efficiently) hard CAPTCHAs (Source #2):

    http://bokobok.fr/bypassing-a-captcha-with-python/

  10. Hello Everyone,

    I tried your code but it is not able to recognize such captcha:
    http://i46.tinypic.com/2mxiexv.jpg
    http://i49.tinypic.com/n53lth.jpg

    I will appreciate your answers.

  13. Hi Mandal,
    first I have to note that I'm new to Python. I tried your code, and had to make a few modifications to get it working with the particular captcha I'm using. I can post the code, because in my personal opinion it works much better. The problem I have is with the httplib part. Once I've decoded the captcha, I cannot find a way to make the site believe the request came from the original source (I'm using it to log in to a website that has a 10 min inactivity logout policy, while the login has a lot of queries that need to be typed manually).
    Anyway, your code was very helpful, and a great starting point.
    Thanks,
    M.Zinovic

  14. Hi,
    the captcha that i am trying to break is http://www.afreesms.com/image.php
    It's an easy 7-letter code, always with the same type of letter, color and size. My problem is: I am a noob. I don't know what I must do in order to get this working. If someone could help, that would be great.

    thanks
