Friday, March 9, 2018

The Art of Large Scale Cumulative Binary Diffing

I've been performing patch diffing on various windows based softwares for past couple of years now. Patch diffing is a process of comparing two binary builds of the same code – a known vulnerable one and the one containing the security fix for a vulnerability. Bindiff is very popular among security researchers because it is generally used to gather information about patched security vulnerabilities, find root causes and vectors.

In 2017 December I've delivered a talk at BlackHat Europe where I've showcased many patched vulnerabilities in VMWare workstation, mostly identified using binary patch diffing different releases of VMWare workstation.

Performing BinDiffing on a huge software is very challenging specially when affected components are not known. Let's take the example of VMWare workstation. If we unpack VMWare workstation installer we find nearly 300 binary files. Among these huge number of binary files, finding the components on which security fixes were applied is very difficult and time consuming. Especially for my VMWare research I had to do this on a cumulative basis because I had to analyze each and every released security patches of last one year. Following screenshots lists out VMWare workstation releases and their version details.

To be able to identify the components modified in each workstation release and to understand the modification logic, we had to perform binary diffing like this


Performing binary diffing on each and every patch released by vmware for last one year was indeed very tedious. So I thought of automating the whole process. In this blog post I'm going briefly explain the way it was done and also release the source code of the python library I've written to automate this. The same code can be re-used (with little modification) to perform cumulative binary diffing on any windows based software.

Manually Unpacking The Installer

There were some manual efforts involved initially. Manually unpacking the installer files was the first step. For vmware workstation it can be done using following command 

VMware-workstation-full-xxxx-xxxx.exe /extract "folder path"

For other softwares files can be directly taken from "Program Files" folder.

After the installer files are unpacked they are kept in an organized way in same directory. Point to be noted here directory structure of all the unpacked file should be same like this because these paths would be accessed the the Mass binary diffing program.

C:\MassDiffing\VMware-workstation-full-12.5.0.exe\x64\vmware-vmx.exe
C:\MassDiffing\VMware-workstation-full-12.5.1.exe\x64\vmware-vmx.exe
C:\MassDiffing\VMware-workstation-full-12.5.2.exe\x64\vmware-vmx.exe

Introducing MassDiffer

MassDiffer is small python program developed to automate the tedious process of cumulative binary diffing. This script has few dependencies 
  1. IDA Pro idaq.exe & idaq64.exe
  2. Zynamics Binexport
  3. BinDiff  - differ.exe & differ64.exe

This is exactly how the MassDiffer works,

1. Path of two directories containing unpacked installer files (old release and new release) are passed to the program for example C:\MassDiffing\VMware-workstation-full-12.5.0.exe\ and C:\MassDiffing\VMware-workstation-full-12.5.1.exe\

2. The program traverses two directories, tries to find out files which can have binary code and create a mapping between the old version and new version of the file.




3. After the mapping is created it checks if the binary image version has changed in the new release or not. If binary images are identical in both the releases nothing will be done to those files.



4. Next, if it finds any change in binary image file version it process them. First it generates set of IDB files for both old and new version.

5. From the IDB file it generates BinExport file with help of zynamics binexport.

6. From BinExport files it generates BinDiff file. Point to be noted here BinDiff files are nothing but a sqlite database, where all the diffing results are stored. Simple sqlite query can be made to extract information from the BinDiff file about the modified function in two binary files.

7. Next step, BinDiff database file is parsed and modified function details and addresses are extracted from BinDiff database.

8. Once these information are gathered a MS Excel report is prepared which gives a visual representation of the modified components. Following screenshot shows a sample MassDiff output for VMware-workstation version 12.5.1 and 12.5.2 release. As it can be easily pointed out that in this release VMWare had only applied fix to two binary files they are vmware-vmx.exe and vmwarebase.dll and only few functions in these binary files were modified. This reduces a lot of repetitive work.



Source Code :

Source code is available at https://github.com/debasishm89/MassDiffer
The program was developed and tested on IDA Pro Version 6.5 + Bindiff version 4.2.

Thanks for reading. Hope you've enjoyed :) 

Sunday, December 10, 2017

BlackHat Europe 2017 : THE GREAT ESCAPES OF VMWARE

This December 2017 Me & my colleague Yakun Zhang delivered a talk at Blackhat Europe 2017 Briefings on VMWare escapes. Blackhat Europe is an annual information security conference, scheduled on December 4 2017 to December 7, 2017, in the ExCeL London, located at 1 Western Gateway, London E16 1XL.We have talked about reverse engineering of vmware, attacking hypervisor isolation and some virtual machine escape attacks against vmware.


Talk Abstract

Virtual machine escape is the process of breaking out of the virtual machine and interacting with the host operating system. VMWare recently fixed several bugs in their products that were allowing malicious code to escape sandbox. Some of these issues were exploited and reported during exploitation contest and while others reported individually by researchers. For very obvious reason details of this bugs are undisclosed. This paper presents a case study of VMWare VM escape vulnerabilities based on the analysis of different patches released by VMWare in recent past. 

Looking at the advisories published by VMWare in the last few months, reveals that there are many surfaces, that are being targeted by security researchers. To summarize, the attack surfaces would be as follows: 

A) RPC Request handler.
B) Virtual Printer.
C) VMWare Graphics Implementation.

Talking about vulnerabilities fixed in VMWare RPC layer, we see several CVEs (CVE-2017-4901, CVE-2016-7461 etc.) fixing security issues in RPC layers. This talk will cover end to end RPC implementation in VMWare workstation. It will cover everything from VMWare Backdoor in guest OS to different RPC command handler in host OS. We will uncover some of these fixed bugs in VMWare RPC layer by performing binary diffing on VMWare Workstation binaries. This talk will also showcase some of the PoCs developed from different VMware workstation patches.


VMWare's EMF file handler is one of most popular attack surfaces, when it comes to guest to host escape. VMSA-2016-0014 fixed several security issues in EMF file handling mechanism. EMF format is composed of many EMR data structures. TPView.dll parses every EMR structure in EMF file. In VMware, COM1 port is used by Guest to interact with Host printing proxy. EMF files are spool file format used in printing by windows. When a printing EMF file request comes from Guest, in host TPView.dll render the printing page. The TPView.dll holds the actual code which parses the EMF file structures. In our talk, we will be diving deep into this attack surface & uncover some of the vulnerabilities fixed in this area recently by performing binary diffing on VMWare work station binaries.


VMSA-2017-0006 resolved several security vulnerabilities in Workstation, Fusion graphics implementation which allows Guest to Host Escape. These vulnerabilities were mostly present in VMWare SVGA implementation. In this section of our talk we will cover implementation of VMWare virtual GPU through reverse engineering different guest components (vmx_fb.dll - VMware SVGA II Display Driver, vmx_svga.sys - VMware SVGA II Miniport) as well as host component (vmware-vmx.exe) where virtualize GPU code exist. The VMware virtual GPU provides several memory ranges which is used by Guest OS to communicate with the emulated device. These memory ranges are 2D frame buffer and FIFO Memory Queue. In FIFO memory queue, we write command that we want our GPU to process. The way VMWare handles and process these commands is error prone. This talk will uncover some of these bugs in SVGA command processing code and try to understand anatomy of issues by bin-diffing through VMWare binaries.

Slides:


References:


  • https://www.blackhat.com/eu-17/briefings.html#the-great-escapes-of-vmware-a-retrospective-case-study-of-vmware-g2h-escape-vulnerabilities
  • https://www.blackhat.com/docs/eu-17/materials/eu-17-Mandal-The-Great-Escapes-Of-Vmware-A-Retrospective-Case-Study-Of-Vmware-G2H-Escape-Vulnerabilities.pdf



Sunday, October 8, 2017

BruCON 2017: Browser Exploits? Grab ’em by the Collar!

This year in October I have delivered a talk about reliable detection of Browser Exploit at Brucon 2017. BruCON is an annual security and hacker(*) conference providing two days of an interesting atmosphere for open discussions of critical infosec issues, privacy, information technology and its cultural/technical implications on society. Organized in Belgium, BruCON offers a high quality line up of speakers, security challenges and interesting workshops.


Talk Abstract:


APT has become a hot topic in enterprise IT today. One of the softwares that we see becomes victim of APT attack more often is web browsers and the attack surface is becoming bigger and bigger every day.

TCP Live Stream Injection (https://en.wikipedia.org/wiki/Packet_injection) is a technique that we have seen, is being abused by various Internet Service Providers, Router vendors for decades. We have seen in the past, using this technique ISPs, router vendors intercepts HTTP traffic and inject arbitrary data silently into HTTP responses. This is usually done by injecting arbitrary JavaScript code into actual HTTP response body in real time. When the injected JavaScript code reaches client browser it performs various operations such as loading advertisements, information gathering etc.

This paper presents a generic browser exploit detection technique that uses the same Live Network Stream Code Injection technique to reliably catch browser exploits. The detection system can be considered as completely agent less and capable of detecting various techniques, used in modern browser exploitation. Unlike any other Host Based Intrusion Prevention Systems, to be able to generically detect and block browser exploits, no OS API hooking, dll injection or code injection is required in browser process.


Slides & Video Demos:


https://github.com/debasishm89/Brucon0x09

Talk Video:





References:


  • https://brucon0x092017.sched.com/event/BEBO/browser-exploits-grab-them-by-the-collar
  • https://2017.brucon.org/index.php/Debasish_Mandal

Saturday, May 20, 2017

OpenXMolar - A MS OpenXML Format Fuzzing Framework

i) OpenXMolar v 1.0

alt text
OpenXMolar is a Microsoft Open XML file format fuzzing framework, written in Python.

ii) Motivation Behind OpenXMolar

MS OpenXML office files are widely used and the attack surface is huge, due to complexity of the softwares that supports OpenXML format. Office Open XML files are zipped, XML-based file format. I could not find any easy to use OpenXML auditing tools/framework available on the internet which provides software security auditors a easy to use platform using which auditors can write their own test cases and tweak internal structure of Open XML files and run fuzz test (Example : Microsoft Office).

Hence OpenXMolar was developed, using which software security auditors can focus, only on writing test cases for tweaking OpenXML internal (XML and other ) files and the framework takes care of rest of the things like unpacking, packing of OpenXML files, Error handling, etc.

iii) Dependencies

OpenXMolar is written and tested on Python v2.7. OpenXMolar uses following third party libraries

winappdbg / pydbg

Debugger is an immense part of any Fuzzer. Open X-Molar supports two python debugger, one is winappdbg and another is pydbg. Sometimes installing pydbg on windows environment can be painful, and pydbg code base is not well maintained hence winappdbg support added to Open X-Molar. Its recommended that user use winappdbg.

pyautoit

Since we feed random yet valid data into target application during fuzzing, target application reacts in many different ways. During fuzzing the target application may throw different errors through different pop-up windows. To continue the fuzzing process, the fuzzer must handle these pop-up error windows properly. OpenXMolar uses PyAutoIT to suppress different application pop-up windows. PyAutoIt is Python binding for AutoItX3.dll

crash_binning.py

crash_binning is part of sulley framework. crash_binning.py is used only when you've selected pydbg as debugger. crash_binning.py is used to dump crash information. This is only required when you are using pydbg as debugger.

xmltodict

This is not core part of the Open X-Molar. The XML String Mutation module (FileFormatHandlers\xmlHandler.py) was written using xmltodict library.

iv) Architecture:

On a high level, OpenXMolar can be divided into few components.

OpenXMolar.py

This is the core component of this Tool and responsible for doing many important stuffs like the main fuzzing loop.

OfficeFileProcessor.py

This component mostly handles processing of OpenXML document such as packing, unpacking of openxml files, mapping them in memory, converting OpenXML document to python data structures etc.

PopUpKiller.py - PopUp/Error Message Handlers :

This component suppresses/kills unwanted pop-ups appeared during fuzzing.

FileFormatHandlers//

An OpenXML file may contain various files like XML files, Binary files etc. FileFormatHandlers are basically a collection of mutation scripts, responsible for handling different files found inside an OpenXML document and mutate them.

OXDumper.py

OXDumper.py decompresses OpenXML files provided in folder "OpenXMolar\BaseOfficeDocs\OpenXMLFiles" and output a python list of files present in the OpenXML file. OXDumper.py accepts comma separated file extensions. OXDumper.py is useful when you are targeting any specific set of files present in any OpenXML document.

crashSummary.py

crashSummary.py summarizes crashes found during fuzzing process in tabular format. The output of crashSummary.py should look like this:

alt text

v) Configuration File Walk through


The default configuration file 'config.py' is very well commented and explains all of its parameters really well. Please review the default config.py file thoroughly before running the fuzzer to avoid unwanted errors.

vi) Writing your Open XML internal File Mutation Scripts:


As said earlier, an OpenXML file package may contain various files like XML files, Binary files etc. FileFormatHandlers are basically a collection of mutation scripts, responsible for handling different files found inside an OpenXML document and mutate them. Generating effective test cases is the most important step in any fuzz testing process.

The motive behind OpenXMolar was to provide security auditors an easy & flexible platform on which fuzz tester can write their own test cases very easily for OpenXML files. When it comes to effective OpenXML format fuzzing, the main part is how we mutate different files (*.xml, *.bin etc) present inside OpenXML package (zip alike). To give users an idea of how file format handlers are written, two file format handlers are provided with this fuzzer, however they are very dumb in nature and not very effective.

Any file format handler module should be of following structure

# Import whatever you want.
class Handler():# The class name should be always 'Handler'
 def __init__(self):
  pass
 def Fuzzit(self,actual_data_stream): 
  # A function called Fuzzit must be present in Handler class
  # and it should return fuzzed data/xml string/whatever.
  # Note: Data type of actual_data_stream and data_after_mutation should always be same.

  return data_after_mutation

Once your file format handler module is ready you need to place the *.py file in FileFormatHandlers// folder and add the handler entry and associated file extension in config.py file like this :

FILE_FORMAT_HANDLERS = {'xml':'xmlHandler.py',
      'bin':'BinaryHandler.py',
      'rels':'xmlHandler.py',
      'vml':'xmlHandler.py'
      }


vii)Adding More POPUP / Errors Windows Handler

The default PopUpKiller.py file provided with Open X-Molar, is having few most occurred pop up / error windows handler for MS Word, MS Excel & Power Point. Using AutoIT Window Info tool (https://www.autoitscript.com/site/autoit/downloads/) you can add more POPUP / Errors Windows Handlers into 'PopUpKiller.py'. One example is given below.
alt text

So to be able to Handle the error pop up window shown in screen shot, following lines need to be added in : PopUpKiller.py

if "PowerPoint found a problem with content"  in autoit.win_get_text('Microsoft PowerPoint'):
 autoit.control_click("[Class:#32770]", "Button1")

viii)The First Run

This fuzzer is well tested on 32 Bit and 64 Bit Windows Platforms (32 Bit Office Process). All the required libraries are distributed with this fuzzer in 'ExtDepLibs/' folder. Hence if you have installed python v2.7, you are good to go.

To verify everything is at right place, better to run Open X-Molar with Microsoft Default XPS Viewer first time(C:\Windows\System32\xpsrchvw.exe). Place any *.oxps file in '\BaseOfficeDocs\OpenXMLOfficeFiles' and run OpenXMolar.py.

OpenXMolar.py accepts one command line argument which is the configuration file.


C:\Users\John\Desktop\OpenXMolar>python OpenXMolar.py config.py

[Warning] Pydbg was not found. Which is required to run this fuzzer. Install Pydbg First. Ignore if you have winappdbg installed.

   ____                    __   ____  __       _
  / __ \                   \ \ / /  \/  |     | |
 | |  | |_ __   ___ _ __    \ V /| \  / | ___ | | __ _ _ __
 | |  | | '_ \ / _ \ '_ \    > < | |\/| |/ _ \| |/ _` | '__|
 | |__| | |_) |  __/ | | |  / . \| |  | | (_) | | (_| | |
  \____/| .__/ \___|_| |_| /_/ \_\_|  |_|\___/|_|\__,_|_|
        | |
        |_|
        An MS OpenXML File Format Fuzzing Framework.
        Author : Debasish Mandal (twitter.com/debasishm89)

[+] 2017:05:05::23:11:23 Using debugger :  winappdbg
[+] 2017:05:05::23:11:23 POP Up killer Thread started..
[+] 2017:05:05::23:11:24 Loading base files in memory from :  BaseOfficeDocs\UnpackedMSOpenXMLFormatFiles
[+] 2017:05:05::23:11:24 Loading File Format Handler for extension :  xml => xmlHandler.py
[+] 2017:05:05::23:11:24 Loading File Format Handler for extension :  rels => xmlHandler.py
[+] 2017:05:05::23:11:24 Loading File Format Handler Done !!
[+] 2017:05:05::23:11:24 Starting Fuzzing
[+] 2017:05:05::23:11:25 Temp cleaner started...
[+] 2017:05:05::23:11:25 Cleaning Temp Directory...
...
...

ix) Open X-Molar in Action

Here is a very short video on running fuzztest on MS Office Word:

https://www.youtube.com/watch?v=b7n1tuFDl5A

x) Fuzzing Non-OpenXML Applications :

Due to the flexible structure of the fuzzer, this Fuzzer can also be used to fuzz other windows application. You just need do following :

  • In config.py add the target application binary (exe) and extension in APP_LIST of config.py
  • In config.py change OpenXMLFormat to False
  • Write your own File format mutation handler and place it in FileFormatHandlers/ folder
  • Add the newly added FileFormatHandler in FILE_FORMAT_HANDLERS of config.py
  • Provide some base files in folder OtherFileFormats/
  • Add custom error / popup windows handler in PopUpKiller.py using Au3Info tool if required.And you're good to go.

xi) Few More Points about OpenXMolar:

Fuzzing Efficiency: To maximize fuzzing efficiency OpenXMolar doesn't read the provided base files again and from disk. While starting up, it loads all base files in memory and convert them into easy to manage python data structures and mutate them straight from memory.

Auto identification of internal files of OpenXML package : An Open XML file package may contain various files like XML files, Binary files etc. OpenXMolar has capability to identify internal file types and based that chooses mutation script and mutate them. Please refer to the default config.py file (Param : AUTO_IDENTIFY_INTERNAL_FILE_FORAMT) for details.

xii) TODO

Improve Fuzzing Speed
New Feature / Bugs -> https://github.com/debasishm89/OpenXMolar/issues

xiii) License

This software is licensed under New BSD License although the following libraries are included with Open X-Molar and are licensed separately.

xiv) Source 


The source code is available here : https://github.com/debasishm89/OpenXMolar




Thursday, December 24, 2015

IEFuzz - A Static Internet Explorer Fuzzer

Today I'm sharing an IE Fuzzer, which was developed almost from scratch. Like many other softwares, browsers can also be fuzzed in two ways, a) Static and b) Dynamic.

Dynamic browser fuzzers are very popular, due to its speed, since they are purely written in JavaScript. However one common problem software security auditors face, while fuzzing browser dynamically, is 'Crash Reproduction'. You have to very careful while crafting your JS browser fuzzer (by placing logging code in right place), otherwise crash will not be reproducible.

Another option is, Static fuzzer. If you are fuzzing browsers using Static Test Cases, in 99% cases 'A crash' == 'A reproducible crash'.

How does 'IEFuzz' work?

  1. Launch IE
  2. Attach 'iexplore.exe' to debugger(pydbg) - To monitor any type of crash(Both in parent and child process).
  3. Generate a test case (html + javascript).
  4. Load the test case locally as file (file://c:/fuzzer/testcases/temp.html)in IE using win32COM.
  5. If no crash, re-generate a html test case and reload the test case using win32COM.(Note, we are not closing, re-opening IE here, We are just refreshing the same page but code/content of the page is different in every time. Which saves time significantly )
  6. In case of any kind of access violation, copy/save the test case to separate folder,  and kill IE completely.
  7. Go to step 1

This Static IE fuzzer is written in python. And following modules were used.

  1. pywin32com - Load / Reload *.html Test Cases
  2. pydbg - Monitor IE for Access Violation / Guard Page Violation.
  3. paimei - For crash dump generation.

Required Configuration Changes in IE

To run this Fuzzer you have to make following changes in IE: 

1. Since this fuzzer loads the test cases locally (eg. file://c:/fuzzer/testcases/temp.html) as .html file.
You must turn off IE's ActiveX warning prompt by following below instructions.

Tools (menu) -> Internet Options -> Security (tab) -> Custom Level (button) -> Disable Automatic prompting for ActiveX controls.

2. You also need to disable IE protected mode to be able to control Internet Explorer using Python 'win32com'. Please be aware of the risks.

 -> Internet Options -> Security -> Trusted Sites    : Low
 -> Internet Options -> Security -> Internet         : Medium + unchecked Enable Protected Mode
 -> Internet Options -> Security -> Restricted Sites : unchecked Enable Protected Mode

Writing Test Cases:

You can write you own static test case generator for this fuzzer in python. You have to place it inside /TestCases folder. For your reference one sample is given here 'TestCases/SampleTestCase.py'. While writing test cases do remember, it should have a 'TestCase' class and 'getFinalTestCase()' method in it. This getFinalTestCase() method should return the entire html page. 

In case of dynamic fuzzer, attributes of different html elements extracted from object and fuzzed on the fly at runtime , since its a static fuzzer we can pre define html elements and their attributes our test case as python dict.

attr = {'CANVAS':['height','width','getContext', ... , ... , ... ]}

For this attribute list generation, one JavaScript application is provided here : MiscTools/Generate_Elements_Dict.html

Source Code:

Source code of IEFuzz is available for download @ my github page.

Licence:

This software is licenced under BEER WARE licence although the following libraries are included with 'IEFuzz' and are licensed separately.

Running This Fuzzer:


One video demo is available here, on how to run this fuzzer and reproduce crashes.



Happy Fuzzing :) :)

Tuesday, February 17, 2015

Walking Heap Using Pydbg

I'm a big fan of Pydbg. Although it has many awesome features , it also has few limitations. One of them is lack of control over process heap. For a long time I'm thinking of writing something which makes Heap Manipulation / Heap parsing / Traversing using pydbg little easier for reverse engineers. So finally last weekend I wrote couple of small py scripts which can parse Windows 7 process heaps on the fly.
In this blog post I'm going to share one of them.

This is the simplest implementation of HeapWalk() API based on pydbg. Heap walk API enumerates the memory blocks in the specified heap. If you are not very familiar with HeapWalk() API this page has a very good example in C++.

Right now best available tool available for heap analysis is windbg. The script I'm going to share  does something similar to windbg's "!heap -a 0xmyheaphandle" command.




You can use the function HeapWalk() [@ Line 103] as break point hander in your pydbg script. In below example actually I did something similar.

First I'm running an application (on 32 bit Windows 7) which uses user32!MessageBoxA API somewhere.

After that I'm attaching my pydbg script with that process and setting up a break point at user32!MessageBoxA and also setting up HeapWalk() as the breakpoint handler.

Now whenever the application will make a call to MessageBoxA api our breakpoint handler HeapWalk() will be invoked and it will start traversing all the available process heap and their segments.

Script 1:


The output of this script will be something similar: https://gist.github.com/debasishm89/1264d7a6726b9e910a5d

Since this script will give you addresses of all all heap blocks and their size, now you should have more control over process heap. You should be able to search for string/data / byets / pointer in process heaps very easily.

Thank you for reading. Hope you've enjoyed :)

Cheers,

Wednesday, January 28, 2015

qHooK - Not Just a Win32 API Hooking Script

Hello everyone. Hope every one is doing good. After a long gap I'm about to post something. Sometimes its easier to build something on your own, than finding something similar which has already been developed in the past. I'm not sure whether any script / tool already present in the wild which does the same, but I definitely needed a tool / script, which can reduce efforts of analysing unknown exploits/ shellcode. I developed this tool one and half years back to mainly analyse shellcodes / exploits etc etc. Obviously when I wrote this there was no name, I just given this a name "qHooK" before writing this post :)

So what it does ?

Its very simple and straight forward python script (dependent on pydbg) which hooks user defined Win32 APIs in any process and prepare a CSV report with various interesting information which can  help reverse engineer to track down / analyse unknown exploit samples / shellcode. Please refer to demo video.

qHooK Final CSV Report:





Video Demo(With Voice):

I guess I've become pretty lazy so I'm not going to write each and every thing about this script. Just adding a video demo (with voice) which explains few real life scenarios, where it may help you. Sorry about my weak voice. My laptop mic sucks :( cant help.



Source Code:

https://github.com/debasishm89/qHooK

Saturday, July 5, 2014

Releasing Stupid v0.1 - The Dumbest File Format Fuzzer (Python+Pydbg)

I developed Stupid in late 2011 to automate fuzzing and problem/app fault detection process of different file formats( mainly Music/Video players etc). I've been receiving many email from my readers asking me to release POC of a python + pydbg fuzzer. So today I'm very happy to make this small yet effective Fuzzer open to everyone. This is highly prototypal and I recommend to rewrite/modify the test case generator sub routine to make this fuzzer more effective.


Happy fuzzing guys. If you are lucky enough to find any zero day using this fuzzer, you can drop me a TY email or buy me a beer in return if we meet someday :)


Source Code:


Stupid source code is available @   https://github.com/debasishm89/Stupid

Licence:






This software is licenced under a Beerware licence although the following libraries are included with Stupid and are licensed separately.


  • pydbg
  • paimei - https://github.com/pedramamini/paimei

Running this Fuzzer:

Stupid was developed and tested with Win32 Python 2.7(x86). So it's recommended to use the same version of python. Also make sure pydbg(x86) is installed on the system.

You need to provide the target application binary path (.exe) and at least one base file to run this fuzzer. You can to modify the configuration section of "stupid.py" as per your requirement.

Test Case Generation:


mutate() routine is responsible for generating test cases from given bases files. It has two sub parts:
  • Bitflip
  • Random Byte Flip

You may want to change / modify these routines to make this fuzzer more effective. ;)

Monitoring:


To monitor target application for different types of crashes (access violation), Stupid uses pydbg(Python debugger).It also uses "utils" of https://github.com/pedramamini/paimei framework to collect crash information which can be used later to identify/distinguish interesting app crashes. Sample crash synopsis file is below,




Reproducing Crashes:


Crash files and crash information can be found in "Crashes" folder which can be used to reproduce app crashes.

Saturday, April 19, 2014

Attacking Audio "reCaptcha" using Google's Web Speech API

I had a fun project months back, Where I had to deal with digital signal processing and low level audio processing. I was never interested in DSP and all other control system stuffs, But when question arises about breaking things, every thing becomes interesting :) . In this post i'm going to share one technique to fully/ partially bypass reCaptcha test. This is not actually a vulnerability but its better if we call it "Abuse of functionality".

Disclaimer : Please remember this information is for Educational Purpose only and should not be used for malicious purpose. I will not assume any liability or responsibility to any person or entity with respect to loss or damages incurred from information contained in this article.

1. What is Captcha

A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot. The term CAPTCHA (for Completely Automated Public Turing Test To Tell Computers and Humans Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University.


2. What is Re-captcha

reCAPTCHA is a free CAPTCHA service by Google, that helps to digitize books, newspapers and old time radio shows. More details can be found here.


3. Audio reCaptcha

reCAPTCHA also comes with an audio test to ensure that blind users can freely navigate.

4. Main Idea: Attacking Audio reCaptcha using Google's Web Speech API Service





5. Google Web Speech API

Chrome has a really interesting new feature for HTML5 speech input API. Using this user can talk to computer using microphone and Chrome will interpret it. This feature is also available for Android devices. If you are not aware of this feature you can find a live demo here.

https://www.google.com/intl/en/chrome/demos/speech.html

I was always very curious about the Speech recognition API of chrome. I tried sniff the api/voice traffic using Wireshirk but this API uses SSL. :(.

So finally I started browsing the Chromium source code repo. Finally I found exactly what I wanted.

http://src.chromium.org/viewvc/chrome/trunk/src/content/browser/speech/

It pretty simple, First the audio is collected from the mic, and then it posts it to Google web service, which responds with a JSON object with the results.  The URL which handles the request is :

https://www.google.com/speech-api/v1/recognize

Another important thing is this api only accepts flac audio format.

6. Programatically Accessing Google Web Speech API(Python)

Below python script was written to send a flac audio file to Google Web Speech API and print out the JSON response.

./google_speech.py hello.flac


'''
Accessing Google Web Speech API using Pyhon
Author : Debasish Mandal

'''

import httplib
import sys

print '[+] Sending clean file to Google voice API'
f = open(sys.argv[1])
data = f.read()
f.close()
google_speech = httplib.HTTPConnection('www.google.com')
google_speech.request('POST','/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US',data,{'Content-type': 'audio/x-flac; rate=16000'})
print google_speech.getresponse().read()
google_speech.close()



7. Thoughts on complexity of reCaptcha Audio Challenges

While dealing with audio reCaptcha, you may know , it basically gives two types of audio challenges. One is pretty clean and simple (Example : https://dl.dropboxusercontent.com/u/107519001/easy.wav) . percentage of noise is very less in this type of challenges. 

Another one is very very noisy and its very difficult for even human to guess (Example : https://dl.dropboxusercontent.com/u/107519001/difficult.wav). Constant hissss noise and overlapping voice makes it really difficult to crack human. You may wanna read this discussion on complexity of audio reCapctha.

In this post I will mainly cover the technique / tricks to solve the easier one using Google Speech API. Although I've tried several approaches to solve the complex one, but as I've already said, its very very had to guess digits even for human :( .

8. Cracking the Easy Captcha Manually Using Audacity and Google Speech API

Google Re-captcha allows user to download audio challenges in mp3 format. And Google web speech API accepts audio in flac format. So if we just normally convert the mp3 audio challenge to flac format of frame rate 16000 its does not work :( .  Google Chrome Speech to text api does not respond to this sound.

But after some experiment and head scratching, it was found that we can actually make Google web speech api convert the easy captcha challenge to text for us, if we can process the audio challenge little bit. In this section i will show how this audio manipulation can be done using Audacity.

To manually verify that first I'm going to use a tool called Audacity to do necessary changes to the downloaded mp3 file. 

Step 1: Download the challenge as mp3 file.
Step 2: Open the challenge audio in Audacity.



Step 3: Copy the first digit speaking sound from main window and paste it in a new window. So here we will only have a one digit speaking sound.

Step 4: From effect option make it repetitive once. (Now It should speak the same digit twice).

Lets say for example if the main challenge is  7 6 2 4 6, Now we have only first digit challenge in wav format which having the digit 7 twice.





Step 5: Export the updated audio in WAV format.
Step 6: Now convert the wav file to flac format using sox tool and send it to Google speech server using the python script posted in section 6. And we will see something like this.

Note: In some cases little bit amplification might be required if voice strength is too low.

debasish@debasish ~/Desktop/audio/heart attack/final $ sox cut_0.wav -r 16000 -b 16 -c 1 cut_0.flac lowpass -2 2500
debasish@debasish ~/Desktop/audio/heart attack/final $ python send.py cut_0.flac 


Great! As you can see first digit of the audio challenge has been resolved by Google Speech. :) :) :) Now in same manner we can solve the entire challenge. In next section we will automate the same thing using python and it's wave module. 

9. Automation using Python and it's WAVE Module

Before we jump into processing of raw WAV audio using low level python API, its important to have some idea of how digital audio actually works. In above process we've extracted the most louder voices using audacity but to do it automatically using python, we must have some understanding of how digital audio is actually represented in numbers.

9.1. How is audio represented with numbers

There is an excellent stackoverflow post which explains the same. In short ,we can say audio is nothing but a vibration. Typically, when we're talking about vibrations of air between approximately 20Hz and 20,000Hz. Which means the air is moving back and forth 20 to 20,000 times per second. If somehow we can measure that vibration and convert it to an electrical signal using a microphone, we'll get an electrical signal with the voltage varying in the same waveform as the sound. In our pure-tone hypothetical, that waveform will match that of the sine function.

Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.  Arbitrarily, we'll change the scale on our volt meter. We'll multiple the volts by 32767. It now calls -1V -32767 and +1V 32767. Oh, and it'll round to the nearest integer.

Now after having a set of signed integers we can easily draw an waveform using the data sets.

X axis -> Time
Y axis -> Amplitude (signed integers)



Now, if we attach our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD. This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.

9.2. WAVE File Format walk through using Python

As an example lets open up a very small wav file with a hex editor.

  

9.3. Parsing the same WAV file using Python

The wave module provides a convenient interface to the WAV sound format. It does not support compression/decompression, but it does support mono/stereo. Now we are going to parse the same wav file using python wave module and try to relate what we have just seen in hex editor.

Let's write a python script:

import wave 
f = wave.open('sample.wav', 'r') 
print '[+] WAV parameters ',f.getparams() 
print '[+] No. of Frames ',f.getnframes() 
for i in range(f.getnframes()): 
    single_frame = f.readframes(1) 
    print single_frame.encode('hex') 
f.close()

Line 1 imports python wav module.
Line 2: Opens up the sample.wav file.
Line 3: getparams() routine returns a tuple (nchannels, sampwidth, framerate, nframes, comptype, compname), equivalent to output of the get*() methods.
Line 4: getnframes() returns number of audio frames.
Line 5,6,7: Now we are iterating through all the frames present in the sample.wav file and printing them one by one.
Line 8: Closes the opened file

Now if we run the script we will find something like this:

[+] WAV parameters (1, 2, 44100, 937, 'NONE', 'not compressed')
[+] No. of Frames 937
[+] Sample 0 = 62fe    <- Sample 1
[+] Sample 1 = 99fe   <- Sample 2
[+] Sample 2 = c1ff    <- Sample 3
[+] Sample 3 = 9000
[+] Sample 4 = 8700
[+] Sample 5 = b9ff
[+] Sample 6 = 5cfe
[+] Sample 7 = 35fd
[+] Sample 8 = b1fc
[+] Sample 9 = f5fc
[+] Sample 10 = 9afd
[+] Sample 11 = 3cfe
[+] Sample 12 = 83fe
[+] ....
and so on,

It should make sense now. In first line we get number of channels, sample width , frame/sample rate,total number of frames etc etc. Which is exact same what we saw in the hex editor (Section 9.2). From second line it stars printing the frames/sample which is also same as what we have seen in hex editor. Each channel is 2 bytes long because the audio is 16 bit. Each channel will only be one byte. We can use the getsampwidth() method to determine this. Also, getchannels() will determine if its mono or stereo.

Now its time to decode each and every frames of that file. They're actually little-endian. So we will now modify the python script little bit so that we can get the exact value of each frame. We can use python struct module to decode the frame values to signed integers.

import wave 
import struct 

f = wave.open('sample.wav', 'r') 
print '[+] WAV parameters ',f.getparams() 
print '[+] No. of Frames ',f.getnframes() 
for i in range(f.getnframes()): 
    single_frame = f.readframes(1) 
    sint = struct.unpack('<h', single_frame) [0]
    print "[+] Sample ",i," = ",single_frame.encode('hex')," -> ",sint[0] 
f.close()

This script will print something like this:

[+] WAV parameters (1, 2, 44100, 937, 'NONE', 'not compressed')
[+] No. of Frames 937
[+] Sample 0 = 62fe -> -414
[+] Sample 1 = 99fe -> -359
[+] Sample 2 = c1ff -> -63
[+] Sample 3 = 9000 -> 144
[+] Sample 4 = 8700 -> 135
[+] Sample 5 = b9ff -> -71
[+] Sample 6 = 5cfe -> -420
[+] Sample 7 = 35fd -> -715
[+] Sample 8 = b1fc -> -847
[+] Sample 9 = f5fc -> -779
[+] Sample 10 = 9afd -> -614
[+] Sample 11 = 3cfe -> -452
[+] Sample 12 = 83fe -> -381
[+] Sample 13 = 52fe -> -430
[+] Sample 14 = e2fd -> -542

Now what we can see we have a set of positive and negative integers. Now you should be able to connect the dots. What I have explained in section 9.1. 

So now if we plot the same positive and negative values in a graph will find complete wave form. Lets do it using python matlab module.

import wave 
import struct 
import matplotlib.pyplot as plt 

data_set = [] 
f = wave.open('sample.wav', 'r') 
print '[+] WAV parameters ',f.getparams() 
print '[+] No. of Frames ',f.getnframes() 
for i in range(f.getnframes()): 
    single_frame = f.readframes(1)
    sint = struct.unpack('<h', single_frame)[0]
    data_set.append(sint) 
f.close() 
plt.plot(data_set) 
plt.ylabel('Amplitude')
plt.xlabel('Time') 
plt.show()

This should form following graph

Now you must be familiar with this type of graph. This is what you see in SoundCloud, But definitely more complex one.

So now we have clear understanding of how audio represented in numbers. Now it will be easier for readers to understand how the python script ( shared in section 9.3 ) actually works.

9.3. Python Script

In this section we will develop a script which automate the steps we did using Audacity in Section 8. Below python script will try extract loud voices from input wav file and generate separate wav files.



Once the main challenge is broken into parts we can easily convert it to flac format and send each parts of the challenge to Google speech API using the Python script shared in section 6.

9.4. Demo:



10. Attempt to Crack the Difficult(noisy) audio challenge

So we have successfully broken down the easy challenge.Now its time to give the difficult one a try. So I started with one noisy captcha challenge. You can see the matlab plot of the same noisy audio challenge below.

In above figure we can understand presence of a constant hisss noise. One of the standard ways to analyze sound is to look at the frequencies that are present in a sample. The standard way of doing that is with a discrete Fourier transform using the fast Fourier transform or FFT algorithm. What these basically in this case is to take a sound signal isolate the frequencies of sine waves that make up that sound.

10.1. Signal Filtering using Fourier Transform

Lets get started with a  simple example. Consider a signal consisting of a single sine wave, s(t)=sin(w∗t). Let the signal be subject to white noise which is added in during measurement, Smeasured(t)=s(t)+n. Let F be the Fourier transform of S. Now by setting the value of F to zero for frequencies above and below w, the noise can be reduced. Let Ffiltered be the filtered Fourier transform. Taking the inverse Fourier transform of Ffiltered yields Sfiltered(t). 

The way to filter that sound is to set the amplitudes of the fft values around X Hz to 0. In addition to filtering this peak, It's better to remove the frequencies below the human hearing range and above the normal human voice range. Then we recreate the original signal via an inverse FFT.

I have written couple of scripts which successfully removes the constant hiss noise from the audio file but main challenge is the overlapping voice. Over lapping voice makes it very very difficult even for human to guess digits. Although I was not able to successfully crack any of given difficult challenges using Google Speech API still I've shared few noise removal scrips (using Fourier Transform). 

These scripts can be found in the GitHub project page. There is tons of room for improvement of all this scripts.

11. Code Download

Every code I've written during this project is hosted here:  

12. Conclusion

When I reported this issue to Google security team, they've confirmed that, this mechanism is working as intended. The more difficult audio patterns are only triggered only when abuse/non-human interaction is suspected. So as per the email communication noting is going to be changed to stop this.

Thanks for reading. I hope you have enjoyed. Please drop me an email/comment in case of any doubt and confusion.

13. References

http://rsmith.home.xs4all.nl/miscellaneous/filtering-a-sound-recording.html
http://www.topherlee.com/software/pcm-tut-wavformat.html
http://exnumerus.blogspot.in/2011/12/how-to-remove-noise-from-signal-using.html
http://www.swharden.com/blog/2009-01-21-signal-filtering-with-python/