Breaking simple CAPTCHA using Tesseract with Java
CAPTCHA is a very popular way to make login or any other action more secure and to provide basic protection against brute-force attacks. However, automated tests sometimes require developers to break the CAPTCHA as well. So I will introduce a very popular open-source OCR library, available in Java as well as other programming languages, called “Tesseract”.
This tutorial uses only a very basic form of OCR (Optical Character Recognition), which lets us transform an image into text (digits or letters). To make the result more accurate, we need to clean the unwanted parts of the image before passing it to the Tesseract4J library.
Let's bootstrap a Java project using Gradle, then add Tesseract4J as a dependency library.
We divide the functionality into three main parts:
- Clean the CAPTCHA image to produce another, more readable CAPTCHA image.
- Transform the CAPTCHA image into text using the Tesseract4J library.
- Clean the text result into the final result, since OCR on the image may produce unwanted characters.
1. Cleaning the CAPTCHA image
We use a very simple grouping of pixels, treating pixel density as the sign of a character, to clean the unwanted background from the image. Any white pixel is considered background, while a pixel of any other color is either CAPTCHA text or unwanted background noise. To eliminate the unwanted background color, we measure the height or width of the colored area around a pixel and check whether it fills a roughly square region (8 x 8 pixels). This method eliminates thin background strokes, because their color density is very small. The detailed code for this simple algorithm is in the repository linked at the end of this post.
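The density idea above can be sketched roughly as follows. This is only a minimal illustration, not the code from the repository: the class name `CaptchaCleaner`, the method names, and the thresholds `MIN_INK` and `WHITE_LEVEL` are assumptions chosen for the example, and only the 8 x 8 window comes from the description above.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: names and thresholds are illustrative assumptions.
final class CaptchaCleaner {
    static final int WINDOW = 8;        // side of the square window, as in the text
    static final int MIN_INK = 12;      // assumed minimum of non-white pixels in the window
    static final int WHITE_LEVEL = 200; // assumed brightness above which a pixel counts as white

    private static final int WHITE = 0xFFFFFF;
    private static final int BLACK = 0x000000;

    // A pixel is "ink" when its average brightness is below WHITE_LEVEL.
    static boolean isInk(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < WHITE_LEVEL;
    }

    // Counts ink pixels in the WINDOW x WINDOW square around (cx, cy),
    // clipped to the image borders.
    static int inkAround(BufferedImage img, int cx, int cy) {
        int half = WINDOW / 2;
        int count = 0;
        for (int y = Math.max(0, cy - half); y < Math.min(img.getHeight(), cy + half); y++) {
            for (int x = Math.max(0, cx - half); x < Math.min(img.getWidth(), cx + half); x++) {
                if (isInk(img, x, y)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Returns a black-on-white copy in which ink pixels with a sparse
    // neighbourhood (thin lines, specks) are turned white.
    static BufferedImage clean(BufferedImage src) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                boolean keep = isInk(src, x, y) && inkAround(src, x, y) >= MIN_INK;
                out.setRGB(x, y, keep ? BLACK : WHITE);
            }
        }
        return out;
    }
}
```

With these assumed values, a line one pixel thick contributes at most 8 ink pixels to any window and is erased, while thicker character strokes survive. The ends of strokes may be slightly eroded, so the thresholds would need tuning for a real CAPTCHA.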
2. Transforming the CAPTCHA image into text
With Tesseract4J we need to define the path to the data. We can install the Tesseract library on our PC and then set the data path directly. After that, we can pass the BufferedImage as the source to the method that generates the text using OCR.
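A minimal sketch of this step is shown here, assuming the library referred to as Tesseract4J is the Tess4J wrapper (package `net.sourceforge.tess4j`). The class name `CaptchaOcr`, the method name `read`, and the choice of the `eng` language data are assumptions for illustration. The data path is a parameter because it depends on where Tesseract is installed on your machine.

```java
import java.awt.image.BufferedImage;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Hypothetical helper: assumes Tess4J is on the classpath and that
// Tesseract's language data is installed locally.
final class CaptchaOcr {

    // tessdataPath is the directory that contains the trained data files
    // (for example eng.traineddata); its location depends on the installation.
    static String read(BufferedImage image, String tessdataPath) throws TesseractException {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(tessdataPath); // where the trained data lives
        tesseract.setLanguage("eng");        // assumed: English data is enough for digits and letters
        return tesseract.doOCR(image);       // raw OCR output, may contain stray characters
    }
}
```

The returned string is the raw OCR output, so it may still contain spaces, line breaks, or punctuation that are not part of the CAPTCHA.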
3. Cleaning the result
Because the OCR may produce characters other than digits or letters, we need to clean the output text. A simple loop through the characters of the string is enough.
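The loop described above could look roughly like this. It is a sketch rather than the original code: the class name `OcrTextCleaner` and the method name `keepAlphanumeric` are made up for the example, and it assumes the CAPTCHA only contains ASCII digits and letters.

```java
// Hypothetical helper: keeps only ASCII digits and letters from the OCR output.
final class OcrTextCleaner {

    static String keepAlphanumeric(String raw) {
        StringBuilder cleaned = new StringBuilder();
        for (char c : raw.toCharArray()) {
            boolean digit = c >= '0' && c <= '9';
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (digit || letter) {
                cleaned.append(c); // drop spaces, line breaks, and punctuation
            }
        }
        return cleaned.toString();
    }
}
```

For a made-up raw output such as `" d9 9t-26\n"`, calling `OcrTextCleaner.keepAlphanumeric` would drop the spaces, the hyphen, and the line break and keep only the six CAPTCHA characters.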
Running the above functions combined as one main console application leads to this result: “d99t26”
Cool! Now we are done with breaking a very simple CAPTCHA image into text, so we can perform automated tasks such as an auto-login integration test or a stress test. The CAPTCHA itself is not very complicated, which is why we can break it with the functions above. More complicated CAPTCHAs require stronger and more complex image processing, or even machine learning, to perform this operation.
Here is the complete code for the above demonstration: https://github.com/engleangs/catpchachabreaker