Breaking a simple CAPTCHA using Tesseract with Java

CAPTCHA is a very popular feature for making login and other actions more secure, and it provides basic protection against brute-force attacks. However, automated tests sometimes require developers to break the CAPTCHA as well. In this post I will introduce a very popular open-source OCR library, available in Java as well as other programming languages, called “Tesseract”.

This tutorial uses only the most basic form of OCR (Optical Character Recognition), which lets us transform an image into text (digits or letters). To make the result more accurate, we first clean the unwanted noise out of the image and then pass the cleaned image to the Tesseract4J library.

Sample CAPTCHA from internet

Let's bootstrap a Java project using Gradle and add Tesseract4J as a dependency.

We divide the functionality into three main parts:

  1. Clean the CAPTCHA image and produce another, more readable CAPTCHA image.
  2. Transform the cleaned CAPTCHA image into text using the Tesseract4J library.
  3. Clean the text result into the final result, since the OCR output may contain unwanted characters.

1. Cleaning the CAPTCHA image
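As a minimal sketch of this cleaning step (not necessarily the exact method used in the original code), one common approach is to binarize the image: every pixel darker than a threshold becomes black and everything else becomes white, which removes light background noise. The class name `CaptchaCleaner`, the method `binarize` and the threshold value are assumptions made for illustration, and only the standard `java.awt.image.BufferedImage` API is used.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: turns a noisy CAPTCHA into a pure black-and-white image. */
final class CaptchaCleaner {

    /**
     * Returns a new image in which pixels darker than the threshold (0-255)
     * become black and all other pixels become white.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness.
                int luminance = (r * 299 + g * 587 + b * 114) / 1000;
                result.setRGB(x, y, luminance < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return result;
    }
}
```

A call such as `CaptchaCleaner.binarize(ImageIO.read(file), 128)` would be a starting point; the right threshold depends on the CAPTCHA and usually needs some trial and error.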

2. Transforming the CAPTCHA image into text
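The sketch below shows one way this step could be structured. It keeps the OCR call behind a small interface so that it compiles without the library on the classpath. `OcrEngine` and `CaptchaReader` are hypothetical names. The commented wiring assumes that the library referred to here as Tesseract4J is the Tess4J wrapper, with its `Tesseract` class and `doOCR(BufferedImage)` method; the `tessdata` path is also an assumption.

```java
import java.awt.image.BufferedImage;

/** Hypothetical seam around the OCR call so the engine can be swapped or stubbed. */
interface OcrEngine {
    String recognize(BufferedImage image) throws Exception;
}

/** Hypothetical step 2: hands the cleaned image to an OCR engine and returns the raw text. */
final class CaptchaReader {
    private final OcrEngine engine;

    CaptchaReader(OcrEngine engine) {
        this.engine = engine;
    }

    /** Returns the raw OCR output, or an empty string if the engine returns null. */
    String read(BufferedImage cleanedImage) {
        try {
            String raw = engine.recognize(cleanedImage);
            return raw == null ? "" : raw;
        } catch (Exception e) {
            // Wrap checked OCR failures so callers do not have to declare them.
            throw new IllegalStateException("OCR failed", e);
        }
    }
}

// Assumed wiring with Tess4J on the classpath (not compiled as part of this sketch):
//   net.sourceforge.tess4j.Tesseract tesseract = new net.sourceforge.tess4j.Tesseract();
//   tesseract.setDatapath("tessdata");   // folder containing eng.traineddata (assumed path)
//   tesseract.setLanguage("eng");
//   CaptchaReader reader = new CaptchaReader(image -> tesseract.doOCR(image));
```

Keeping the engine behind an interface also makes it easy to unit test the rest of the pipeline with a stub instead of a real OCR run.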

3. Cleaning the text result

Running the above functions combined in one main console application leads to this result: “d99t26”
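OCR output often carries stray whitespace, line breaks or punctuation picked up from leftover noise. A minimal sketch of this last step, assuming the CAPTCHA contains only ASCII letters and digits, is to strip every other character. `CaptchaTextCleaner` is a hypothetical name used for illustration.

```java
/** Hypothetical step 3: keeps only the characters a simple CAPTCHA can contain. */
final class CaptchaTextCleaner {

    /** Removes everything except ASCII letters and digits; null becomes an empty string. */
    static String clean(String rawOcrText) {
        if (rawOcrText == null) {
            return "";
        }
        return rawOcrText.replaceAll("[^A-Za-z0-9]", "");
    }
}
```

If the CAPTCHA is known to use a narrower alphabet (for example lowercase letters and digits only), the character class can be tightened accordingly.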

Running with result on console application

Cool! We are now done with turning a very simple CAPTCHA image into text, so we can perform automated tasks such as an auto-login integration test or a stress test. This CAPTCHA is simple enough that the functions above can break it. More complicated CAPTCHAs require stronger and more advanced image processing, or even machine learning, to perform this operation.

Here is the complete code for the above demonstration: