How to build Tesseract on Cygwin

by Paul Vorbach, 2014-02-20

Tesseract is the most accurate and most adaptable open source OCR engine I know of.

For my master's thesis, I needed to be able to change the inner workings of Tesseract. That’s why I had to compile it.

Cygwin is a set of GNU tools for Microsoft Windows which gives you a POSIX environment on Windows.

Here, I’ll document how to build Tesseract on Cygwin, because that is easier than building on MinGW or in Visual Studio and it is not documented on the Compiling wiki page.

Installing Cygwin

Download Cygwin from the download page (both 32-bit and 64-bit versions will work).
Run the installer.
Use C:\Cygwin or C:\Cygwin64 as root directory.
When you are asked to select the desired packages, set Base, Devel and Graphics to Install. You can Skip at least Publishing, Gnome and KDE, probably even more, in order to save time during installation. Leave all other packages at Default.
Continue the installation process until you are done.
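
Once the installation has finished, it can be worth checking that the build tools used in the rest of this guide are actually on your PATH. The following is a minimal sketch, not part of the original procedure: `missing_tools` is a hypothetical helper, and the list of tools is my assumption about what the later steps need.

```shell
# Hypothetical helper: print the names of the given commands that are
# not found on PATH, separated by spaces.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

# Assumed tool list; adjust it to your setup.
result=$(missing_tools gcc g++ make wget tar autoconf automake libtool)
if [ -n "$result" ]; then
    echo "Missing: $result (re-run the Cygwin installer and add them)"
else
    echo "All build tools found."
fi
```

Run this in a Cygwin Terminal. If anything is reported as missing, re-run the installer and select the corresponding packages.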

Installing Leptonica

In order to build Tesseract, we need to build Leptonica first.

Open a Cygwin Terminal.
Create a directory where you can build the library.
```
mkdir -p /opt/src && cd /opt/src
```

Get the source.

```
wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
```

Or use the latest source package from Leptonica’s downloads page. Extract it.

```
tar -xvf leptonica-1.70.tar.gz
cd leptonica-1.70
```

Since giflib is not available in Cygwin, we have to configure Leptonica without it.
```
./configure --without-giflib
```
Build and install it.
```
make
make install
make clean
```
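
Before moving on, you may want to confirm that the install step put Leptonica where the Tesseract build will look for it. This is a rough sketch under the assumption that you used the default prefix `/usr/local`; `has_leptonica` is a hypothetical helper, not something Leptonica ships.

```shell
# Hypothetical helper: succeed if a Leptonica library file and the main
# header are present under the given prefix (default: /usr/local).
has_leptonica() {
    prefix="${1:-/usr/local}"
    # Any liblept* file in lib/ counts (static, import, or libtool archive).
    ls "$prefix"/lib/liblept* >/dev/null 2>&1 &&
        [ -f "$prefix/include/leptonica/allheaders.h" ]
}

if has_leptonica /usr/local; then
    echo "Leptonica found under /usr/local."
else
    echo "Leptonica not found under /usr/local; check the output of make install."
fi
```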

Installing Tesseract

Go to /opt/src.
```
cd /opt/src
```

Download Tesseract’s latest source distribution from its downloads page and extract it.

```
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -xvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
```

Configure it.

```
./autogen.sh
./configure LDFLAGS=-L/usr/local/lib
```

Build and install it.
```
make
make install
make clean
```

Verification

All training files have to be in /usr/local/share/tessdata. Download a language data archive file from Tesseract’s downloads page, extract it, and move its contents to /usr/local/share/tessdata. You can also train your own language data. Then you’ll be able to run Tesseract.
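
The extract-and-move step can be sketched as a small shell function. This is only an illustration: `install_traineddata` is a hypothetical helper, and it assumes the archive keeps its data files inside a directory named `tessdata`, which you should verify for the archive you download.

```shell
# Hypothetical helper: unpack a language data archive and copy the files
# from its tessdata directory into the target tessdata directory.
# Usage: install_traineddata ARCHIVE [TESSDATA_DIR]
install_traineddata() {
    archive="$1"
    tessdata="${2:-/usr/local/share/tessdata}"

    # Unpack into a temporary directory so nothing is left behind.
    workdir=$(mktemp -d) || return 1
    tar -xf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }

    mkdir -p "$tessdata"
    # Assumption: data files live under some .../tessdata/ directory.
    # Files are copied flat into the target directory.
    find "$workdir" -type f -path '*/tessdata/*' -exec cp {} "$tessdata"/ \;

    rm -rf "$workdir"
}

# Example call (the archive name is a placeholder for the file you downloaded):
# install_traineddata language-data.tar.gz
```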

```
tesseract -l eng input.tif output
```

This will create an output.txt file with the OCR results.
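
If you want to script this check, a minimal sketch could look like the following. `check_ocr_output` is a hypothetical helper, and `input.tif` stands for whatever test image you have at hand; the block does nothing if Tesseract or the image is missing.

```shell
# Hypothetical helper: succeed if the result file exists and is not empty.
check_ocr_output() {
    [ -s "$1" ]
}

# Only attempt the run when tesseract is installed and a test image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f input.tif ]; then
    tesseract -l eng input.tif output
    if check_ocr_output output.txt; then
        echo "OCR produced text in output.txt."
    else
        echo "output.txt is missing or empty; check the language data and the image."
    fi
fi
```

An empty result file does not necessarily mean the build is broken. It can also mean that no text was recognized in the image.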

If something is not working for you, leave a comment.