How to build Tesseract on Cygwin
Tesseract is the most accurate and most adaptable open source OCR engine I know of.
For my master’s thesis, I needed to change the inner workings of Tesseract, so I had to compile it from source.
Cygwin is a set of GNU tools for Microsoft Windows which gives you a POSIX environment on Windows.
Here, I’ll document how to build Tesseract on Cygwin, because that is easier than building with MinGW or Visual Studio, and it is not documented on the Compiling wiki page.
Installing Cygwin
- Download Cygwin from the download page (both 32-bit and 64-bit versions will work).
- Run the installer.
- Use C:\Cygwin or C:\Cygwin64 as root directory.
- When you are asked to select the desired packages, set Base, Devel and Graphics to Install. You can Skip at least Publishing, Gnome and KDE, probably even more, in order to save time during installation. Leave all other packages at Default.
- Continue the installation process until you are done.
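Before moving on, it can help to confirm that the tools used in the rest of this post actually ended up on your PATH. The following is only a sketch: the tool list is my own guess at what the later steps need, and `missing_tools` is a helper name I made up for this example.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of Cygwin.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed list of tools the build steps rely on; adjust as needed.
missing=$(missing_tools gcc g++ make wget tar autoconf automake libtoolize)

if [ -n "$missing" ]; then
    echo "Missing tools, re-run the Cygwin installer to add them:"
    echo "$missing"
else
    echo "All build tools found."
fi
```

If anything is reported as missing, running the Cygwin installer again and selecting the corresponding package should fix it, since the installer keeps your existing installation.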
Installing Leptonica
In order to build Tesseract, we need to build Leptonica first.
- Open a Cygwin Terminal.
Create a directory where you can build the library.
mkdir -p /opt/src && cd /opt/src
Get the source.
wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
Or use the latest source package from Leptonica’s downloads page.
Extract it.
tar -xvf leptonica-1.70.tar.gz
cd leptonica-1.70
Since giflib is not available in Cygwin, we have to configure it accordingly.
./configure --without-giflib
Build and install it.
make
make install
make clean
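To check that Leptonica really landed where Tesseract’s configure step will look for it, something like the sketch below can be used. It assumes the default /usr/local prefix and that the install places a header at include/leptonica/allheaders.h and a library whose name starts with liblept in lib. I have not verified these paths for every Leptonica version, and `check_leptonica` is a made-up helper name.

```shell
#!/bin/sh
# Report whether a Leptonica install is visible under the given prefix.
# The header and library locations are assumptions about the default layout.
check_leptonica() {
    prefix=${1:-/usr/local}
    if [ ! -f "$prefix/include/leptonica/allheaders.h" ]; then
        echo "header missing"
        return 1
    fi
    if ! ls "$prefix"/lib/liblept* >/dev/null 2>&1; then
        echo "library missing"
        return 1
    fi
    echo "Leptonica found under $prefix"
}

check_leptonica /usr/local || echo "Try running 'make install' again in the Leptonica source directory."
```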
Installing Tesseract
Go to /opt/src.
cd /opt/src
Download Tesseract’s latest source distribution from here.
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -xvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
Configure it.
./autogen.sh
./configure LDFLAGS=-L/usr/local/lib
Build and install it.
make
make install
make clean
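Because each of these steps depends on the previous one succeeding, it may be convenient to chain them so that the first failure stops the sequence. The sketch below is one possible way to do that; `run_steps` is a helper I invented for illustration, and the commands passed to it are the ones from this section.

```shell
#!/bin/sh
# Run each command line in order and stop at the first one that fails.
# (run_steps is a hypothetical helper, not part of Tesseract or Cygwin.)
run_steps() {
    for step in "$@"; do
        echo "==> $step"
        sh -c "$step" || { echo "FAILED: $step" >&2; return 1; }
    done
}

# Only attempt the build when run from inside the Tesseract source tree.
if [ -x ./autogen.sh ]; then
    run_steps "./autogen.sh" \
              "./configure LDFLAGS=-L/usr/local/lib" \
              "make" \
              "make install"
fi
```

The step that failed is printed to stderr, which makes it easier to tell whether the problem is in configuration (often a library that could not be found) or in compilation.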
Verification
All training files have to be in /usr/local/share/tessdata. Download a language data archive file from Tesseract’s downloads page, extract it and move its contents to /usr/local/share/tessdata. You can also train your own language data. Then you’ll be able to run Tesseract.
tesseract -l eng input.tif output
This will create an output.txt file with the OCR results.
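If Tesseract complains that it cannot load a language, the first thing to check is whether the language data is where it is expected. Here is a minimal sketch, assuming the data file is named after the language code with a .traineddata extension (as in eng.traineddata); `has_language` is a helper name I made up.

```shell
#!/bin/sh
# Succeed when <lang>.traineddata exists in the given tessdata directory.
# The file naming scheme is an assumption based on the eng data file.
has_language() {
    lang=$1
    dir=${2:-/usr/local/share/tessdata}
    [ -f "$dir/$lang.traineddata" ]
}

if has_language eng; then
    echo "eng language data found"
else
    echo "eng.traineddata is missing from /usr/local/share/tessdata"
fi
```

The optional second argument lets you point the check at a different directory, for example when you keep self-trained language data somewhere else before copying it over.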
If something is not working for you, leave a comment.