How to build Tesseract on Cygwin
Tesseract is the most accurate and most adaptable open source OCR engine I know of.
For my master's thesis, I needed to be able to change the inner workings of Tesseract. That’s why I had to compile it.
Cygwin is a set of GNU tools for Microsoft Windows which gives you a POSIX environment on Windows.
Here, I’ll document how to build Tesseract on Cygwin, because that is easier than building on MinGW or in Visual Studio and it is not documented on the Compiling wiki page.
Installing Cygwin
- Download Cygwin from the download page (both 32-bit and 64-bit versions will work).
- Run the installer.
- Use C:\Cygwin or C:\Cygwin64 as root directory.
- When you are asked to select the desired packages, set Base, Devel and Graphics to Install. You can Skip at least Publishing, Gnome and KDE, probably even more, in order to save time during installation. Leave all other packages at Default.
- Continue the installation process until you are done.
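Once the installer has finished, a quick check from a Cygwin Terminal can confirm that the build tools actually ended up on the PATH. This is only a minimal sketch: the function name `missing_tools` is made up for illustration, and the list of tools is my assumption about what the two builds below need, not an official requirement list.

```shell
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list for the Leptonica and Tesseract builds.
missing=$(missing_tools gcc g++ make wget tar autoconf automake libtoolize)
if [ -n "$missing" ]; then
    echo "Missing tools, rerun the Cygwin installer to add them:"
    echo "$missing"
else
    echo "All build tools found."
fi
```

If anything is reported as missing, running the Cygwin installer again and selecting the corresponding package should fix it.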
Installing Leptonica
In order to build Tesseract, we need to build Leptonica first.
- Open a Cygwin Terminal.
- Create a directory where you can build the library.
    mkdir -p /opt/src && cd /opt/src
- Get the source.
    wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
  Or use the latest source package from Leptonica’s downloads page.
- Extract it.
    tar -xvf leptonica-1.70.tar.gz
    cd leptonica-1.70
- Since giflib is not available in Cygwin, we have to configure it accordingly.
    ./configure --without-giflib
- Build and install it.
    make
    make install
    make clean
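Before moving on, it can be worth checking that the Leptonica install really produced the files Tesseract will look for. The sketch below assumes the default /usr/local prefix and the usual layout of a Leptonica install (headers under include/leptonica, libraries named liblept* under lib). The function name `leptonica_installed` is hypothetical.

```shell
# Return success if Leptonica's header and library are present under a prefix.
# $1: install prefix (defaults to /usr/local, the assumed default of configure).
leptonica_installed() {
    prefix=${1:-/usr/local}
    # allheaders.h is the umbrella header that Leptonica installs.
    [ -f "$prefix/include/leptonica/allheaders.h" ] || return 1
    # Any liblept* file counts (static library, libtool archive or import library).
    ls "$prefix"/lib/liblept* >/dev/null 2>&1
}

if leptonica_installed; then
    echo "Leptonica found in /usr/local."
else
    echo "Leptonica not found in /usr/local, check the output of make install." >&2
fi
```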
Installing Tesseract
- Go to /opt/src.
    cd /opt/src
- Download Tesseract’s latest source distribution and extract it.
    wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
    tar -xvf tesseract-ocr-3.02.02.tar.gz
    cd tesseract-ocr
- Configure it.
    ./autogen.sh
    ./configure LDFLAGS=-L/usr/local/lib
- Build and install it.
    make
    make install
    make clean
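Both builds follow the same configure, make, make install pattern, so they can be wrapped in a small helper that stops at the first failing step instead of carrying on with a broken build. This is a sketch, not part of either project: `build_and_install` is a name I made up, and it assumes the source tree has already been downloaded and extracted.

```shell
# Configure, build and install an extracted source tree.
# $1: source directory; all remaining arguments are passed to ./configure.
build_and_install() {
    src_dir=$1
    shift
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$src_dir" || exit 1
        # Only run autogen.sh if the tree ships one (Tesseract does).
        if [ -x ./autogen.sh ]; then
            ./autogen.sh || exit 1
        fi
        # Each step runs only if the previous one succeeded.
        ./configure "$@" && make && make install && make clean
    )
}
```

With this helper, the two builds described here would become `build_and_install /opt/src/leptonica-1.70 --without-giflib` and `build_and_install /opt/src/tesseract-ocr LDFLAGS=-L/usr/local/lib`.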
Verification
All training files have to be in /usr/local/share/tessdata. Download a language data archive file from Tesseract’s downloads page, extract it and move its contents to /usr/local/share/tessdata. You can also train your own language data. Then you’ll be able to run Tesseract.
tesseract -l eng input.tif output
This will create an output.txt file with the OCR results.
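A missing language file is a likely reason for this step to fail, so it can help to check for it before running the OCR. The following is a minimal sketch: `has_language` is a hypothetical helper, and it assumes the language data is stored as `<lang>.traineddata` in the tessdata directory mentioned above.

```shell
# Return success if the traineddata file for a language exists.
# $1: language code, e.g. eng
# $2: tessdata directory (defaults to /usr/local/share/tessdata)
has_language() {
    lang=$1
    tessdata=${2:-/usr/local/share/tessdata}
    [ -f "$tessdata/$lang.traineddata" ]
}

if has_language eng; then
    # Same command as above, followed by printing the recognized text.
    tesseract -l eng input.tif output && cat output.txt
else
    echo "eng.traineddata not found in /usr/local/share/tessdata" >&2
fi
```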
If something is not working for you, leave a comment.