tesseract-ocr is an OCR engine originally developed by Hewlett Packard and now sponsored by Google. It is highly accurate and will read a binary, gray, or color image and output text.
Free Open Source Document Management System
go-ocr is a tool for extracting plain text from scanned documents in PDF or DjVu format and postprocessing the text with user-defined rewriting rules to remove OCR artefacts and irregularities.
OCRPDF is a tool for extracting plain text from scanned documents and postprocessing the text with user-defined rewriting rules to remove OCR artefacts and irregularities.
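The rewriting-rule postprocessing mentioned in the two entries above can be illustrated with a minimal Python sketch. It is not the rule syntax of go-ocr or OCRPDF, which the descriptions do not specify; the RULES list and the apply_rules helper are hypothetical names, and the rules are assumed to be ordered regular-expression substitutions.

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs applied in order.
# These are illustrative assumptions, not rules shipped with any tool.
RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words hyphenated across a line break
    (re.compile("\ufb01"), "fi"),           # expand the "fi" ligature often left by OCR
    (re.compile(r"[ \t]+"), " "),           # collapse runs of spaces and tabs
]


def apply_rules(text, rules=RULES):
    """Apply each rewriting rule to the OCR output, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    raw = "recog-\nnition of  \ufb01les"
    print(apply_rules(raw))
```

Keeping the rules as data rather than code means a user can add or reorder them without touching the function that applies them.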
OCRmyPDF adds an invisible text layer to PDF documents after passing them through the Tesseract OCR engine. The output is a PDF/A file with a selectable but invisible text layer above the scanned page images, which allows later searching and archiving.
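As a rough sketch of how OCRmyPDF might be driven from a script, the Python below builds and runs the ocrmypdf command line. It assumes the ocrmypdf executable is installed and on PATH; the helper names build_ocrmypdf_command and add_text_layer and the file names are hypothetical, and only the -l and --output-type options of the real CLI are used.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Return the argument list for an ocrmypdf run that writes PDF/A."""
    return [
        "ocrmypdf",
        "-l", language,            # Tesseract language code
        "--output-type", "pdfa",   # request PDF/A output
        input_pdf,
        output_pdf,
    ]


def add_text_layer(input_pdf, output_pdf, language="eng"):
    """Run ocrmypdf if it is available; raise a clear error otherwise."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf executable not found on PATH")
    subprocess.run(
        build_ocrmypdf_command(input_pdf, output_pdf, language),
        check=True,  # raise CalledProcessError on a non-zero exit status
    )


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")))
```

Separating command construction from execution lets the argument list be inspected or logged without OCRmyPDF being installed.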
OCRFeeder is a document layout analysis and optical character recognition system. Given a set of images, it automatically outlines their contents, distinguishes graphics from text, and performs OCR on the text. It can export to multiple formats, the main one being ODT. It features a complete GTK graphical user interface that allows users to correct any unrecognized characters, define or correct bounding boxes, set paragraph styles, clean the input images, import PDFs, save and lo