Converting Scanned Documents to Text Using OCR

Published: 19th April 2011
Sooner or later, most of us end up with a scanned document whose text we need but do not want to retype. The solution is OCR (Optical Character Recognition). The problem is that not all OCR software is equally effective, and the most effective programs are not free. Among the free ones, the one we feel stands out is SimpleOCR: its character recognition is very effective and it is very simple to use. It can be downloaded from here.

Note: The free version of SimpleOCR can read only machine-print, not handwritten documents.



The installation is pretty straightforward. After installing, open SimpleOCR from the Start menu. A popup will ask you to choose between Machine Print and Hand Writing. Choose Machine Print, and select any profile in the other popup box.



Now the SimpleOCR screen should look something like this:



SimpleOCR window

Now to begin the conversion to text:



Step 1: Select Add Page.



(If your document is not clear and has many stray marks, the whole process becomes cumbersome. Try to remove any unwanted marks using photo-editing software such as MS Paint before loading the files. Otherwise, skip this for now; we will take care of it in Step 4.)




Step 2: Choose one of the options and load the document (Note: SimpleOCR supports only the .jpg, .tif and .bmp formats).
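If you have a whole folder of scans, it can help to check beforehand which files are already in one of the formats listed above. The following is only a minimal sketch in Python, separate from SimpleOCR itself; the function names are made up for illustration, and it assumes that only the three extensions named in this article count as supported.

```python
from pathlib import Path

# Formats this article lists as supported by SimpleOCR (assumption: exactly these).
SUPPORTED_EXTENSIONS = {".jpg", ".tif", ".bmp"}


def is_supported_scan(filename: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(filenames):
    """Split names into (ready, needs_conversion) lists, preserving order."""
    ready, needs_conversion = [], []
    for name in filenames:
        if is_supported_scan(name):
            ready.append(name)
        else:
            needs_conversion.append(name)
    return ready, needs_conversion


if __name__ == "__main__":
    ready, todo = split_by_support(["page1.JPG", "page2.png", "page3.tif"])
    print("Ready to load:", ready)
    print("Convert first:", todo)
```

Anything that lands in the second list would need to be saved in a supported format with an image editor before loading it.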



Step 3: Choose Continue in the Preview Scanned Image window.



Step 4: In the SimpleOCR window, you will now be able to see the document. Choose the yellow icon from the toolbar (as highlighted below); it is the tool for ignoring regions of the document. Select the areas you do not need and the areas with stray marks. The light blue areas are the ones I have ignored.

It is always a good idea to ignore bullets, lines, and special characters too, as they end up confusing the recognition. Also notice the top and bottom parts of the page: they usually have a black line. Ignore these parts too.



Ignoring regions
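To make the idea of ignoring a region more concrete, here is a small Python sketch. It is not how SimpleOCR works internally, which this article does not cover; it only illustrates the effect, assuming the page is a grid of 8-bit grayscale values and that an ignored region is a rectangle painted over with white. The function name and the box format are hypothetical.

```python
WHITE = 255  # assumed background value for an 8-bit grayscale page


def ignore_regions(page, regions):
    """Return a copy of `page` with every region painted white.

    `page` is a list of rows of grayscale values. Each region is a
    (left, top, right, bottom) box; right and bottom are exclusive.
    Boxes that extend past the page are clipped to its edges.
    """
    cleaned = [row[:] for row in page]  # leave the original page untouched
    height = len(cleaned)
    width = len(cleaned[0]) if cleaned else 0
    for left, top, right, bottom in regions:
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                cleaned[y][x] = WHITE
    return cleaned


if __name__ == "__main__":
    # A tiny 4x4 "page" with black lines (0) along the top and bottom edges.
    page = [
        [0, 0, 0, 0],
        [255, 0, 255, 255],
        [255, 255, 255, 255],
        [0, 0, 0, 0],
    ]
    for row in ignore_regions(page, [(0, 0, 4, 1), (0, 3, 4, 4)]):
        print(row)
```

The two boxes in the example cover the top and bottom rows, which mirrors the advice above about the black lines at the page edges.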

Step 5: Choose Convert to text on the right side of the toolbar. Give it a few seconds.



Step 6: The image will now be converted to text, to the extent SimpleOCR can do so automatically. Now it is time for the correction work. The bottom half of the window shows the text form of the scanned document above.




Click on a word in the bottom half and the corresponding word in the image is highlighted automatically. There are a few things you can do:

1. You can type the correct characters in place of the wrong characters.

2. Select Keep as image to keep the original word in the image format.

3. Select Merge to merge the selected word with the next word.

4. If you want, you can use the keyboard shortcuts (Alt+x, where x is the underlined character); the focus then keeps moving on to the next word.



Correction of mistake in Recognition
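The same kind of correction can be sketched in code for text that has already been saved. The Python example below is only an illustration and not part of SimpleOCR: it assumes you supply your own word list, uses the standard library's difflib to suggest replacements for words it does not recognise, and includes a hypothetical helper that mimics the Merge option by joining a word with the one after it.

```python
import difflib


def suggest_corrections(words, vocabulary, cutoff=0.75):
    """Map each word missing from `vocabulary` to its closest known words.

    Matching ignores case. A word with no sufficiently close match maps
    to an empty list, so it can be checked by hand.
    """
    known = sorted({entry.lower() for entry in vocabulary})
    known_set = set(known)
    suggestions = {}
    for word in words:
        key = word.lower()
        if key in known_set:
            continue
        suggestions[word] = difflib.get_close_matches(key, known, n=3, cutoff=cutoff)
    return suggestions


def merge_with_next(words, index):
    """Return a new list where words[index] is joined with the next word."""
    if not 0 <= index < len(words) - 1:
        raise IndexError("index must refer to a word that has a following word")
    return words[:index] + [words[index] + words[index + 1]] + words[index + 2:]


if __name__ == "__main__":
    # "rn" read in place of "m", and one word split in two, are typical OCR slips.
    recognised = ["The", "docurnent", "is", "scan", "ned"]
    vocabulary = ["the", "document", "is", "scanned"]
    print(suggest_corrections(recognised, vocabulary))
    print(merge_with_next(recognised, 3))
```

A word list will never cover names or unusual terms, so the suggestions are best treated as hints to review rather than automatic replacements.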



Step 7: Click on Save as, choose a file format and name, and save the document.



Hope this guide helps you. If you need more, there are commercial alternatives such as I.R.I.S. and ABBYY, which are much more accurate and can even recognise the formatting. But for general use, SimpleOCR is as good as any other free tool. Comment and let us know about any other effective OCR software.

This article is copyright
Source: http://jaimin.articlealley.com/converting-scanned-documents-to-text-using-ocr-2193328.html

