
Re: [Sheflug] Convert JPEG to Acsii



On 18-Nov-07 18:24:48, Richard Ibbotson wrote:
> 
>> when you say in them, do you mean in the image or do you
>> mean as meta info?
> 
> Well, (more info) I've used pdf2html to convert a PDF of some
> pages that I have written for a magazine. This has produced
> JPEGs of the pages that are linked to an HTML page.

If you have the PDF file[s], then Acrobat Reader allows you to use
the mouse to copy displayed text after you open the file. E.g.

  acroread somefile.pdf

Then click on the button labelled "T" in the top menu-bar.
The hand-shaped cursor changes to an I-shaped one, and you
can use it (hold down the left mouse button) in the usual
way to copy text from one window to another (say, into a
text editor that you have opened in a second window).

One point to watch: In PDF, consecutive text is not always
stored consecutively in the file, and you may well find that
highlighting too many lines with the mouse will highlight
a "consecutive text" block which is not really consecutive.

For example, if your pages have the text in two columns,
flowing down each column, you may find that when you highlight
the first two lines of a paragraph in the left-hand column,
they will copy across with a line from the second column
between them. You can see that this is going to happen by
the way the highlighting builds up as you sweep the mouse.

The only way round this that I know is to do it one line
at a time. (It can happen with tables, too, and also in more
unexpected ways.) This is tedious, but it works (I note that
you refer to "some pages that I wrote", so maybe it would
not take too long).
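
To illustrate why the lines get interleaved, here is a minimal
Python sketch. It is only a model, not how Acrobat Reader actually
selects text: the fragment coordinates and the gutter position are
invented for the example. It shows that reading text fragments
straight across the page mixes the two columns, while grouping them
by column first keeps each column together.

```python
# Each fragment is (x, y, text): x is the left edge, y the distance
# from the top of the page. All values are invented for illustration.
fragments = [
    (50, 100, "Left line 1"),
    (300, 100, "Right line 1"),
    (50, 112, "Left line 2"),
    (300, 112, "Right line 2"),
]

def naive_order(frags):
    """Read straight across the page: top to bottom, then left to right."""
    return [text for x, y, text in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_order(frags, gutter_x):
    """Read the left column top to bottom, then the right column."""
    left = sorted((f for f in frags if f[0] < gutter_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= gutter_x), key=lambda f: f[1])
    return [text for x, y, text in left + right]

print(naive_order(fragments))
print(column_order(fragments, gutter_x=200))
```

The first list has a right-column line between the two left-column
lines, which is the effect described above; the second keeps the
left column intact.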

> I'd like to extract the text from the JPEGs of the pages so
> that I don't have to completely re-write all of it. If I
> can do something like copy and paste I can then update the
> report that I wrote and change it to something else so that
> I am not breaking anything to do with copyright on the part
> of someone else. If I change it sufficiently then it won't
> look anything like the original finished feature that I
> produced for a magazine.

There's no way you can "copy and paste" text from a JPEG
which is displaying text, since the file is simply a
bit-mapped image of something, and contains no internal
information whatever that it is an image of text. On the
other hand, a PDF file knows very well what is text and
what is image (which is why you can copy-&-paste it).
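
If you are not sure which kind of file you are holding, the first
few bytes will tell you. A small Python sketch (the function names
are my own, purely for illustration): a PDF begins with the
characters "%PDF-", and a JPEG begins with the bytes FF D8 FF.

```python
def sniff_format(first_bytes):
    """Guess the file type from its leading bytes."""
    if first_bytes.startswith(b"%PDF-"):
        return "pdf"    # text can be copied out of the file
    if first_bytes.startswith(b"\xff\xd8\xff"):
        return "jpeg"   # bitmap only: OCR is needed to get text
    return "unknown"

def sniff_file(path):
    """Read just enough of the file at path to classify it."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))

print(sniff_format(b"%PDF-1.4\n"))
print(sniff_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```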

The xocr suggestion is one way to go, if you have to use
the JPEG. Xocr now works much better than it used to (for
years it was a close approximation to no good at all).

Even here, though, I'd feel inclined to start by using
say ImageMagick or the GIMP to blank out any non-text
material in a copy of the file, since this will save
xocr having to decide whether it can make anything of
that material. Also, different OCR packages may give
best results with input files in particular formats.
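
As a rough sketch of the blanking step, the following Python builds
an ImageMagick "convert" command that paints a white rectangle over
one region and writes the result in another format (ImageMagick
picks the output format from the file extension). The file names
and the pixel coordinates are hypothetical; you would read the real
coordinates off the image in a viewer or in the GIMP.

```python
import subprocess

def blank_command(src, dst, box):
    """Build an ImageMagick command that whites out the rectangle box.

    box is (x0, y0, x1, y1) in pixels. The output format follows the
    extension of dst, e.g. .pnm, .pbm or .tif.
    """
    x0, y0, x1, y1 = box
    return [
        "convert", src,
        "-fill", "white",
        "-draw", f"rectangle {x0},{y0} {x1},{y1}",
        dst,
    ]

def run(cmd):
    """Run the command; raises CalledProcessError if convert fails."""
    subprocess.run(cmd, check=True)

# Hypothetical example: blank a picture in the top-right corner of
# page1.jpg and save a PNM copy for the OCR program to read.
cmd = blank_command("page1.jpg", "page1-text.pnm", (400, 50, 780, 300))
print(" ".join(cmd))
```

Calling run(cmd) would execute it, assuming ImageMagick is installed.
Working on a copy, as here, leaves the original JPEG untouched.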

There's a review of sundry OCR packages available for
Linux (with readers' comments) at:

  http://groundstate.ca/ocr

OCR programs reviewed are:
Gocr, Clara OCR, Ocre, Ocrad, Tesseract, Ocropus, Asprise OCR.

Scores:

Name:   gocr
Location:       http://jocr.sourceforge.net/    
Version:        0.44
Input Format:   pnm
Accuracy:       94%
Ease of Use:    4/5

Name:   Clara OCR
Location:       http://www.geocities.com/claraocr/
Version:        20031214
Input Format:   pbm
[Accuracy not evaluated]
Ease of Use:    0/5

Name:   ocre
Location:       http://lem.eui.upm.es/ocre.html
Version:        0.026
Input Format:   pgm/pbm
[Accuracy not evaluated]
Ease of Use:    3/5

Name:   Ocrad
Location:       http://www.gnu.org/software/ocrad/ocrad.html
Version:        0.15
Input Format:   pbm/pgm
Accuracy:       97%
Ease of Use:    4/5

Name:   Tesseract
Location:       http://code.google.com/p/tesseract-ocr/
Version:        1.04b
Input Format:   tiff
Accuracy:       99%
Ease of Use:    2/5
["While Tesseract is the least user-friendly of the command
 line applications, it is by far the most accurate, the most
 active, and the most promising."]

Name:   Ocropus
Location:       http://code.google.com/p/ocropus/
Version:        svn (20070523)
Input Format:   many
Accuracy:       99%
Ease of Use:    1/5

Name:   Asprise OCR
Location:       http://asprise.com/product/ocr/index.php?lang=java
Version:        3.0
Input Format:   tiff/pdf
Accuracy:       91.5%
Ease of Use:    3/5
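
If you want to compare these at a glance, here is a small Python
sketch that holds the scores above as data and ranks the packages
that were given an accuracy figure. The ranking rule (accuracy
first, ease of use as tie-breaker) is my own assumption, not the
reviewer's.

```python
# Scores copied from the review summary above. accuracy is None where
# the review did not evaluate it; ease is the score out of 5.
reviews = [
    {"name": "gocr", "input": "pnm", "accuracy": 94.0, "ease": 4},
    {"name": "Clara OCR", "input": "pbm", "accuracy": None, "ease": 0},
    {"name": "ocre", "input": "pgm/pbm", "accuracy": None, "ease": 3},
    {"name": "Ocrad", "input": "pbm/pgm", "accuracy": 97.0, "ease": 4},
    {"name": "Tesseract", "input": "tiff", "accuracy": 99.0, "ease": 2},
    {"name": "Ocropus", "input": "many", "accuracy": 99.0, "ease": 1},
    # Spelling follows the vendor URL (asprise.com).
    {"name": "Asprise OCR", "input": "tiff/pdf", "accuracy": 91.5, "ease": 3},
]

def ranked(items):
    """Sort scored packages by accuracy, then ease of use, best first."""
    scored = [r for r in items if r["accuracy"] is not None]
    return sorted(scored, key=lambda r: (-r["accuracy"], -r["ease"]))

for r in ranked(reviews):
    print(f'{r["name"]:12} {r["accuracy"]:5.1f}%  ease {r["ease"]}/5  '
          f'input {r["input"]}')
```

The "input" field is the thing to check before converting the JPEG,
since it says which format each program expects.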


There's also at least one commercial OCR package for Linux:

  http://www.vividata.com/index.html

I've used it, and it's good (I often got accuracies well over 99%,
i.e. error rates down at 0.1% or thereabouts).
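
To put such percentages in terms of proof-reading effort, here is a
short Python calculation. It assumes that "accuracy" means the
percentage of characters recognised correctly (the usual meaning,
but an assumption on my part as far as the review goes) and takes a
hypothetical page of 2000 characters.

```python
def expected_errors(chars, accuracy_pct):
    """Expected number of wrong characters, rounded to the nearest one."""
    return round(chars * (100 - accuracy_pct) / 100)

PAGE_CHARS = 2000  # hypothetical page length, for illustration only

for accuracy in (94, 97, 99, 99.9):
    errors = expected_errors(PAGE_CHARS, accuracy)
    print(f"{accuracy}% accurate: about {errors} errors per page")
```

So on this assumption the gap between 94% and 99.9% is the gap
between about 120 corrections per page and about 2.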

Hoping this helps!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding@xxxxxxxxxxxxxxxx>
Fax-to-email: +44 (0)870 094 0861
Date: 18-Nov-07                                       Time: 19:16:33
------------------------------ XFMail ------------------------------

_______________________________________________
        Sheffield Linux User's Group
  http://www.sheflug.org.uk/mailfaq.html
 GNU - The choice of a complete generation