• OCR image to text

    From Bill Powell@21:1/5 to All on Sun Jul 14 02:49:34 2024
    I have a series of one-page images that are really images and not text even though they look like they're just a page of simple text in the same font.

    Is there a way to easily OCR a PDF to actual text on Windows for free?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter@21:1/5 to Geoff on Sun Jul 14 03:03:21 2024
    Geoff <geoff@geoffwood.org> wrote:

    Is there a way to easily OCR a PDF to actual text on Windows for free?

    https://letmegooglethat.com/?q=free+ocr+to+pdf

    geoff

    You've never actually run that search, have you?
    If you did, you'd know all you'll get are advertising shills.
    All of which are online PDF converters which are huge privacy scams.

    As far as I am aware, there is only one free Windows OCR converter extant. That's GNU OCR (GOCR, aka JOCR) https://jocr.sourceforge.net/

    The gocr help just says it works on "pnm,pgm,pbm,ppm,pcx..." files. https://jocr.sourceforge.net/examples.html https://www-e.ovgu.de/jschulen/ocr/download.html
    "Windows-binary gocr049.exe" v0.49 154kB by Peter B L Meijer, Oct 2010 http://www-e.uni-magdeburg.de/jschulen/ocr/gocr049.exe
    Name: gocr049.exe
    Size: 153600 bytes (150 KiB)
    SHA256: 1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59

    gocr049.exe
    Optical Character Recognition --- gocr 0.49 20100924
    Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
    released under the GNU General Public License
    use option -h for help

    gocr049.exe -h
    Optical Character Recognition --- gocr 0.49 20100924
    Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
    released under the GNU General Public License
    using: gocr [options] pnm_file_name # use - for stdin
    options (see gocr manual pages for more details):
    -h, --help
    -i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
    -o name - output file (redirection of stdout)
    -e name - logging file (redirection of stderr)
    -x name - progress output to fifo (see manual)
    -p name - database path including final slash (default is ./db/)
    -f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
    -l num - threshold grey level 0<160<=255 (0 = autodetect)
    -d num - dust_size (remove small clusters, -1 = autodetect)
    -s num - spacewidth/dots (0 = autodetect)
    -v num - verbose (see manual page)
    -c string - list of chars (debugging, see manual)
    -C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
    -m num - operation modes (bitpattern, see manual)
    -a num - value of certainty (in percent, 0..100, default=95)
    -u string - output this string for every unrecognized character
    examples:
    gocr -m 4 text1.pbm # do layout analyzis
    gocr -m 130 -p ./database/ text1.pbm # extend database
    djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe

    webpage: http://jocr.sourceforge.net/

    When I tested it just now, it worked but it's prone to spelling errors
    even on perfectly good text so, while it works, it doesn't work well.
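
    The -l (grey threshold) and -u (marker for unrecognized characters) options in the help text above suggest one way to hunt for better settings: run the same page at a few thresholds and keep the output with the fewest markers. A rough sketch in Python, assuming gocr049.exe is on the PATH. The threshold values, the "~" marker and the function names are arbitrary choices for illustration, and a low marker count is only a crude proxy for accuracy, since it says nothing about characters that were read wrongly.

```python
import subprocess


def gocr_command(pnm, threshold, marker="~", gocr_exe="gocr049.exe"):
    """Build a gocr call that prints UTF8 text to stdout (no -o given)."""
    return [gocr_exe, "-i", pnm, "-l", str(threshold), "-u", marker, "-f", "UTF8"]


def pick_best(outputs, marker="~"):
    """Return the (threshold, text) pair whose text has the fewest markers."""
    return min(outputs.items(), key=lambda item: item[1].count(marker))


def sweep(pnm, thresholds=(120, 160, 200)):
    """Run gocr once per threshold (assumed values) and keep the cleanest output."""
    outputs = {}
    for t in thresholds:
        result = subprocess.run(gocr_command(pnm, t), capture_output=True, check=True)
        outputs[t] = result.stdout.decode("utf-8", errors="replace")
    return pick_best(outputs)
```

    Calling sweep("testpage.pnm") would return the threshold and the text it produced. If the page itself contains "~", another marker would have to be picked.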

    a. I couldn't get gocr to convert a docx or pdf to anything
    gocr049.exe -i "testpage.docx" -o testpage.txt -f UTF8
    b. Then I couldn't get ImageMagick to convert the pdf to anything
    convert testpage.pdf testpage.pnm
    c. So I saved testpage.pdf as testpage.png for ImageMagick to convert
    convert testpage.png testpage.pnm
    d. gocr049.exe -i "testpage.pnm" -o testpage.txt -f UTF8
    (it had a tremendous amount of spelling errors, but it worked)
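
    Steps c and d can be scripted for a whole folder of pages. A minimal sketch in Python, assuming ImageMagick's convert and gocr049.exe are both on the PATH and that each page has already been saved as a PNG. The folder name "pages" and the helper names are made up for illustration. Newer ImageMagick releases name the command magick, so convert_exe may need changing. Step b may have failed because ImageMagick hands PDFs to Ghostscript, which has to be installed separately.

```python
import subprocess
from pathlib import Path


def build_commands(png, convert_exe="convert", gocr_exe="gocr049.exe"):
    """Return the two command lines (steps c and d) for one PNG page."""
    png = Path(png)
    pnm = png.with_suffix(".pnm")   # intermediate image that gocr can read
    txt = png.with_suffix(".txt")   # final OCR output
    to_pnm = [convert_exe, str(png), str(pnm)]
    to_txt = [gocr_exe, "-i", str(pnm), "-o", str(txt), "-f", "UTF8"]
    return to_pnm, to_txt


def ocr_folder(folder="pages"):
    """Run both steps for every *.png in folder (hypothetical folder name)."""
    for png in sorted(Path(folder).glob("*.png")):
        for cmd in build_commands(png):
            subprocess.run(cmd, check=True)  # stop on the first failure
```

    Running ocr_folder("pages") would leave a .pnm and a .txt next to each .png. The spelling errors noted in step d would still be there. This only saves the typing.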

    As far as I'm aware, there is no other Windows OCR freeware extant.

  • From Geoff@21:1/5 to Bill Powell on Sun Jul 14 13:32:09 2024
    On 14/07/2024 12:49 pm, Bill Powell wrote:
    I have a series of one-page images that are really images and not text even though they look like they're just a page of simple text in the same font.

    Is there a way to easily OCR a PDF to actual text on Windows for free?


    https://letmegooglethat.com/?q=free+ocr+to+pdf

    geoff

  • From Abandoned Trolley@21:1/5 to Bill Powell on Sun Jul 14 06:26:41 2024
    On 14/07/2024 01:49, Bill Powell wrote:
    I have a series of one-page images that are really images and not text even though they look like they're just a page of simple text in the same font.

    Is there a way to easily OCR a PDF to actual text on Windows for free?


    There's an OCR reader / converter thing built into MS Word (it might be
    called MS Lens?) if it's any help, but a clear explanation of what you
    have and what you want might be more useful.

    You say you have a series of one-page images, which I assume are digital
    files and not bits of paper?

    If they are images, then they might be JPEGs or something, but you don't say.

    Or they might be a .pdf, as you say you want to "easily OCR a PDF to
    actual text".

    If they are .pdf, then simply cut and paste?

  • From Geoff@21:1/5 to Peter on Sun Jul 14 18:56:55 2024
    On 14/07/2024 2:03 pm, Peter wrote:
    Geoff <geoff@geoffwood.org> wrote:

    Is there a way to easily OCR a PDF to actual text on Windows for free?

    https://letmegooglethat.com/?q=free+ocr+to+pdf

    geoff

    You've never actually run that search, have you?
    If you did, you'd know all you'll get are advertising shills.
    All of which are online PDF converters which are huge privacy scams.

    You didn't specify 'no online converters'. Do you think this one (found in
    the search) is a privacy scam, or are you worried about potentially
    exposing something very sensitive in the 'online' scenario?

    https://www.adobe.com/acrobat/online/ocr-pdf.html

    geoff

  • From Alan Browne@21:1/5 to Bill Powell on Sun Jul 14 08:15:21 2024
    On 2024-07-13 20:49, Bill Powell wrote:
    I have a series of one-page images that are really images and not text even though they look like they're just a page of simple text in the same font.

    Is there a way to easily OCR a PDF to actual text on Windows for free?

    There are plenty of free online converters - of course you're exposing
    content to a third party. Be mindful of what is in the doc.


    --
    "It would be a measureless disaster if Russian barbarism overlaid
    the culture and independence of the ancient States of Europe."
    Winston Churchill

  • From Geoff Realname@21:1/5 to Bill Powell on Tue Jul 23 12:15:00 2024
    On 14/07/2024 01:49, Bill Powell wrote:
    I have a series of one-page images that are really images and not text even though they look like they're just a page of simple text in the same font.

    Is there a way to easily OCR a PDF to actual text on Windows for free?

    Late to the party here, but what about FreeOCR? http://www.paperfile.net/index.html
    It's a bit ancient, but it certainly works well for my needs. Also,
    though I haven't tried it, IrfanView includes OCR capabilities https://www.irfanview.com/
    --
    I would be unstoppable if I could get started.

  • From Abandoned Trolley@21:1/5 to Geoff Realname on Tue Jul 23 16:42:52 2024
    On 23/07/2024 12:15, Geoff Realname wrote:
    On 14/07/2024 01:49, Bill Powell wrote:
    I have a series of one-page images that are really images and not text even though they look like they're just a page of simple text in the same font.

    Is there a way to easily OCR a PDF to actual text on Windows for free?

    Late to the party here, but what about FreeOCR? http://www.paperfile.net/index.html
    It's a bit ancient, but it certainly works well for my needs. Also,
    though I haven't tried it, IrfanView includes OCR capabilities https://www.irfanview.com/


    As I said about a week ago, there's no clear explanation from the OP of
    what he has and what he wants.
