
[Recommended] ClearScan, the new OCR option in Acrobat 9

Posted on 2010-2-10 09:08:41
Better PDF OCR. ClearScan is smaller, looks better
Optical Character Recognition (OCR) converts scanned paper documents into searchable PDF documents. This technology has been available in Acrobat for about ten years.
While OCR accuracy and language support have improved over the years, the default OCR "flavor", Searchable Image, was the only useful choice.
Searchable Image retains the underlying scanned image and adds an invisible layer of text on top, which can be selected.

Searchable Image OCR has some shortcomings:
    [li]File Size
    For 300 dpi black and white scans, a typical file size is 15-40K per page. Scanning at a higher resolution (600 dpi vs. 300 dpi) increases file size about three to four times. [/li][li]Print Speed
    Because of the image-heavy content, searchable image PDFs can take a long time to print. [/li][li]Visual Quality
    At 300 dpi, scanned documents are easily distinguishable in quality from computer-generated files. [/li]
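The growth in file size with resolution follows from simple pixel arithmetic: doubling the dpi quadruples the number of pixels on the page, and the compressed scan usually grows by somewhat less than that. A minimal sketch, assuming a US Letter page and a 1-bit black and white scan (the article does not state the page size, and `raw_bitmap_bytes` is a name made up for this example):

```python
# Rough arithmetic: raw size of an uncompressed 1-bit scan at a given resolution.
# Assumption: US Letter page (8.5 x 11 in); the article does not state the page size.

def raw_bitmap_bytes(dpi, width_in=8.5, height_in=11.0):
    """Bytes needed for an uncompressed 1-bit-per-pixel scan of one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px // 8  # 8 pixels per byte at 1 bit per pixel

size_300 = raw_bitmap_bytes(300)
size_600 = raw_bitmap_bytes(600)
ratio = size_600 / size_300

print(f"300 dpi: {size_300} bytes, 600 dpi: {size_600} bytes, ratio: {ratio}")
```

The raw bitmap is about 1 MB per page at 300 dpi, far larger than the 15-40K quoted above, so most of the saving already comes from image compression inside the PDF.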
In Acrobat 9, Adobe engineers added a new flavor of OCR called ClearScan. ClearScan offers improved text quality with a decrease in file size:

I've recently completed some benchmarking which shows dramatic file size decreases and quality gains. Read on to learn about size comparisons, how to use ClearScan OCR and a bit more about how it all works.

Testing Methodology
I created two test documents:
    [li]78-page image-only PDF document scanned at 300 dpi [/li][li]78-page image-only PDF document scanned at 600 dpi [/li]
I ran OCR and compared file sizes on my ThinkPad W500. The test machine ran Vista Enterprise in 32-bit mode and has 4GB of RAM. In addition to Acrobat, I also had Excel running. The W500 is a current model laptop which runs an Intel Core 2 Duo CPU at 2.8 GHz. The test machine has an IBM standard 320GB laptop hard drive running at 7200 rpm.
Visual Results and Total File Size

Timing and File Sizes

300 DPI Test | 78 Pages

Document | File Size | Size per Page (K) | Searchable | Process Time (sec) | Seconds per Page
300 dpi Image Only PDF | 1.13 MB | 15.1 | No | n/a | n/a
300 dpi Searchable Image | 1.24 MB | 15.90 | Yes | 84 | 1.09
300 dpi ClearScan | 401K | 5.14 | Yes | 85 | 1.09


600 DPI Test | 78 Pages

Document | File Size | Size per Page (K) | Searchable | Process Time (sec) | Seconds per Page
600 dpi Image Only PDF | 3.44 MB | 44.10 | No | n/a | n/a
600 dpi Searchable Image | 3.52 MB | 45.13 | Yes | 319 | 4.10
600 dpi ClearScan | 477K | 6.12 | Yes | 320 | 4.10
Note: Numbers rounded.


At 300 dpi, ClearScan offered improved visual quality at about one-third the total file size. At 600 dpi, the ClearScan file was seven times smaller and looked better.
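The ratios quoted above can be rechecked from the table figures. A small sketch, assuming 1 MB = 1024 KB (the article does not say which convention it uses, so the results are approximate):

```python
# File sizes taken from the tables above, in KB.
# Assumption: 1 MB = 1024 KB; the article does not state the convention used.
PAGES = 78
sizes_kb = {
    ("300 dpi", "Searchable Image"): 1.24 * 1024,
    ("300 dpi", "ClearScan"): 401,
    ("600 dpi", "Searchable Image"): 3.52 * 1024,
    ("600 dpi", "ClearScan"): 477,
}

def reduction(dpi):
    """How many times smaller the ClearScan file is than the Searchable Image file."""
    return sizes_kb[(dpi, "Searchable Image")] / sizes_kb[(dpi, "ClearScan")]

for dpi in ("300 dpi", "600 dpi"):
    per_page = sizes_kb[(dpi, "ClearScan")] / PAGES
    print(f"{dpi}: ClearScan is {reduction(dpi):.1f}x smaller, {per_page:.2f} KB per page")
```

With this conversion the ClearScan file comes out roughly 3.2 times smaller at 300 dpi and roughly 7.6 times smaller at 600 dpi, in line with the "one-third" and "seven times" summary.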
How does ClearScan work?
ClearScan works by turning the images which represent text characters on the page into smoothed vector outlines. Each character on the page is compared and all matching characters are replaced with an outline character:
[Figure: Original vs. ClearScan, 800% view in Acrobat, 300 dpi scan]

ClearScan does not replace the font with your system fonts. Rather, a custom font is created to match the visual appearance of the pixels.
In fact, if you run ClearScan OCR, choose File > Document Properties and click on the Fonts tab, you'll see that custom fonts have been created.

Besides better visual appearance, print time is reduced. Rather than sending large images to the printer, Acrobat can send the compact font information.
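The idea of replacing all matching characters with one shared shape can be illustrated with a toy example. This is not Adobe's algorithm: the sketch below merges only bitmaps that are exactly identical and keeps them as bitmaps, whereas ClearScan has to tolerate scanner noise and produces smoothed vector outlines. The function name and the glyph shapes are made up for illustration.

```python
# Toy illustration of glyph sharing. NOT Adobe's algorithm: only exactly
# identical bitmaps are merged here, and no outlines are traced.

def build_shared_font(page_glyphs):
    """Store each distinct glyph bitmap once and reference it by index.

    page_glyphs: glyph bitmaps in reading order, each a tuple of row strings.
    Returns (font, text_run): font lists every distinct bitmap once;
    text_run holds one index into font per glyph on the page.
    """
    font = []
    slot_of = {}
    text_run = []
    for bitmap in page_glyphs:
        if bitmap not in slot_of:
            slot_of[bitmap] = len(font)
            font.append(bitmap)
        text_run.append(slot_of[bitmap])
    return font, text_run

# Two hypothetical 3x4 glyph shapes, repeated across a "page".
A = ("010", "101", "111", "101")
B = ("110", "101", "110", "111")
page = [A, B, A, A, B]

font, text_run = build_shared_font(page)
print(len(font), text_run)
```

Five glyphs on the page collapse to two stored shapes plus a list of small indices, which gives a feel for why repeated characters make the file smaller and why longer documents benefit most.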
How can I try ClearScan OCR?
ClearScan OCR is not the default in Acrobat 9, so you'll need to change a setting to use it. Here's how.
    [li]Choose: Document > OCR Text Recognition > Recognize Text using OCR . . . [/li][li]Click the Edit . . . button in the OCR window.

    [/li][li]Change the PDF Output Style to ClearScan.

    [/li][li]Click OK twice to OCR the document. [/li]
Note: The setting is "sticky" for future sessions.

ClearScan Q&A
Here are a few answers to the most common questions about ClearScan OCR.
    [li]Is OCR accuracy any different between ClearScan and Searchable Image styles?
    No. The accuracy will be identical for input files of the same dpi. However, since ClearScan files are so much smaller, you might consider using a 600 dpi input file as a starting point, since there is little downside other than processing time. [/li][li]Can I make changes to the text in a ClearScan file?
    No. The Touchup Text Tool does not currently work on ClearScan files. [/li][li]Are ClearScan files admissible in court?
    To our knowledge, there has never been a challenge. ClearScan files are generally an accurate representation of the original document. [/li][li]What fonts does ClearScan use?
    ClearScan creates a custom font to match the character shape. It does not rely on system fonts or any other font that may be installed on your system. [/li][li]Are ClearScan files always smaller?
    For single page typical legal documents, you may not see much difference in file size. Multiple page documents will show the greatest file size reduction. Documents which vary greatly in the type and number of fonts will show less return on file size. [/li][li]Is ClearScan slower than Searchable Image OCR?
    No. The speed is virtually identical. [/li][li]What is the benefit of scanning at 600 dpi then using ClearScan?
    Since ClearScan bases its shapes on the original image, a 600 dpi input file yields better looking text than one scanned at 300 dpi:
    [Figure: ClearScan output at 800% view, from a 300 dpi input file and from a 600 dpi input file]
    [/li][li]Can I re-OCR a Searchable Image file to turn it into a ClearScan file?
    Yes. [/li][li]Can I turn a ClearScan file into a Searchable Image file?
    No. This will trigger a "renderable text" error. You could export it to TIFF, reassemble the pages and then run OCR, or print it to an image file. [/li]
Posted by Rick Borstein at 7:43 AM on May 5, 2009

http://blogs.adobe.com/acrolaw/2009/05/better_pdf_ocr_clearscan_is_smal.html

Reason: it looks really good!

Posted on 2010-4-4 15:09:07
But it doesn't support Chinese OCR, does it?

Posted on 2010-4-4 21:59:42
Quoting post #1 by lidaweildw, posted 2010-04-04 15:09:
But it doesn't support Chinese OCR, does it?


Acrobat 9 does support Chinese OCR.

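Posted on 2010-4-4 22:04:59
Pages processed with ClearScan still look quite different from the originals, and for pages with a lot of formulas and symbols I would still worry about distortion.
Searchable Image is still somewhat better.
Also, after ClearScan processing, pages that originally had a transparent background sometimes show patches of white box-shaped background, possibly because Adobe's OCR engine treats those areas as blocks it cannot OCR.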