Raster to Text OCR Command Line

How to convert scanned PDF to searchable PDF?

    Some PDF files are created by sending Office files, images, etc. to an Acrobat-like PDF printer, and others are created by scanning physical paper such as pages of a book or legal documents. Normally, PDF files that consist only of page images cannot be edited, let alone have text extracted from them. This causes problems when you need to reuse the content of a scanned PDF. In this article, I will show you how to convert a scanned PDF to a searchable PDF.

  I use VeryDOC Raster to Text OCR Converter Command Line, which can also convert a PDF to a plain text document saved in TXT format that can be edited freely. More information is available on the software homepage. In the following part, I will show you how to convert a scanned PDF to a searchable PDF. A so-called searchable PDF is a text-based PDF file that allows you to copy and paste text easily.

Step 1. Download Raster to Text OCR Converter Command Line

  • As its name shows, this is command line software. When the download finishes, you will have a zip file. Extract it to a folder, and then you can call the executable file in an MS-DOS command prompt window.
  • This is Windows software. It supports all Windows systems, both 32-bit and 64-bit.

Step 2. Convert scanned PDF to searchable PDF

  • Here is the usage for your reference: pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • When converting scanned PDF to searchable PDF, please refer to the following command line templates.
    pdf2txtocr.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang ita -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang nld -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang spa -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -bitcount 24 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -bitcount 8 -ocrmode 4 -ocr C:\in.pdf C:\out.pd
  • Now let us check the related parameters.
    -ocr                : enable OCR function for scanned PDF file
    -lang <string>      : choose the language for OCR engine
    -ocrmode <int>      : set OCR mode
      -ocrmode 0: output to text file
      -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
      -ocrmode 2: output to plain text based PDF file
      -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
      -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer

This software supports more than 50 OCR languages, so it can convert scanned PDF files in most languages, such as English, French, German, Italian, Czech, Danish, Dutch, Norwegian, Polish, Portuguese, Spanish, Swedish, etc., to searchable PDF files. As the parameters above show, it supports 5 OCR modes (0 to 4), which can help you OCR scanned PDF files more accurately.
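The command line templates above can also be assembled from a script. The following is a minimal Python sketch, not part of the product: the function names `build_ocr_command` and `run_ocr` are hypothetical, and it assumes pdf2txtocr.exe is in the current folder or on the PATH. It only uses the -ocr, -lang and -ocrmode options described above.

```python
import subprocess

# OCR modes as documented in the parameter list above.
OCR_MODES = {
    0: "output to text file",
    1: "insert new text layer under original PDF pages",
    2: "output to plain text based PDF file",
    3: "OCRed PDF file (BW) with hidden text layer",
    4: "OCRed PDF file (Color) with hidden text layer",
}


def build_ocr_command(in_pdf, out_file, lang="eng", ocrmode=1,
                      exe="pdf2txtocr.exe"):
    """Return the argument list for one pdf2txtocr.exe OCR run."""
    if ocrmode not in OCR_MODES:
        raise ValueError(f"ocrmode must be one of {sorted(OCR_MODES)}")
    return [exe, "-ocr", "-lang", lang, "-ocrmode", str(ocrmode),
            in_pdf, out_file]


def run_ocr(in_pdf, out_file, **options):
    """Run the converter (Windows only) and return its exit code.

    The meaning of the exit code is not documented in this article,
    so the caller should also check that the output file exists.
    """
    command = build_ocr_command(in_pdf, out_file, **options)
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Same as: pdf2txtocr.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
    print(build_ocr_command(r"C:\in.pdf", r"C:\out.pdf", lang="deu"))
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.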

There are too many functions in this software to detail here, so please check the readme.txt file for more useful information. If you have any questions while using it, please contact us as soon as possible.

DOC to Any Converter

Failed to call exeshell-x64.dll on 64bit Windows Server 2008 R2 system

Hi.

I am trying to call doc2any.exe from a PHP script using your exeshell class on a 64-bit Windows Server 2008 R2 system. The exeshell-x64.dll was registered successfully, but I get the following error when I start the PHP script:

PHP Fatal error: Uncaught exception 'com_exception' with message 'Failed to create COM object `exeshell.shell': Klasse nicht registriert

[The German text "Klasse nicht registriert" means "class not registered".] What can I do to solve this problem?

Thanks in advance.

Regards
Customer
-------------------------------------
I'm currently evaluating your product and downloaded the latest version from your homepage.
Build: Nov 10 2012

The registration of the exeshell-x64.dll was successful.

Furthermore I activated the php-extension "php_com_dotnet.dll" within the IIS-Manager.

But I still get this error.

Thanks in advance.
Customer
-------------------------------------
I suggest calling doc2any.exe directly from your C#, PHP, ASP.NET, or other code on your Windows 2008 system. You can use CreateProcess() or Process.Start() to launch the doc2any.exe application.
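As an illustration of launching the converter directly as a child process instead of going through the COM object, here is a small sketch. It is written in Python purely for illustration, where `subprocess.run` plays the role of CreateProcess() or Process.Start(). The helper name `run_converter` is hypothetical, and the arguments that doc2any.exe expects are not shown in this thread, so the demonstration starts the Python interpreter itself instead.

```python
import subprocess
import sys


def run_converter(exe_path, args, timeout_seconds=300):
    """Start a command line converter as a child process and wait for it.

    Returns a tuple (exit_code, stdout, stderr).
    """
    completed = subprocess.run(
        [exe_path, *args],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
        check=False,
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Demonstration with the Python interpreter so the sketch runs anywhere.
    # For the real tool, pass the full path of doc2any.exe and the arguments
    # given in the product documentation instead.
    code, out, err = run_converter(sys.executable, ["-c", "print('ok')"])
    print(code, out.strip())
```

Capturing the exit code and the output makes it easier to log what happened when the conversion is started from a web application.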

You also need to set MS Office DCOM to run under an interactive user account instead of the system user account. Please see the following web pages for more information:

https://www.verydoc.com/doc-to-any-faq.html
https://www.verydoc.com/blog/aspnet-account-dcom-permisson-for-ms-word.html
https://www.verydoc.com/blog/microsoft-excel-application-entry-missing-in-dcomcnfg.html
https://www.verydoc.com/blog/how-to-make-iis7-play-nice-with-office-interop.html
https://www.verydoc.com/others/configure-word-and-excel.htm
https://www.verydoc.com/others/configure%20office%20applications%20to%20run%20under%20the%20interactive%20user%20account.htm
http://www.verypdf.com/wordpress/201201/how-to-call-doc2any-exe-or-htmltools-exe-from-a-service-20896.html

You can also find more answers in our Knowledge Base:

https://www.verydoc.com/blog/category/doc-to-any-converter

If you still cannot get it to work, please feel free to let us know, and we will continue to assist you.

VeryDOC

DOC to Any Converter, HTML Converter, HTMLPrint to Any Converter

HTML to PDF Converter, HtmlShell (HTMLConverter method) has different behavior on different systems

Hello,

We are evaluating your HtmlShell component (HTMLConverter method), and in our tests we found that its behavior differs from one operating system to another.
We have a Windows Vista 64 machine on which the component works perfectly.
We have a Windows Vista 32 machine on which the component did not work.

We have 2 machines with Windows 7 32-bit, and the component works on only one of them.

We have a machine with Windows 8 on which the component does not work.

We need to know whether this type of behavior has been reported to you before, because it seems to us that something is missing on the systems where the component does not work.

I look forward to your reply,

Sincerely,

-----------------------------------------------

Original text:

Olá,
Estamos analisando o seu componente HtmlShell (método HTMLConverter) e nos testes verificamos que existe comportamento diferente de um sistema operacional para outro.
Temos um Windows Vista 64 em que o componente funciona perfeitamente.
Temos um Windows Vista 32 em que o componente não funcionou.

Temos 2 máquinas com Windows 7 32 bits e o componente funciona em apenas uma delas.

Temos uma máquina com Windows 8 em que o componente não funciona.

Precisamos saber se vocês já tiveram relatado este tipo de comportamento pois nos parece que está faltando algo nesses sistemas em que o componente não funciona.

Aguardo,

Atenciosamente,

-----------------------------------------------

Yes, HtmlShell (HTMLConverter method) behaves differently on different systems, because it is affected by the screen resolution and the IE version.

If you want the same behavior on all systems, we suggest downloading the following products from our website to try:

docPrint Pro v6.0,
http://www.verypdf.com/app/document-converter/try-and-buy.html
http://www.verypdf.com/artprint/docprint_pro_setup.exe

VeryDOC HTMLPrint to Any Converter,
https://www.verydoc.com/htmlprint-to-any.html
https://www.verydoc.com/htmlprint2any_cmd.zip

These products can all convert HTML files to PDF files. Because they use printing technology to print HTML files to PDF files, you will get the same behavior on all systems.

We suggest downloading the trial versions of the above products from our website. Please feel free to let us know if you encounter any problems.

Remark:

The htmltools.exe application first renders the HTML page to a Windows Metafile (EMF) and then converts the EMF to a PDF file. The appearance of the EMF file may change with the screen resolution; for example, 1024x768, 800x600, and 1600x900 will produce different EMF files.

docPrint Pro v6.0 and “HTMLPrint to Any Converter” use the printing function to create the PDF file, the same as when you print the HTML file from IE manually, so the result is not affected by the screen resolution.

htmltools.exe is very fast for simple HTML files. It does not require any virtual printer, and it is a portable, standalone product. However, if your HTML file contains complicated content such as SVG, Flash, or Java applet elements, docPrint Pro v6.0 and “HTMLPrint to Any Converter” will work better for you.

See Also:

HTML to PDF conversion, which software is better for you?
http://www.verypdf.com/wordpress/201205/html-to-pdf-conversion-which-software-is-better-for-you-27560.html

How to Convert a HTML file or Web Pages to PDF file via Command Line?
http://www.verypdf.com/pdfcamp/convert-html-to-pdf.html

How to convert an Office document (DOC, DOCX, XLS, XLSX, PPT, PPTX, etc.) to PDF file via Command Line?
http://www.verypdf.com/document/convert-office-to-pdf.htm

VeryPDF

Keywords: Compare htmltools and docPrint, Metafile, EMF

Raster to Text OCR Command Line

How to convert raster image to searchable PDF and add basic information?

    If you need to convert an image to a searchable PDF and add basic information, this article will help. The software I will use is VeryDOC Raster to Text OCR Converter Command Line, which can recognize the text in many types of image files. For more information, please check the software homepage. In the following part, I will show you how to use this software.

Step 1. Download Raster to Text OCR Converter Command Line

  • This is command line software, so to make uploading and downloading easier, we have compressed it into a zip file.
  • Once the download finishes, extract it to a folder. Then you can check its contents and call the executable file in an MS-DOS command prompt window.

Step 2. Convert raster image to searchable PDF and add basic information.

  • When you use this software, please refer to the usage and examples in readme.txt.
  • Here is the usage for your reference:  pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • When converting raster image to searchable PDF, please refer to the following command line templates.
    pdf2txtocr.exe -ocrmode 3 -threshold 200 -ocr C:\in.tif C:\out.pdf
    pdf2txtocr.exe -ocrmode 4 -producer VeryDOC C:\in.tif C:\out.pdf
    pdf2txtocr.exe -ocrmode 3 -creator LA C:\in.tif C:\out.pdf
    pdf2txtocr.exe -ocrmode 4 -subject "This is about conversion" C:\in.tif C:\out.pdf
    pdf2txtocr.exe -ocrmode 3 -title VeryDOC C:\in.tif C:\out.pdf
    pdf2txtocr.exe -ocrmode 4 -author ME C:\in.tif C:\out.pdf

With the above command line templates, we can convert an image file to a searchable PDF and add basic information such as title, keywords, subject, author, and others. Here are the parameters for your reference.

-producer <string>  : Set 'producer' to PDF file
-creator <string>   : Set 'creator' to PDF file
-subject <string>   : Set 'subject' to PDF file
-title <string>     : Set 'title' to PDF file
-author <string>    : Set 'author' to PDF file
-keywords <string>  : Set 'keywords' to PDF file

-ocrmode <int>      : set OCR mode
    -ocrmode 0: output to text file
    -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
    -ocrmode 2: output to plain text based PDF file
    -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
    -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
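
When several metadata fields are set at once, it can be easier to build the command in a script. The following Python sketch is only an illustration: the function name `build_metadata_command` is hypothetical, it assumes pdf2txtocr.exe is on the PATH, and it mirrors the templates above by passing only -ocrmode plus the metadata options from the parameter list.

```python
import subprocess

# Metadata options listed in the parameter table above.
METADATA_FLAGS = ("producer", "creator", "subject", "title", "author", "keywords")


def build_metadata_command(in_image, out_pdf, ocrmode=3,
                           exe="pdf2txtocr.exe", **metadata):
    """Return the argument list for converting an image to PDF with metadata."""
    unknown = set(metadata) - set(METADATA_FLAGS)
    if unknown:
        raise ValueError(f"unsupported metadata fields: {sorted(unknown)}")
    command = [exe, "-ocrmode", str(ocrmode)]
    # Keep a fixed option order so the generated command is predictable.
    for name in METADATA_FLAGS:
        if name in metadata:
            command += ["-" + name, metadata[name]]
    command += [in_image, out_pdf]
    return command


if __name__ == "__main__":
    cmd = build_metadata_command(
        r"C:\in.tif", r"C:\out.pdf", ocrmode=4,
        subject="This is about conversion", author="ME",
    )
    # list2cmdline shows how the list is quoted for a Windows command prompt.
    print(subprocess.list2cmdline(cmd))
```

Values that contain spaces, such as the subject, are quoted automatically when the command is passed as a list.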

The input can be any of the following raster image formats: scanned JPEG, PNG, BMP, GIF, PCX, TGA, PBM, PNM, PPM, TIFF files and so on. With this software, you can also deskew and rotate raster images and then convert them to PDF. When converting them to PDF, you can also set a password to protect the PDF.

There are too many functions to list here. Please check the website for more details. If you have any questions while using the software, please contact us as soon as possible.

Raster to Text OCR Command Line

Convert scan to text through OCR technology

   Scanning paper documents to images makes them easy to upload and transfer. But there is one problem: it is quite hard to extract text from a scan file, so it is hard to get information out of it. If the scan file has only one page, we can retype its words as text. However, if there are thousands of pages, retyping is no longer practical. In this article, I will show you how to convert scans to text through OCR technology.

  The software I use is VeryDOC Raster to Text OCR Converter Command Line. With it, we can convert scan files in English, French, German, Italian, Czech, Danish, Dutch, Norwegian, Polish, Portuguese, Spanish, and Swedish to text. In the following part, I will show you how to use this software.

Step 1. Download Raster to Text OCR Converter Command Line

  • On the website, there are two licenses: server version and developer version. If you just use this software on a single computer, laptop, or server and do not use it for development, simply choose the server version.
  • When the download finishes, you will have a zip file. Extract it to a folder, and then you can call the executable file in an MS-DOS command prompt window.

Step 2. Convert scan to text.

  • When using this software, please refer to the usage and examples.
  • Here is the usage for your reference: pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • Here are some examples for your reference. You can save scans in any of the following formats: TIFF, JPG, PNG, BMP, GIF, PCX, TGA, JP2, PNM and MNG.
    pdf2txtocr.exe C:\in.tif C:\out.txt
    pdf2txtocr.exe C:\in.jpg C:\out.txt
    pdf2txtocr.exe C:\in.bmp C:\out.txt
    pdf2txtocr.exe C:\in.png C:\out.txt
    When converting these scan files to text, simply give the full path of the scan file followed by the full path of the output text file. This way, you can convert a scan file to text directly.

  • When converting a TIFF file in a language other than English, please refer to the following command line template.
    pdf2txtocr.exe -lang deu C:\in.tif C:\out.txt
    Add the -lang parameter followed by the corresponding language code. This software supports more than 50 OCR languages such as French, German, Italian, Czech, Danish, Dutch, Norwegian, Polish, Portuguese, Spanish, Swedish, etc., but you need to download the corresponding language package from the website. Please use the right language code, for example:

    Bulgarian    bul.zip
    Catalan      cat.zip
    Czech        ces.zip
    German       deu.zip
    Greek        ell.zip
    English      eng.zip
    Finnish      fin.zip
    French       fra.zip
    Hungarian    hun.zip
    Indonesian   ind.zip
    Italian      ita.zip
    Latvian      lav.zip
    Lithuanian   lit.zip
    Dutch        nld.zip
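
For a folder with many scan files, the same command can be generated once per file. The following Python sketch is an illustration under stated assumptions: the function name `build_batch_commands` is hypothetical, pdf2txtocr.exe is assumed to be on the PATH, only the four image extensions from the examples above are picked up, and each text file is written next to its image.

```python
from pathlib import Path

# Language codes taken from the package list above.
LANGUAGE_CODES = {
    "bul": "Bulgarian", "cat": "Catalan", "ces": "Czech", "deu": "German",
    "ell": "Greek", "eng": "English", "fin": "Finnish", "fra": "French",
    "hun": "Hungarian", "ind": "Indonesian", "ita": "Italian",
    "lav": "Latvian", "lit": "Lithuanian", "nld": "Dutch",
}

# Extensions used in the command line examples above.
IMAGE_SUFFIXES = {".tif", ".jpg", ".bmp", ".png"}


def build_batch_commands(folder, lang="eng", exe="pdf2txtocr.exe"):
    """Return one pdf2txtocr.exe argument list per image file in folder."""
    if lang not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {lang}")
    commands = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # Write in.tif -> in.txt in the same folder.
            text_file = image.with_suffix(".txt")
            commands.append([exe, "-lang", lang, str(image), str(text_file)])
    return commands


if __name__ == "__main__":
    for command in build_batch_commands(".", lang="deu"):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run` on a Windows machine where the converter and the matching language package are installed. Note that two images with the same base name, such as in.tif and in.jpg, would write to the same text file.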

So this software will be a really helpful assistant when you need to extract text from scan files. It has more parameters than I can list here. If you have any questions while using it, please contact us as soon as possible.