Raster to Text OCR Command Line

How to convert a multi-page PDF to text and insert a page break symbol?

     In this article, I will show you how to convert a multi-page PDF to text and insert a page break symbol from the command line. The software I will use is VeryDOC Raster to Text OCR Converter Command Line, which can recognize text in many languages and save it as plain text. You can find more information on the homepage; in the following part, I will show you how to use this software.

Step 1. Download Raster to Text OCR Converter Command Line 

  • Although this software is named Raster to Text Converter, it accepts many input formats, such as scanned PDF, text based PDF, TIFF, JPG, PNG, BMP, GIF, PCX, TGA and others. So it can convert both kinds of PDF file to plain text and insert page break symbols.
  • When the download finishes, you will have a zip file. Please extract it to a folder; then you can call the executable file from an MS-DOS window.

Step 2. Convert multi-page PDF to text and insert page break symbol.

  • When you use this software, please follow its usage rules and the example templates below.
  • Here is the usage for your reference:  pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • When converting a text based PDF to text and inserting page breaks, please refer to the following command line templates.
  • pdf2txtocr.exe C:\in.pdf C:\out.txt
    With this simple command line, we can convert PDF to text; page breaks are inserted automatically.
    pdf2txtocr.exe -firstpage 1 -lastpage 1 C:\in.pdf C:\out.txt
    With this command line, we can convert PDF to text and choose the page range to convert.
    pdf2txtocr.exe -ownerpwd 123 -userpwd 456 C:\in.pdf C:\out.txt
    With this command line, we can convert a password protected PDF file to text and insert page breaks.
    pdf2txtocr.exe -layout C:\in.pdf C:\out.txt
    With this command line, we can convert PDF to text and maintain the original layout.
    Do not be surprised that none of the above command lines uses a page break parameter: this software inserts page breaks automatically when converting PDF to text. When you do not need page breaks, please add this parameter:
    -noc                : don't insert page breaks 0x0C between pages in text file.
    The above command lines can only be used to convert text based PDF to text and insert page breaks.

  • When converting an image based PDF to text and inserting page breaks, please refer to the following command line templates.
  • pdf2txtocr.exe -ocr -lang eng -ocrmode 0 C:\in.pdf C:\out.txt
    pdf2txtocr.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
    You need to add the parameter -ocr to enable the OCR function so that the conversion can run successfully.
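
Since the page break symbol is the 0x0C (form feed) character written between pages, the output text can be split back into pages afterwards. The following Python sketch shows one way to do that. The function name split_pages and the sample text are made up for illustration, and whether the converter writes a form feed after the last page is not stated here, so the code tolerates both cases.

```python
from __future__ import annotations

FORM_FEED = "\x0c"  # the 0x0C page break character described above


def split_pages(text: str) -> list[str]:
    """Split converter output into one string per page (hypothetical helper)."""
    pages = text.split(FORM_FEED)
    # A form feed after the last page would leave one empty final entry; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


if __name__ == "__main__":
    # Made-up sample standing in for the contents of C:\out.txt.
    sample = "first page\n\x0csecond page\n\x0c"
    for number, page in enumerate(split_pages(sample), start=1):
        print(f"--- page {number} ---")
        print(page.rstrip())
```

Note that Python's str.splitlines() also treats 0x0C as a line boundary, so splitting on the form feed first, as above, keeps the page structure intact.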

There are more examples and parameters in readme.txt than I can list here, so please check it for details. If you have any questions while using the software, please contact us as soon as possible.

Raster to Text OCR Command Line

How to convert scanned PDF to text and keep original layout?

To extract content from a scanned PDF file, we sometimes need to convert the scanned PDF to text. And in order to find the corresponding words in the output text file, we also want to keep the original layout during conversion. VeryDOC Raster to Text OCR Converter Command Line has this function. With this software, you can also convert an encrypted PDF document to text by supplying the user password or owner password. For more information, please check the software's homepage; in the following part, I will show you how to use this software.

Step 1. Download Raster to Text OCR Converter Command Line

  • This software is offered on the website under two licenses: Server License and Developer License. Under one Server License, you can use the corresponding SOFTWARE on exactly one server computer that offers service to clients. If the SOFTWARE contains source code, you have the right to modify and reuse the code under the Server License. Under one Developer License, you can integrate the corresponding SOFTWARE into software you develop and redistribute it royalty-free. If the SOFTWARE contains source code, you have the right to modify and reuse the code under the Developer License.
  • When the download finishes, you will have a zip file. Please extract it to a folder; then you can find the executable file and call it from an MS-DOS window.

Step 2. Convert scanned PDF to text and keep original layout.

  • When you use this software, please refer to the usage and examples.
  • Here is the usage for your reference:   pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • When converting scanned PDF to text, please refer to the following command line templates.
    pdf2txtocr.exe -ocr -lang eng -layout C:\in.pdf C:\out.txt
    With this command line, we can convert an English scanned PDF file to text and keep the original layout and formatting.
    pdf2txtocr.exe -ocr -bitcount 1 C:\in.pdf C:\out.txt
    With this command line, we can convert scanned PDF to text and set the color depth used when rendering PDF pages to images (here 1-bit).
    pdf2txtocr.exe -ocr -bitcount 8 C:\in.pdf C:\out.txt
    pdf2txtocr.exe -ocr -bitcount 24 C:\in.pdf C:\out.txt
    These two command line templates work the same way as the one above, with 8-bit and 24-bit color depth.
    pdf2txtocr.exe -ocr -lang deu C:\in.pdf C:\out.txt
    With this command line, we can convert a German scanned PDF to text.
    pdf2txtocr.exe -text "PageText %PageNumber% of %PageCount%" C:\in.pdf C:\out.txt
    With this command line, we can convert PDF to text and add the page number at the end of each page in the output text file.
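
If you launch the converter from a script, the templates above can be assembled programmatically. Here is a minimal Python sketch, assuming pdf2txtocr.exe is on the PATH. The function name build_ocr_command and its keyword arguments are hypothetical; only the flags shown in the templates above are used.

```python
from __future__ import annotations

import subprocess

ALLOWED_BITCOUNTS = {1, 8, 24}  # color depths used in the templates above


def build_ocr_command(pdf_path: str, txt_path: str, lang: str | None = None,
                      layout: bool = False, bitcount: int | None = None,
                      exe: str = "pdf2txtocr.exe") -> list[str]:
    """Assemble an argument list for OCR conversion (hypothetical helper)."""
    if bitcount is not None and bitcount not in ALLOWED_BITCOUNTS:
        raise ValueError(f"bitcount must be one of {sorted(ALLOWED_BITCOUNTS)}")
    cmd = [exe, "-ocr"]
    if lang:
        cmd += ["-lang", lang]
    if layout:
        cmd.append("-layout")
    if bitcount is not None:
        cmd += ["-bitcount", str(bitcount)]
    cmd += [pdf_path, txt_path]
    return cmd


def run(cmd: list[str]) -> None:
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_ocr_command(r"C:\in.pdf", r"C:\out.txt", lang="eng", layout=True)
    print(" ".join(cmd))  # inspect the command before calling run(cmd)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.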

Now let us check the related parameters.
-layout             : maintain original physical layout
-bitcount <int>     : set color depth when rendering PDF pages to image data; it can be 1, 8 or 24, default is 8-bit
-text <string>      : add additional text at the end of each text page; this parameter supports the following variables:
    %PageNumber%: current page number
    %PageCount% : total page count of PDF file
-ocr                : enable OCR function for scanned PDF file
-lang <string>      : choose the language for OCR engine
-ocrmode <int>      : set OCR mode
    -ocrmode 0: output to text file
    -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
    -ocrmode 2: output to plain text based PDF file
    -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
    -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
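
To make the -text variables concrete, the Python sketch below imitates the substitution one would expect from the descriptions above. It is only an illustration: the converter performs the replacement itself, and the function name expand_page_text is made up.

```python
def expand_page_text(template: str, page_number: int, page_count: int) -> str:
    """Imitate how %PageNumber% and %PageCount% are presumably filled in."""
    return (template
            .replace("%PageNumber%", str(page_number))
            .replace("%PageCount%", str(page_count)))


if __name__ == "__main__":
    template = "PageText %PageNumber% of %PageCount%"
    for page in range(1, 4):
        print(expand_page_text(template, page, 3))
```

One practical point: inside a Windows batch file, %PageNumber% is read as an environment variable reference, so the percent signs need to be doubled (%%PageNumber%%) there. When the program is started with an argument list and no shell, the text is passed through unchanged.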

With these examples and parameters, you can convert scanned PDF to text easily. If you have any questions while using the software, please contact us as soon as possible.

Raster to Text OCR Command Line

How to convert scanned PDF to searchable PDF?

    Some PDF files are created by sending Office files, images, etc. to an Acrobat-like PDF printer, and others are created by scanning physical paper such as pages of a book or legal documents. Normally, PDF files that contain only page images cannot be edited, let alone have text extracted from them. This causes problems when you need to reuse the content of a scanned PDF. In this article, I will show you how to convert scanned PDF to searchable PDF.

  I use VeryDOC Raster to Text OCR Converter Command Line, which can also convert PDF to a plain text document and save it in TXT format, which can be edited freely. You can find more information on the homepage; in the following part, I will show you how to convert scanned PDF to searchable PDF. A searchable PDF is a text based PDF file, which allows you to search, copy and paste its text easily.

Step 1. Download Raster to Text OCR Converter Command Line

  • As its name shows, this is a command line application. When the download finishes, you will have a zip file. Extract it to a folder; then you can call the executable file from an MS-DOS window.
  • This is Windows software; it supports all Windows systems, both 32-bit and 64-bit.

Step 2. Convert scanned PDF to searchable PDF

  • Here is the usage for your reference: pdf2txtocr.exe [options] <PDF-file> <Text-file>
  • When converting scanned PDF to searchable PDF, please refer to the following command line templates.
    pdf2txtocr.exe -ocr -lang deu -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang eng -ocrmode 2 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang eng -ocrmode 3 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang eng -ocrmode 2 -outboxfile C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang fra -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang ita -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang nld -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -ocr -lang spa -ocrmode 1 C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -bitcount 24 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
    pdf2txtocr.exe -bitcount 8 -ocrmode 4 -ocr C:\in.pdf C:\out.pdf
  • Now let us check the related parameters.
    -ocr                : enable OCR function for scanned PDF file
    -lang <string>      : choose the language for OCR engine
    -ocrmode <int>      : set OCR mode
      -ocrmode 0: output to text file
      -ocrmode 1: OCR PDF pages and insert new text layer under original PDF pages
      -ocrmode 2: output to plain text based PDF file
      -ocrmode 3: output to OCRed PDF file (BW) with hidden text layer
      -ocrmode 4: output to OCRed PDF file (Color) with hidden text layer
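
Because the OCR mode decides what kind of file is written, a wrapper script can pick the output extension from the mode. The Python sketch below assumes mode 0 pairs with a .txt file and modes 1 to 4 with a .pdf file, which is inferred from the example command lines rather than stated by the tool. The names OCR_MODES and build_mode_command are hypothetical.

```python
from __future__ import annotations

from pathlib import PureWindowsPath

# Mode descriptions are taken from the parameter list above; the extension
# pairing is an assumption based on the example command lines.
OCR_MODES = {
    0: ("output to text file", ".txt"),
    1: ("insert new text layer under original PDF pages", ".pdf"),
    2: ("output to plain text based PDF file", ".pdf"),
    3: ("OCRed PDF file (BW) with hidden text layer", ".pdf"),
    4: ("OCRed PDF file (Color) with hidden text layer", ".pdf"),
}


def build_mode_command(pdf_in: str, out_path: str, lang: str, ocrmode: int,
                       exe: str = "pdf2txtocr.exe") -> list[str]:
    """Build an argument list whose output extension matches the OCR mode."""
    if ocrmode not in OCR_MODES:
        raise ValueError(f"ocrmode must be one of {sorted(OCR_MODES)}")
    _, extension = OCR_MODES[ocrmode]
    # Replace (or add) the extension so it agrees with the chosen mode.
    out_file = str(PureWindowsPath(out_path).with_suffix(extension))
    return [exe, "-ocr", "-lang", lang, "-ocrmode", str(ocrmode), pdf_in, out_file]


if __name__ == "__main__":
    for mode in sorted(OCR_MODES):
        print(" ".join(build_mode_command(r"C:\in.pdf", r"C:\out.txt", "eng", mode)))
```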

This software supports more than 50 OCR languages, so it can convert scanned PDF files in most languages, such as English, French, German, Italian, Czech, Danish, Dutch, Norwegian, Polish, Portuguese, Spanish and Swedish, to searchable PDF files. As the parameters above show, the software supports 5 OCR modes, so you can choose the kind of output that fits your needs.

There are too many functions in this software to detail here, so please check the readme.txt file for more useful information. If you have any questions while using the software, please contact us as soon as possible.

DOC to Any Converter

Failed to call exeshell-x64.dll on 64bit Windows Server 2008 R2 system

Hi.

I am trying to call doc2any.exe from a PHP script using your exeshell class on a 64-bit Windows Server 2008 R2 system. The exeshell-x64.dll was registered successfully, but I get the following error when I start the PHP script: PHP Fatal error: Uncaught exception 'com_exception' with message 'Failed to create COM object `exeshell.shell': Klasse nicht registriert (German for "Class not registered"). What can I do to solve this problem?

Thanks in advance.

Regards
Customer
-------------------------------------
I'm currently evaluating your product and downloaded the latest version from your homepage.
Build: Nov 10 2012

The registration of the exeshell-x64.dll was successful.

Furthermore I activated the php-extension "php_com_dotnet.dll" within the IIS-Manager.

But I still get this error.

Thanks in advance.
Customer
-------------------------------------
I suggest calling doc2any.exe directly from your C#, PHP, ASP.NET or other code on your Windows 2008 system; you can use CreateProcess() or Process.Start() to launch the doc2any.exe application.
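
For illustration, here is the same idea sketched in Python: start the executable as a child process and capture its exit code and output, with no COM object involved. The function name run_converter is hypothetical, and because the arguments of doc2any.exe are not listed in this post, the demo launches the Python interpreter as a stand-in.

```python
from __future__ import annotations

import subprocess
import sys


def run_converter(exe_path: str, args: list[str],
                  timeout: float = 300) -> tuple[int, str, str]:
    """Start an executable directly and return (exit code, stdout, stderr)."""
    completed = subprocess.run(
        [exe_path, *args],
        capture_output=True,  # collect output instead of printing it
        text=True,            # decode output to str
        timeout=timeout,      # avoid hanging forever if the tool stalls
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Stand-in command; replace with the path to doc2any.exe and the
    # arguments from its own documentation.
    code, out, err = run_converter(sys.executable, ["-c", "print('ok')"])
    print(code, out.strip())
```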

You also need to set MS Office DCOM to run under an interactive user account instead of the system user account. Please look at the following web pages for more information,

https://www.verydoc.com/doc-to-any-faq.html
https://www.verydoc.com/blog/aspnet-account-dcom-permisson-for-ms-word.html
https://www.verydoc.com/blog/microsoft-excel-application-entry-missing-in-dcomcnfg.html
https://www.verydoc.com/blog/how-to-make-iis7-play-nice-with-office-interop.html
https://www.verydoc.com/others/configure-word-and-excel.htm
https://www.verydoc.com/others/configure%20office%20applications%20to%20run%20under%20the%20interactive%20user%20account.htm
http://www.verypdf.com/wordpress/201201/how-to-call-doc2any-exe-or-htmltools-exe-from-a-service-20896.html

You can also find more answers in our Knowledge Base,

https://www.verydoc.com/blog/category/doc-to-any-converter

If you still cannot get it to work, please feel free to let us know and we will continue to assist you.

VeryDOC

DOC to Any Converter, HTML Converter, HTMLPrint to Any Converter

HTML to PDF Converter, HtmlShell (HTMLConverter method) has different behavior on different systems

Hello,

We are analyzing your component HtmlShell (HTMLConverter method), and in our tests we found that its behavior differs from one operating system to another.
We have a Windows Vista 64 machine on which the component works perfectly.
We have a Windows Vista 32 machine on which the component did not work.

We have 2 machines with Windows 7 32-bit, and the component works on only one of them.

We have a machine with Windows 8 on which the component does not work.

We need to know whether this type of behavior has been reported to you before, because it seems to us that something is missing on the systems where the component does not work.

I await your reply,

Sincerely,

-----------------------------------------------

Original text:

Olá,
Estamos analisando o seu componente HtmlShell (método HTMLConverter) e nos testes verificamos que existe comportamento diferente de um sistema operacional para outro.
Temos um Windows Vista 64 em que o componente funciona perfeitamente.
Temos um Windows Vista 32 em que o componente não funcionou.

Temos 2 máquinas com Windows 7 32 bits e o componente funciona em apenas uma delas.

Temos uma máquina com Windows 8 em que o componente não funciona.

Precisamos saber se vocês já tiveram relatado este tipo de comportamento pois nos parece que está faltando algo nesses sistemas em que o componente não funciona.

Aguardo,

Atenciosamente,

-----------------------------------------------

Yes, HtmlShell (HTMLConverter method) behaves differently on different systems, because it is affected by the screen resolution and the IE version.

If you wish to get the same behavior on all systems, we suggest you download the following products from our website and try them,

docPrint Pro v6.0,
http://www.verypdf.com/app/document-converter/try-and-buy.html
http://www.verypdf.com/artprint/docprint_pro_setup.exe

VeryDOC HTMLPrint to Any Converter,
https://www.verydoc.com/htmlprint-to-any.html
https://www.verydoc.com/htmlprint2any_cmd.zip

These products can all convert HTML files to PDF files. Because they use printing technology to print HTML files to PDF files, you will get the same behavior on all systems.

We suggest you download the trial versions of the above products from our website. Please feel free to let us know if you encounter any problem.

Remark:

The htmltools.exe application first renders the HTML page to a Windows Metafile (EMF), and then converts the EMF to a PDF file. The appearance of the EMF file may change with the screen resolution; for example, resolutions of 1024x768, 800x600 and 1600x900 will produce different EMF files.

docPrint Pro v6.0 and “HTMLPrint to Any Converter” use the printing function to create the PDF file. This is the same as printing the HTML file from IE manually, so the result is not affected by the screen resolution.

htmltools.exe is very fast for simple HTML files. It does not require any virtual printer, and it is a portable, standalone product. But if your HTML file contains complicated content, such as SVG, Flash or Java applet elements, docPrint Pro v6.0 and “HTMLPrint to Any Converter” will work better for you.

See Also:

HTML to PDF conversion, which software is better for you?
http://www.verypdf.com/wordpress/201205/html-to-pdf-conversion-which-software-is-better-for-you-27560.html

How to Convert a HTML file or Web Pages to PDF file via Command Line?
http://www.verypdf.com/pdfcamp/convert-html-to-pdf.html

How to convert an Office document (DOC, DOCX, XLS, XLSX, PPT, PPTX, etc.) to PDF file via Command Line?
http://www.verypdf.com/document/convert-office-to-pdf.htm

VeryPDF

Keywords: Compare htmltools and docPrint, Metafile, EMF
