PDF Data Extraction In Linux – noi3.org

This is a tip sent by WebUpd8 reader Stone Cut, on extracting images and text from PDF files. It's different from his previous tip and useful for other cases.
Firstly, install the necessary utilities:

– Ubuntu:

 sudo apt-get install poppler-utils

– Fedora (on older releases, use yum instead of dnf):

 sudo dnf install poppler-utils

For other Linux distributions, search for poppler-utils in your package manager.
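Before going further, it can help to confirm that the tools really ended up on your PATH. The snippet below is only a small sketch: the function name check_tools is made up for this example and is not part of poppler-utils.

```shell
# check_tools is a hypothetical helper name, not something poppler-utils provides.
# It reports every command from its argument list that cannot be found
# and returns 0 only when all of them are available.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

You could then run something like "check_tools pdfimages pdftotext || echo 'install poppler-utils first'".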

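This command extracts all the images from "pdffile.pdf" into the /home/<username>/pdfimages/ directory, which must already exist (create it with "mkdir -p ~/pdfimages" if needed). The last argument is a file name prefix rather than a directory, so the images are saved as ~/pdfimages/image-000.jpg, ~/pdfimages/image-001.jpg and so on:

 pdfimages -j pdffile.pdf ~/pdfimages/image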

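Without the "-j" parameter, pdfimages saves every image in PPM format (or PBM for monochrome images). With "-j", images that are stored as JPEG inside the PDF are written out as JPEG files; any other images are still saved as PPM or PBM.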
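If you extract images often, the steps can be wrapped in a small shell function. This is a rough sketch rather than part of the original tip: the name extract_images, the default output directory and the choice of the PDF's base name as the file prefix are all assumptions made for illustration.

```shell
# extract_images is a hypothetical wrapper around pdfimages.
#   $1 = PDF file
#   $2 = output directory (assumed default: ~/pdfimages)
extract_images() {
    pdf="$1"
    outdir="${2:-$HOME/pdfimages}"

    # pdfimages does not create the output directory itself
    mkdir -p "$outdir" || return 1

    # The last argument of pdfimages is a file name prefix, so use the
    # PDF's base name: report.pdf -> <outdir>/report-000.jpg, ...
    prefix="$outdir/$(basename "$pdf" .pdf)"

    pdfimages -j "$pdf" "$prefix"
}
```

For example, "extract_images pdffile.pdf" would write its files into ~/pdfimages/ with names starting with "pdffile-".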

The advantage of pdfimages is that it extracts the original images exactly as they are embedded in the PDF. For example, I extracted the images from a PDF made by our local kindergarten so I could use some of them for an invitation, and I was quite surprised to find that one extracted image was much larger and showed much more of the photo than the version visible on the page, where parts of it were masked by the rest of the layout. Interesting and very useful.

This command extracts the text and saves it in a file with the same name as the PDF but with a TXT extension (pdffile.txt), in the same directory as the source file:

 pdftotext pdffile.pdf
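To convert a whole folder of PDFs in one go, a loop along these lines should do. Treat it as a sketch: the function name pdfs_to_text is invented for this example, and the "-layout" option (which asks pdftotext to keep the page layout as far as possible) is my own addition that you can drop.

```shell
# pdfs_to_text is a hypothetical helper: it converts every *.pdf in a
# directory (default: the current one) to a .txt file next to it.
pdfs_to_text() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs
        [ -e "$pdf" ] || continue

        # report.pdf -> report.txt, preserving the page layout
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}
```

Calling "pdfs_to_text ~/Documents" would then leave one TXT file beside each PDF in that folder.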

Please note that this command only extracts real text. If your PDF contains images with text printed on them, it won't pick that text up. For those files, please refer to my older tip: How To Extract All Text From PDFs (Including Text In Images).
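One quick way to tell which kind of PDF you have is to check whether pdftotext returns anything besides whitespace. The function below is a sketch under that assumption. The name has_real_text is made up, and a scanned PDF that carries a few stray characters of real text would still pass the check.

```shell
# has_real_text is a hypothetical helper: it succeeds (exit status 0)
# when pdftotext finds at least one non-whitespace character in the PDF.
has_real_text() {
    # "-" as the output file makes pdftotext write to standard output;
    # tr strips all whitespace, including page-break characters
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}
```

For example: "has_real_text pdffile.pdf || echo 'looks like a scan, use the older tip'".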
