Extract images from PDF, how to

Category: Linux,multimedia,Ubuntu   — Published by goeszen on December 21, 2015 at 2:03 pm

If you want to extract images from a PDF, and I mean really extract what's embedded in the PDF, not just take a screenshot of the rendered page, here is a low-level way of saving the embedded images from a PDF file:

Extracting is possible via poppler-utils. This comes with a small tool called pdfimages, a "PDF image extractor tool that saves images from a PDF file to PPM, PBM or JPEG file(s) format".

Usage is: pdfimages [options] PDF-file image-root

So, for example, to save all images from a PDF, keeping JPEG-encoded ones as JPEG files, do:

$ pdfimages -j in.pdf /tmp/out

which will save all images found to /tmp, using "out" as the file name prefix and numbering them in ascending order (out-000.jpg, out-001.jpg, and so on).

Note that -j only writes an image as a JPEG file when it is already stored as JPEG (DCT) inside the PDF; all other images are still written as PPM or PBM. So it's only doing a "native" save on JPEGs. Newer versions of pdfimages (later than the one shipped with Ubuntu 12.04) offer the "-all" switch, about which the docs say "Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files. This is equivalent to specifying the options -png -tiff -j -jp2 -jbig2 -ccitt." If your version of poppler-utils doesn't come with this option, you can export to PNM (PPM/PBM), which is a lossless format and can later be converted to PNG, for example, for a completely lossless workflow (if you care about losing one generation in the first place).
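The choice between "-all" and the older fallback could be scripted roughly as follows. This is only a sketch: pick_flags and convert_pnm_to_png are made-up helper names, the check simply greps the usage text for "-all", and the PNG conversion assumes pnmtopng from the netpbm package is installed (ImageMagick would work just as well).

```shell
#!/bin/sh
# Sketch only: pick_flags and convert_pnm_to_png are hypothetical helpers.

# Print "-all" if the installed pdfimages mentions that switch in its
# usage text, otherwise fall back to "-j".
pick_flags() {
    # The usage text may go to stderr, so merge both streams before grepping.
    if pdfimages -h 2>&1 | grep -q -- '-all'; then
        echo "-all"
    else
        echo "-j"
    fi
}

# Losslessly convert every PPM/PBM file written under an image root to PNG.
# Assumption: pnmtopng (netpbm) is available on the PATH.
convert_pnm_to_png() {
    root=$1
    for f in "$root"-*.ppm "$root"-*.pbm; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        pnmtopng "$f" > "${f%.*}.png"    # out-000.ppm becomes out-000.png
    done
}

# Example run, using the file names from this post:
#   pdfimages $(pick_flags) in.pdf /tmp/out
#   convert_pnm_to_png /tmp/out
```

With "-all" no PPM/PBM files should be produced, so the conversion step simply does nothing in that case.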

This is a sort-of follow-up post to:
Combining two or more PDFs into one, on Linux
Merging multiple pdf files and jpg files into one PDF
Convert images to PDF document, on Linux, on CLI
