{"id":1684,"date":"2012-06-20T14:47:29","date_gmt":"2012-06-20T14:47:29","guid":{"rendered":"https:\/\/noi3.org\/site\/?p=1684"},"modified":"2012-06-20T14:47:29","modified_gmt":"2012-06-20T14:47:29","slug":"pdf-data-extraction-in-linux","status":"publish","type":"post","link":"https:\/\/site.noi3.org\/?p=1684","title":{"rendered":"PDF Data Extraction In Linux"},"content":{"rendered":"<p> \t<img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-1683\" alt=\"\" src=\"https:\/\/noi3.org\/site\/wp-content\/uploads\/2012\/06\/pdf.png\" style=\"height: 250px; width: 250px;\" width=\"0\" height=\"0\" \/>This is a tip sent by WebUpd8 reader Stone Cut, on extracting images and text from PDF files. It&#39;s different from his <a href=\"http:\/\/www.webupd8.org\/2010\/02\/how-to-extract-all-text-from-pdfs.html\" title=\"How To Extract All Text From PDFs (Including Text In Images) [Ubuntu]\">previous tip<\/a> and useful for other cases.<br \/> \t<a name=\"more\"><\/a>First, install the necessary utilities:<\/p>\n<p> \t&#8211; Ubuntu:<\/p>\n<pre class=\"linux-code\"> <code>sudo apt-get install poppler-utils<\/code><\/pre>\n<p> \t&#8211; Fedora:<\/p>\n<pre class=\"linux-code\"> <code>sudo yum install poppler-utils<\/code><\/pre>\n<p> \tFor other Linux distributions, search for poppler-utils in your package manager.<\/p>\n<p>  <!--more-->  <\/p>\n<p> \tThis command will <b>extract all the images<\/b> from &quot;pdffile.pdf&quot; and put them in the <i><b>\/home\/&lt;username&gt;\/pdfimages\/<\/b><\/i> directory, which must already exist:<\/p>\n<pre class=\"linux-code\"> <code>pdfimages -j pdffile.pdf ~\/pdfimages\/<\/code><\/pre>\n<div style=\"text-align: justify;\"> \tWithout the &quot;-j&quot; option, pdfimages saves every image as a PPM (or PBM) file; with &quot;-j&quot;, images that are stored as JPEG in the PDF are written as JPEG files instead.<\/div>\n<p> <\/p>\n<div style=\"text-align: justify;\"> \tThe advantage of pdfimages is that it extracts the original images as embedded in the PDF. For example, I extracted the images from a PDF from our local kindergarten so I could use some of them for an invitation, and I was quite surprised to find that one extracted image was much larger and showed much more of the photo than it did in the PDF, where parts of it were masked by the rest of the layout. Interesting and very useful.<\/div>\n<p> <\/p>\n<div style=\"text-align: justify;\"> \tThis command will <b>extract all the actual text<\/b> and save it to a file with the same name as the PDF but with a TXT extension (pdffile.txt), in the same directory as the source file:<\/div>\n<pre class=\"linux-code\"> <code>pdftotext pdffile.pdf<\/code><\/pre>\n<p> \tPlease note that this command will only extract real text. 
If your PDF contains images with text printed on them, this won&#39;t work; for those files, please refer to my older tip: <a href=\"http:\/\/www.webupd8.org\/2010\/02\/how-to-extract-all-text-from-pdfs.html\">How To Extract All Text From PDFs (Including Text In Images)<\/a>.<\/p>\n<p> \t&nbsp;<\/p>\n<p> \t<a href=\"http:\/\/www.webupd8.org\/2012\/06\/pdf-data-extraction-in-linux.html\">Original article<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a tip sent by WebUpd8 reader Stone Cut, on extracting images and text from PDF files. It&#39;s different from his previous tip and&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1683,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30],"tags":[698,442,790,791],"class_list":["post-1684","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-informatica","tag-extractie","tag-imagini","tag-pdf","tag-text"],"_links":{"self":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/posts\/1684","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1684"}],"version-history":[{"count":0,"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/posts\/1684\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/media\/1683"}],"wp:attachment":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1684"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/site.noi3.org\/i
ndex.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1684"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1684"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}