茫茫網海中的冷日 - [Repost] Java PDF parser PDFBox


冷日
(冷日)

Posted: 2012/7/5 14:14

Webmaster

Registered: 2008/2/19
From:
Posts: 15773

[Repost] Java PDF parser PDFBox

Java PDF parser PDFBox

PDFBox

Reading a document from the command line

Usage: java org.pdfbox.ExtractText [OPTIONS] <PDF file> [Text File]
  -password <password>         Password to decrypt document
  -encoding <output encoding>  (ISO-8859-1,UTF-16BE,UTF-16LE,...)
  -console                     Send text to console instead of file
  -html                        Output in HTML format instead of raw text
  -sort                        Sort the text before writing
  -startPage <number>          The first page to start extraction(1 based)
  -endPage <number>            The last page to extract(inclusive)
  <PDF file>                   The PDF document to use
  [Text File]                  The file to write the text to
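
To make the shape of that usage line concrete (options first, then the PDF file, then an optional text file), here is a rough sketch of how such arguments could be parsed. It uses only the standard library. The class ExtractTextArgs, its field names, and its default values are hypothetical illustrations, not the actual org.pdfbox.ExtractText implementation.

```java
// Hypothetical holder for the options listed in the usage text.
// This is NOT the real org.pdfbox.ExtractText code; defaults are assumptions.
final class ExtractTextArgs {
    String password = "";             // assumed default: no password
    String encoding = null;           // null = leave the choice to the tool
    boolean toConsole = false;
    boolean toHtml = false;
    boolean sort = false;
    int startPage = 1;                // 1 based, as the usage text says
    int endPage = Integer.MAX_VALUE;  // inclusive
    String pdfFile = null;
    String textFile = null;

    static ExtractTextArgs parse(String[] args) {
        ExtractTextArgs parsed = new ExtractTextArgs();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-password")) {
                parsed.password = value(args, ++i, arg);
            } else if (arg.equals("-encoding")) {
                parsed.encoding = value(args, ++i, arg);
            } else if (arg.equals("-startPage")) {
                parsed.startPage = Integer.parseInt(value(args, ++i, arg));
            } else if (arg.equals("-endPage")) {
                parsed.endPage = Integer.parseInt(value(args, ++i, arg));
            } else if (arg.equals("-console")) {
                parsed.toConsole = true;
            } else if (arg.equals("-html")) {
                parsed.toHtml = true;
            } else if (arg.equals("-sort")) {
                parsed.sort = true;
            } else if (parsed.pdfFile == null) {
                parsed.pdfFile = arg;   // first positional argument
            } else if (parsed.textFile == null) {
                parsed.textFile = arg;  // optional second positional argument
            } else {
                throw new IllegalArgumentException("Unexpected argument: " + arg);
            }
        }
        if (parsed.pdfFile == null) {
            throw new IllegalArgumentException("Missing <PDF file>");
        }
        return parsed;
    }

    // Returns the value that follows an option, or fails if it is missing.
    private static String value(String[] args, int index, String option) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + option);
        }
        return args[index];
    }

    public static void main(String[] args) {
        ExtractTextArgs a = parse(new String[] {
            "-sort", "-startPage", "2", "-endPage", "5", "test.pdf", "out.txt"});
        System.out.println(a.pdfFile + " pages " + a.startPage + "-" + a.endPage
            + " sort=" + a.sort + " -> " + a.textFile);
    }
}
```

Anything that is not a recognized option is treated as a positional argument, which is why the PDF file must come before the text file.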

Writing a program to read the test.pdf document

//-- Main.java --
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class Main {
    public static void main(String[] args) throws Exception {
        // Load the PDF, extract all of its text, and print it
        PDDocument doc = PDDocument.load("test.pdf");
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            System.out.println(stripper.getText(doc));
        } finally {
            doc.close(); // release the resources held by the document
        }
    }
}

Generating image files of the document from the command line; the results are poor, and Chinese characters are not rendered at all

Usage: java org.pdfbox.PDFToImage [OPTIONS] <PDF file>
  -password <password>           Password to decrypt document
  -imageType <image type>        (BMP,bmp,jpg,JPG,wbmp,jpeg,png,PNG,JPEG,WBMP,GIF,gif)
  -outputPrefix <output prefix>  Filename prefix for image files
  -startPage <number>            The first page to start extraction(1 based)
  -endPage <number>              The last page to extract(inclusive)
  <PDF file>                     The PDF document to use

Writing a program to generate the image files, based on the PDFToImage source code; the result is the same

//-- Main.java --
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDPage;

public class Main {
  public static void main(String[] args) throws Exception {
    PDDocument doc = PDDocument.load("test.pdf");
    List pages = doc.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while (iter.hasNext()) {
      PDPage page = (PDPage) iter.next();
      BufferedImage image = page.convertToImage();
      File file = File.createTempFile("test_", ".jpg");
      System.out.println(file);
      ImageOutputStream output = ImageIO.createImageOutputStream(file);
      try {
        boolean foundWriter = false;
        Iterator writerIter = ImageIO.getImageWritersByFormatName("jpg");
        while (writerIter.hasNext() && !foundWriter) {
          try {
            ImageWriter imageWriter = (ImageWriter) writerIter.next();
            try {
              ImageWriteParam writerParams = imageWriter.getDefaultWriteParam();
              if (writerParams.canWriteCompressed()) {
                writerParams.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                writerParams.setCompressionQuality(1.0f);
              }
              imageWriter.setOutput(output);
              imageWriter.write(null, new IIOImage(image, null, null), writerParams);
              foundWriter = true;
            }
            finally {
              imageWriter.dispose();
            }
          } catch (IIOException io) {
            io.printStackTrace();
          }
        }
      } finally {
        output.close();
      }
    }
    doc.close(); // release the document once all pages have been written
  }
}
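
The JPEG-writing part of the listing above does not depend on PDFBox, so it can be tried on its own. The sketch below isolates it using only the standard javax.imageio API. The class JpegQualityDemo and the helpers writeJpeg and samplePage are hypothetical names, and samplePage merely stands in for the image that page.convertToImage() would return.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

final class JpegQualityDemo {

    // Writes image to file as JPEG with the given quality (0.0f to 1.0f).
    static void writeJpeg(BufferedImage image, File file, float quality) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpg");
        if (!writers.hasNext()) {
            throw new IOException("No JPEG writer available");
        }
        ImageWriter writer = writers.next();
        ImageOutputStream output = ImageIO.createImageOutputStream(file);
        try {
            ImageWriteParam params = writer.getDefaultWriteParam();
            if (params.canWriteCompressed()) {
                // Explicit mode is required before a quality can be set
                params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                params.setCompressionQuality(quality);
            }
            writer.setOutput(output);
            writer.write(null, new IIOImage(image, null, null), params);
        } finally {
            writer.dispose();
            output.close();
        }
    }

    // Stand-in for a rendered page: a white RGB canvas with a black rectangle.
    static BufferedImage samplePage(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        g.setColor(Color.BLACK);
        g.fillRect(width / 4, height / 4, width / 2, height / 2);
        g.dispose();
        return image;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("test_", ".jpg");
        writeJpeg(samplePage(200, 100), file, 1.0f);
        System.out.println(file + " (" + file.length() + " bytes)");
    }
}
```

If this standalone version produces a clean JPEG, the poor output described above is more likely to come from the page rendering step than from the JPEG encoder settings.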

Original source: Solnone 螺旋旅人: Java PDF parser PDFBox


冷日
(冷日)

Posted: 2012/7/5 14:16

[Repost] Parsing PDF in Java: reading PDF content with pdfbox

Parsing PDF in Java: reading PDF content with pdfbox
Blog categories: Java, Java web scraping

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.OutputStreamWriter;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class Pdftext {
 // Returns the text of the PDF file f, or "" if extraction fails
 public static String getTxt(File f) throws Exception {
  String ts = "";
  PDDocument pdfdocument = null;
  try {
   pdfdocument = PDDocument.load(f);

   ByteArrayOutputStream out = new ByteArrayOutputStream();
   // Use one explicit charset for both encoding and decoding
   OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
   PDFTextStripper stripper = new PDFTextStripper();

   stripper.writeText(pdfdocument, writer);

   // Close the writer first so buffered characters reach the byte array
   writer.close();
   byte[] contents = out.toByteArray();
   ts = new String(contents, "UTF-8");
   System.out.println(f.getName() + " length is: " + contents.length
     + "\n");
  } catch (Exception e) {
   e.printStackTrace();
  } finally {
   // Close the document even when extraction fails
   if (pdfdocument != null) {
    pdfdocument.close();
   }
  }
  return ts;
 }

 public static void main(String[] args) throws Exception {

     File file = new File("d:/hello.pdf");
     System.out.println(Pdftext.getTxt(file));


/*
  // Alternative: parse with PDFParser directly, then write the text to a file
  File file = new File("d:/hello.pdf");
  FileInputStream fis = new FileInputStream(file);
  BufferedInputStream bis = new BufferedInputStream(fis);
  PDFParser parser = new PDFParser(bis);

  parser.parse();
  PDDocument document = parser.getPDDocument();

  PDFTextStripper stripper = new PDFTextStripper();
  String s = stripper.getText(document);

  document.close();
  bis.close();

  // Caution: this path is the input PDF itself, so the write below overwrites it
  File ff = new File("d:/hello.pdf");
  if (!ff.exists()) {
   ff.createNewFile();
  }

  FileWriter fw = new FileWriter(ff);
  BufferedWriter bw = new BufferedWriter(fw);

  bw.write(s);
  bw.close();*/

 }

}
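
The listing above sends the extracted text through a ByteArrayOutputStream and then turns the bytes back into a String. Since the stripper writes to a java.io.Writer, a StringWriter might be a simpler target that avoids the encoding step altogether. The sketch below compares the two approaches using only the standard library. The method fakeWriteText is a hypothetical stand-in for the stripper's writeText call, and the class name WriterCaptureDemo is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;

final class WriterCaptureDemo {

    // Hypothetical stand-in for the stripper writing extracted text to a Writer.
    static void fakeWriteText(Writer writer) throws IOException {
        writer.write("PDF \u5167\u5bb9 sample text");
    }

    // Variant 1: capture the characters directly, no encoding step involved.
    static String viaStringWriter() throws IOException {
        StringWriter writer = new StringWriter();
        fakeWriteText(writer);
        return writer.toString();
    }

    // Variant 2: the byte round trip, with one explicit charset on both sides.
    static String viaBytes() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
        fakeWriteText(writer);
        writer.close(); // flush buffered characters before reading the bytes
        return new String(out.toByteArray(), "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(viaStringWriter().equals(viaBytes()));
    }
}
```

The byte round trip only preserves non-ASCII text when the same charset is used for writing and for decoding, which is easy to get wrong when both sides silently fall back to the platform default.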

Original source: Parsing PDF in Java: reading PDF content with pdfbox - ITeye技术网站


冷日
(冷日)

Posted: 2012/7/5 14:19

[Repost] PDF Text Parser: Converting PDF to Text in Java using PDFBox

PDF Text Parser: Converting PDF to Text in Java using PDFBox

Converting PDF to text is useful in many applications, from search engines that index PDF documents to other data processing tasks. I was looking for a Java-based API to convert PDF to text, in other words a PDF text parser in Java, and after going through many articles, the PDFBox project came to my rescue. PDFBox is a library that handles different types of PDF documents, including encrypted ones, and extracts their text. It also has a command line utility for converting PDF to text documents.

I needed a reusable Java class to convert PDF documents to text in one of my projects, and the Java code below does that using the PDFBox Java API. It takes two command line parameters: the input PDF file and the output text file, to which the parsed text from the PDF document is written.

This code was tested with PDFBox 0.7.3, although it should work with other versions of PDFBox as well. It can be easily integrated with other Java applications and can also be used as a command line utility. The steps to run this code are given below.

Listing 1:
PDFTextParser.java

1: /*
2:  * PDFTextParser.java
3:  * Author: S.Prasanna
4:  *
5:  */
6:
7:  import org.pdfbox.cos.COSDocument;
8:  import org.pdfbox.pdfparser.PDFParser;
9:  import org.pdfbox.pdmodel.PDDocument;
10: import org.pdfbox.pdmodel.PDDocumentInformation;
11: import org.pdfbox.util.PDFTextStripper;
12:
13: import java.io.File;
14: import java.io.FileInputStream;
15: import java.io.PrintWriter;
16:
17: public class PDFTextParser {
18:
19:     PDFParser parser;
20:     String parsedText;
21:     PDFTextStripper pdfStripper;
22:     PDDocument pdDoc;
23:     COSDocument cosDoc;
24:     PDDocumentInformation pdDocInfo;
25:
26:     // PDFTextParser Constructor 
27:     public PDFTextParser() {
28:     }
29:
30:     // Extract text from PDF Document
31:     String pdftoText(String fileName) {
32:   
33:         System.out.println("Parsing text from PDF file " + fileName + "....");
34:         File f = new File(fileName);
35:   
36:         if (!f.isFile()) {
37:             System.out.println("File " + fileName + " does not exist.");
38:             return null;
39:         }
40:   
41:         try {
42:             parser = new PDFParser(new FileInputStream(f));
43:         } catch (Exception e) {
44:             System.out.println("Unable to open PDF Parser.");
45:             return null;
46:         }
47:   
48:         try {
49:             parser.parse();
50:             cosDoc = parser.getDocument();
51:             pdfStripper = new PDFTextStripper();
52:             pdDoc = new PDDocument(cosDoc);
53:             parsedText = pdfStripper.getText(pdDoc);
54:         } catch (Exception e) {
55:             System.out.println("An exception occurred in parsing the PDF Document.");
56:             e.printStackTrace();
57:             return null;
58:         } finally {
59:             try {
60:                 if (cosDoc != null) cosDoc.close(); // pdDoc wraps cosDoc, so this releases both
61:             } catch (Exception e1) {
62:                 e1.printStackTrace();
63:             }
64:         }
65:         System.out.println("Done.");
66:         return parsedText;
67:     }
68:
69:     // Write the parsed text from PDF to a file
70:     void writeTexttoFile(String pdfText, String fileName) {
71:   
72:         System.out.println("\nWriting PDF text to output text file " + fileName + "....");
73:         try {
74:             PrintWriter pw = new PrintWriter(fileName);
75:             pw.print(pdfText);
76:             pw.close();  
77:         } catch (Exception e) {
78:             System.out.println("An exception occurred in writing the pdf text to file.");
79:             e.printStackTrace();
80:         }
81:         System.out.println("Done.");
82:     }
83:
84:     //Extracts text from a PDF Document and writes it to a text file
85:     public static void main(String args[]) {
86:   
87:         if (args.length != 2) {
88:             System.out.println("Usage: java PDFTextParser <input PDF file> <output text file>");
89:             System.exit(1);
90:         }
91:   
92:         PDFTextParser pdfTextParserObj = new PDFTextParser();
93:         String pdfToText = pdfTextParserObj.pdftoText(args[0]);
94:   
95:         if (pdfToText == null) {
96:             System.out.println("PDF to Text Conversion failed.");
97:         }
98:         else {
99:             System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
100:             pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
101:         }
102:     }
103: }

Explanation:

The above code takes two command line parameters: the input PDF file and the output text file. The pdftoText method on line 31 handles the text parsing, and the writeTexttoFile method on line 70 writes the parsed text to the output file.
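
One detail of writeTexttoFile worth noting: new PrintWriter(fileName) encodes with the platform default charset, and PrintWriter does not throw on write errors. The sketch below is one possible variant that names the charset explicitly and checks for write errors. It assumes Java 7 or later (try-with-resources, java.nio.file), which is newer than the JDK 1.6 used for the listing. The class TextFileWriterDemo and the boolean return value are hypothetical additions, not part of the original code.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class TextFileWriterDemo {

    // Writes text to fileName as UTF-8; returns false if the write failed.
    static boolean writeTextToFile(String text, String fileName) {
        try (PrintWriter pw = new PrintWriter(fileName, "UTF-8")) {
            pw.print(text);
            // PrintWriter swallows IOExceptions, so ask it whether one occurred
            return !pw.checkError();
        } catch (IOException e) {
            // Thrown when the file cannot be opened or the charset is unknown
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("parsed_", ".txt");
        boolean ok = writeTextToFile("parsed \u6587\u5b57", out.getPath());
        // Read the file back with the same charset to confirm the round trip
        String back = new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        System.out.println(ok + " " + back.length());
    }
}
```

Returning a boolean lets the caller report a failed write instead of printing "Done." unconditionally.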

Compiling and running the code:

I used PDFBox 0.7.3 to compile and run the above code, so you need to add its jars to your Java project settings.

1. Download PDFBox 0.7.3 from here.
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other external packages used by PDFBox-0.7.3) to the classpath to compile and run the code.

Note: I used JDK 1.6 to compile the above code.

Original source: Techtalks: PDF Text Parser: Converting PDF to Text in Java using PDFBox
