
Converting PDF to text is useful in many applications, from search engines that index PDF documents to other data-processing tasks. I was looking for a Java API to convert PDF to text, in other words a PDF text parser in Java, and after going through many articles, the PDFBox project came to my rescue. PDFBox is a library that handles different types of PDF documents, including encrypted ones, extracts their text, and also ships a command-line utility for converting PDF documents to text.

One of my projects needed a reusable Java class to convert PDF documents to text, and the Java code below does that using the PDFBox API. It takes two command-line parameters: the input PDF file and the output text file to which the parsed text will be written.

This code was tested with PDFBox 0.7.3. It should work with other versions that keep the org.pdfbox package names; later Apache PDFBox releases use the org.apache.pdfbox prefix, so the imports would need adjusting there. The class can be easily integrated with other Java applications or used as a command-line utility. The steps to run it are given below.

Listing 1:
PDFTextParser.java

1: /*
2:  * PDFTextParser.java
3:  * Author: S.Prasanna
4:  *
5:  */
6:
7:  import org.pdfbox.cos.COSDocument;
8:  import org.pdfbox.pdfparser.PDFParser;
9:  import org.pdfbox.pdmodel.PDDocument;
10: import org.pdfbox.pdmodel.PDDocumentInformation;
11: import org.pdfbox.util.PDFTextStripper;
12:
13: import java.io.File;
14: import java.io.FileInputStream;
15: import java.io.PrintWriter;
16:
17: public class PDFTextParser {
18:
19:     PDFParser parser;
20:     String parsedText;
21:     PDFTextStripper pdfStripper;
22:     PDDocument pdDoc;
23:     COSDocument cosDoc;
24:     PDDocumentInformation pdDocInfo;
25:
26:     // PDFTextParser Constructor 
27:     public PDFTextParser() {
28:     }
29:
30:     // Extract text from PDF Document
31:     String pdftoText(String fileName) {
32:   
33:         System.out.println("Parsing text from PDF file " + fileName + "....");
34:         File f = new File(fileName);
35:   
36:         if (!f.isFile()) {
37:             System.out.println("File " + fileName + " does not exist.");
38:             return null;
39:         }
40:   
41:         try {
42:             parser = new PDFParser(new FileInputStream(f));
43:         } catch (Exception e) {
44:             System.out.println("Unable to open PDF Parser.");
45:             return null;
46:         }
47:   
48:         try {
49:             parser.parse();
50:             cosDoc = parser.getDocument();
51:             pdfStripper = new PDFTextStripper();
52:             pdDoc = new PDDocument(cosDoc);
53:             parsedText = pdfStripper.getText(pdDoc);
54:         } catch (Exception e) {
55:             System.out.println("An exception occurred in parsing the PDF Document.");
56:             e.printStackTrace();
57:             try {
58:                 if (cosDoc != null) cosDoc.close();
59:                 if (pdDoc != null) pdDoc.close();
60:             } catch (Exception e1) {
61:                 e1.printStackTrace();
62:             }
63:             return null;
64:         }
65:         System.out.println("Done.");
66:         return parsedText;
67:     }
68:
69:     // Write the parsed text from PDF to a file
70:     void writeTexttoFile(String pdfText, String fileName) {
71:   
72:         System.out.println("\nWriting PDF text to output text file " + fileName + "....");
73:         try {
74:             PrintWriter pw = new PrintWriter(fileName);
75:             pw.print(pdfText);
76:             pw.close();  
77:         } catch (Exception e) {
78:             System.out.println("An exception occurred in writing the pdf text to file.");
79:             e.printStackTrace();
80:         }
81:         System.out.println("Done.");
82:     }
83:
84:     //Extracts text from a PDF Document and writes it to a text file
85:     public static void main(String args[]) {
86:   
87:         if (args.length != 2) {
88:             System.out.println("Usage: java PDFTextParser <InputPDFFile> <OutputTextFile>");
89:             System.exit(1);
90:         }
91:   
92:         PDFTextParser pdfTextParserObj = new PDFTextParser();
93:         String pdfToText = pdfTextParserObj.pdftoText(args[0]);
94:   
95:         if (pdfToText == null) {
96:             System.out.println("PDF to Text Conversion failed.");
97:         }
98:         else {
99:             System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
100:             pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
101:         }
102:     }
103: }

Explanation:

The above code takes two command-line parameters: the input PDF file and the output text file. The pdftoText method at line 31 handles the text parsing, and the writeTexttoFile method at line 70 writes the parsed text to the output file.
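
One point about the writing step: PrintWriter's methods do not throw on I/O errors (failures only show up through checkError()), and the single-argument constructor uses the platform default charset. The following is a minimal sketch of an alternative, not the original author's method. It assumes Java 7 or later (the listing itself was compiled with JDK 1.6) and that UTF-8 output is wanted. The class name TextFileWriterSketch and the method writeText are hypothetical, introduced only for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of the original listing or of PDFBox.
class TextFileWriterSketch {

    // Writes text to the target file as UTF-8 (an assumed encoding choice).
    // I/O failures propagate to the caller as IOException.
    static void writeText(String text, Path target) throws IOException {
        // try-with-resources (Java 7+) closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string that pdftoText would return
        String sample = "Text parsed from a PDF\nSecond line";
        Path out = Files.createTempFile("pdftext-", ".txt");
        writeText(sample, out);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the listing, a method like this could stand in for writeTexttoFile, with main deciding how to report an IOException.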

Compiling and Running the code:

I used PDFBox 0.7.3 to compile and run the above code, so you need to add its jars to your Java project settings.

1. Download PDFBox 0.7.3 (PDFBox-0.7.3.zip).
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other external packages used by PDFBox 0.7.3) to the classpath to compile and run the code.

Note: I used JDK 1.6 to compile the above code.


PDF Text Parser: Converting PDF to Text in Java using PDFBox