茫茫網海中的冷日 - Post a reply to this article
Posted by: 冷日  Posted: 2012/7/5 14:19:37

PDF Text Parser: Converting PDF to Text in Java using PDFBox

Converting PDF to text is an interesting task with uses in many applications, from search engines indexing PDF documents to other data processing tasks. I was looking for a Java-based API to convert PDF to text, in other words a PDF text parser in Java. After going through many articles, I found the PDFBox project, which came to my rescue. PDFBox is a library that can handle different types of PDF documents, including encrypted PDFs, and extract their text. It also has a command line utility to convert PDF documents to text.


I needed a reusable Java class to convert PDF documents to text in one of my projects, and the Java code below does this using the PDFBox Java API. It takes two command line parameters: the input PDF file and the output text file to which the parsed text from the PDF document will be written.

This code was tested with PDFBox 0.7.3. It should work with other versions of PDFBox as well, though later Apache PDFBox releases moved the classes to org.apache.pdfbox packages, so the imports would need to change there. It can be easily integrated with other Java applications or used as a command line utility. The steps to run this code are given below.

Listing 1: PDFTextParser.java


  1: /*
  2:  * PDFTextParser.java
  3:  * Author: S.Prasanna
  4:  *
  5:  */
  6:
  7: import org.pdfbox.cos.COSDocument;
  8: import org.pdfbox.pdfparser.PDFParser;
  9: import org.pdfbox.pdmodel.PDDocument;
 10: import org.pdfbox.pdmodel.PDDocumentInformation;
 11: import org.pdfbox.util.PDFTextStripper;
 12:
 13: import java.io.File;
 14: import java.io.FileInputStream;
 15: import java.io.PrintWriter;
 16:
 17: public class PDFTextParser {
 18:
 19:     PDFParser parser;
 20:     String parsedText;
 21:     PDFTextStripper pdfStripper;
 22:     PDDocument pdDoc;
 23:     COSDocument cosDoc;
 24:     PDDocumentInformation pdDocInfo;
 25:
 26:     // PDFTextParser Constructor
 27:     public PDFTextParser() {
 28:     }
 29:
 30:     // Extract text from PDF Document; returns null on failure
 31:     String pdftoText(String fileName) {
 32:
 33:         System.out.println("Parsing text from PDF file " + fileName + "....");
 34:         File f = new File(fileName);
 35:
 36:         if (!f.isFile()) {
 37:             System.out.println("File " + fileName + " does not exist.");
 38:             return null;
 39:         }
 40:
 41:         try {
 42:             parser = new PDFParser(new FileInputStream(f));
 43:         } catch (Exception e) {
 44:             System.out.println("Unable to open PDF Parser.");
 45:             return null;
 46:         }
 47:
 48:         try {
 49:             parser.parse();
 50:             cosDoc = parser.getDocument();
 51:             pdfStripper = new PDFTextStripper();
 52:             pdDoc = new PDDocument(cosDoc);
 53:             parsedText = pdfStripper.getText(pdDoc);
 54:         } catch (Exception e) {
 55:             System.out.println("An exception occurred in parsing the PDF Document.");
 56:             e.printStackTrace();
 57:             try {
 58:                 if (cosDoc != null) cosDoc.close();
 59:                 if (pdDoc != null) pdDoc.close();
 60:             } catch (Exception e1) {
 61:                 e1.printStackTrace(); // report the close failure, not the original exception again
 62:             }
 63:             return null;
 64:         }
 65:         System.out.println("Done.");
 66:         return parsedText;
 67:     }
 68:
 69:     // Write the parsed text from PDF to a file
 70:     void writeTexttoFile(String pdfText, String fileName) {
 71:
 72:         System.out.println("\nWriting PDF text to output text file " + fileName + "....");
 73:         try {
 74:             PrintWriter pw = new PrintWriter(fileName);
 75:             pw.print(pdfText);
 76:             pw.close();
 77:         } catch (Exception e) {
 78:             System.out.println("An exception occurred in writing the pdf text to file.");
 79:             e.printStackTrace();
 80:         }
 81:         System.out.println("Done.");
 82:     }
 83:
 84:     // Extracts text from a PDF Document and writes it to a text file
 85:     public static void main(String args[]) {
 86:
 87:         if (args.length != 2) {
 88:             System.out.println("Usage: java PDFTextParser <InputPDFFile> <OutputTextFile>");
 89:             System.exit(1);
 90:         }
 91:
 92:         PDFTextParser pdfTextParserObj = new PDFTextParser();
 93:         String pdfToText = pdfTextParserObj.pdftoText(args[0]);
 94:
 95:         if (pdfToText == null) {
 96:             System.out.println("PDF to Text Conversion failed.");
 97:         }
 98:         else {
 99:             System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
100:             pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
101:         }
102:     }
103: }

Explanation:

The above code takes two command line parameters: the input PDF file and the output text file. The pdftoText method on line 31 handles the text parsing, and the writeTexttoFile method on line 70 writes the parsed text to the output file.
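
One detail of writeTexttoFile is worth a closer look: new PrintWriter(fileName) encodes the text with the platform default charset, and PrintWriter reports write errors only through checkError() rather than by throwing. The sketch below is one possible alternative, not part of the original listing. It writes the text as UTF-8 and lets I/O errors surface as exceptions. It assumes Java 7 or later (the listing itself was compiled with JDK 1.6), and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or of the original listing.
class Utf8TextWriterSketch {

    // Writes text to the given file as UTF-8, replacing any existing content.
    static void writeText(String text, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        // try-with-resources closes the writer even if write() throws.
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Reads the whole file back as UTF-8; used here only to check the round trip.
    static String readText(String fileName) throws IOException {
        return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pdf-text", ".txt");
        String sample = "R\u00e9sum\u00e9, page 1\n"; // non-ASCII text to exercise the encoding
        writeText(sample, tmp.toString());
        System.out.println("Round trip ok: " + sample.equals(readText(tmp.toString())));
        Files.delete(tmp);
    }
}
```

In the listing, the body of writeTexttoFile could call a helper like writeText(pdfText, fileName) inside its existing try block, which keeps the method's signature and its console messages unchanged.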

Compiling and running the code:

I used PDFBox 0.7.3 to compile and run the above code, so you need to add its jars to your Java project settings.

1. Download PDFBox 0.7.3 from here.
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other external packages used by PDFBox 0.7.3) to the classpath to compile and run the code.

Note: I used JDK 1.6 to compile the above code.
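
If you are not using an IDE, step 3 means listing every jar by hand. As a rough convenience, the small sketch below collects the jars from the given directories and prints a classpath string. It is only an illustration: the class name ClasspathSketch is made up, and the default folder name PDFBox-0.7.3 with its lib and external subdirectories is assumed from the unzip layout described in the steps.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for building a classpath string; not part of PDFBox.
class ClasspathSketch {

    // Collects the .jar files directly inside each directory and joins them
    // with the platform path separator (":" on Unix, ";" on Windows).
    static String buildClasspath(File... jarDirs) {
        List<String> entries = new ArrayList<String>();
        entries.add("."); // current directory, assumed to hold PDFTextParser.class
        for (File dir : jarDirs) {
            File[] files = dir.listFiles();
            if (files == null) {
                continue; // not a directory, or not readable
            }
            Arrays.sort(files); // keep the order stable across runs
            for (File f : files) {
                if (f.isFile() && f.getName().toLowerCase().endsWith(".jar")) {
                    entries.add(f.getPath());
                }
            }
        }
        StringBuilder cp = new StringBuilder();
        for (String entry : entries) {
            if (cp.length() > 0) {
                cp.append(File.pathSeparator);
            }
            cp.append(entry);
        }
        return cp.toString();
    }

    public static void main(String[] args) {
        // Assumed unzip location; pass a different folder as the first argument if needed.
        File home = new File(args.length > 0 ? args[0] : "PDFBox-0.7.3");
        System.out.println(buildClasspath(new File(home, "lib"), new File(home, "external")));
    }
}
```

The printed value can then be passed to the -cp option of javac when compiling PDFTextParser.java, and to the -cp option of java when running PDFTextParser with the input PDF and output text file names.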


Original source: Techtalks: PDF Text Parser: Converting PDF to Text in Java using PDFBox