Processing PDF - Python Text Processing Tutorial

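Python can read a PDF file, extract the text from it, and print the content. To do this, first install the required module, PyPDF2, using the command below. pip must already be available in your Python environment.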

pip install pypdf2
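
A quick way to confirm that the installation worked is to ask Python whether the module can be located. The following is a minimal sketch using only the standard library; the helper name is_module_installed is hypothetical and not part of PyPDF2.

```python
import importlib.util


def is_module_installed(module_name):
    """Return True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# The package installed by pip is imported under the name "PyPDF2".
if not is_module_installed("PyPDF2"):
    print("PyPDF2 was not found; run: pip install pypdf2")
```

If nothing is printed, the module was found and the examples that follow should be able to import it.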

Once the module is installed, you can read a PDF file with the methods it provides.

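import PyPDF2

# Raw string so the backslash in the Windows-style path is kept literally
pdfName = r'path\zaixianpoint.pdf'
# PdfReader, pages, and extract_text() replace the older
# PdfFileReader, getPage(), and extractText() names
read_pdf = PyPDF2.PdfReader(pdfName)
page = read_pdf.pages[0]
page_content = page.extract_text()
print(page_content)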

Running the program above produces the following output:

zaixian Point originated from the idea that there exists a class of readers who respond better
to online content and prefer to learn new skills at their own pace from the comforts of their
drawing rooms.

The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository which now
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.
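
The single-page example assumes that the requested page exists and that text can be extracted from it. A slightly more defensive variant might look like the sketch below. The function names extract_page_text and print_first_page are hypothetical, and the sketch assumes a reader object that exposes a pages sequence whose items have an extract_text() method, as PyPDF2's PdfReader does.

```python
def extract_page_text(reader, index=0):
    """Return the text of the page at a zero-based index, or '' if no text is found."""
    pages = reader.pages
    if not 0 <= index < len(pages):
        raise IndexError("page index %d is out of range (0 to %d)" % (index, len(pages) - 1))
    # Scanned or image-only pages may yield no text at all
    return pages[index].extract_text() or ""


def print_first_page(pdf_path):
    """Open the PDF at pdf_path and print the text of its first page."""
    # Imported here so the helper above stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    print(extract_page_text(reader, 0))
```

Calling print_first_page(r'path\zaixianpoint.pdf') would then be expected to print the same text as the example above.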

Reading Multiple Pages

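To read a PDF with multiple pages and print each page along with its page number, use a loop together with the reader's page-number lookup (getPageNumber() in older PyPDF2 releases, get_page_number() in current ones). The example below uses a PDF file with two pages. The content is printed under two separate page headings.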

import PyPDF2

pdfName = r'Path\zaixianspoint2.pdf'
read_pdf = PyPDF2.PdfReader(pdfName)

# Visit every page in order; get_page_number() returns a zero-based index
# (it was named getPageNumber() in older PyPDF2 releases)
for i in range(len(read_pdf.pages)):
    page = read_pdf.pages[i]
    print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)

Running the example code above produces the following result:

Page No - 1
zaixian Point originated from the idea that there exists a class of readers who respond better to
online content and prefer to learn new skills at their own pace from the comforts of their drawing
rooms.


Page No - 2

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts
a wealth of tutorials and allied articles on topics ranging from p
rogramming languages to web
designing to academics and much more.
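
Because the loop visits the pages in order, the page number can also be taken from the loop counter instead of being looked up for each page. The sketch below shows that alternative. The names iter_numbered_pages and format_pages are hypothetical, and the reader is assumed to expose an iterable pages attribute whose items have an extract_text() method, as PyPDF2's PdfReader does.

```python
def iter_numbered_pages(reader):
    """Yield (page_number, text) pairs, counting pages from 1."""
    for number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string when a page has no extractable text
        yield number, page.extract_text() or ""


def format_pages(reader):
    """Build one string with a 'Page No - N' heading before each page's text."""
    parts = []
    for number, text in iter_numbered_pages(reader):
        parts.append("Page No - " + str(number))
        parts.append(text)
    return "\n".join(parts)
```

With a reader created as in the example above, print(format_pages(read_pdf)) would be expected to give the same page headings, since both approaches number the pages from 1 in document order.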
