How To Extract Text From PDF In Python


PDF (Portable Document Format) is a file format developed by Adobe in the 1990s. Today it is one of the most widely used formats for read-only documents.

In Python, many packages on PyPI can extract text from PDF files, such as pdfplumber, pdfminer, PyPDF2, slate, pdfquery, xpdf, textract and so on.

In this article, we will use the PyPDF2 module for the following tasks:

1) Extracting text

2) Copying pages

3) Rotating pages

4) Encrypting pdf

Installation

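pip install PyPDF2

Note: The examples in this article use the older PyPDF2 API (PdfFileReader, getPage(), extractText() and so on), which was removed in PyPDF2 3.0. To run them as written, install an earlier release:

pip install "PyPDF2<3.0"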

1) Extracting text

We can extract text from a specific page or from all pages.

Note: PyPDF2 does not extract images, charts or other media. It only extracts text and returns it as a Python string.

Extracting a specific page
# import module PyPDF2
import PyPDF2

# put 'example.pdf' in working directory
# and open it in read binary mode
pdfFileObj = open('example.pdf', 'rb')

# call and store PdfFileReader
#  object in pdfReader
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# to print the total number of pages in pdf
# print(pdfReader.numPages)

# get specific page of pdf by passing
#  number since it stores pages in list
#  to access first page pass 0
pageObj = pdfReader.getPage(0)

# extract the text of the page
#  with the extractText() method
texts = pageObj.extractText()

# print the extracted text
print(texts)

# close the file once we are done with it
pdfFileObj.close()
Extracting all pages
import PyPDF2

pdffile = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdffile)
num_pages = pdfReader.numPages

# loop over every page; getPage() counts from 0
for count in range(num_pages):
    pageObj = pdfReader.getPage(count)
    texts = pageObj.extractText()
    # show a 1-based page number to the reader
    print('Page number:', count + 1)
    print(texts)

# close the file once all pages are read
pdffile.close()
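
The loop above prints each page as it goes. If the text is needed as a single string instead, the same calls can be wrapped in a small helper. This is only a sketch: the function name extract_all_text and its separator parameter are made up for illustration, and it assumes the pre-3.0 PyPDF2 API used in this article.

```python
def extract_all_text(pdf_reader, separator="\n"):
    """Return the text of every page, joined by `separator`.

    `pdf_reader` is expected to behave like PyPDF2.PdfFileReader
    (pre-3.0 API): it needs `numPages` and `getPage()`.
    """
    page_texts = []
    for page_num in range(pdf_reader.numPages):
        # getPage() counts from 0, just like in the loop above
        page_texts.append(pdf_reader.getPage(page_num).extractText())
    return separator.join(page_texts)
```

With a real file, this could be called as extract_all_text(PyPDF2.PdfFileReader(pdffile)) while pdffile is still open.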

2) Copying pages

Here, we copy the pages of two PDF files named ‘example1.pdf’ and ‘example2.pdf’ and merge them into a newly created file named ‘example3.pdf’.

import PyPDF2

# open two pdfs
pdf1File = open('example1.pdf', 'rb')
pdf2File = open('example2.pdf', 'rb')

# read first pdf
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
# read second pdf
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
# for writing in new pdf file
pdfWriter = PyPDF2.PdfFileWriter()

for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

# create new pdf 'example3.pdf' 
pdfOutputFile = open('example3.pdf', 'wb')

pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdf1File.close()
pdf2File.close()

Now we can see the new PDF ‘example3.pdf’ in the working directory.

Note: addPage() always appends a page to the end of the PdfFileWriter object. To place a page at another position, PdfFileWriter also provides insertPage(page, index).
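
Because addPage() always adds to the end, one way to control where a page lands is to arrange the page objects in an ordinary Python list first and only then add them in that order. The sketch below illustrates this idea; the helper names arrange_pages and add_pages are hypothetical, and the writer only needs an addPage() method like PdfFileWriter.

```python
def arrange_pages(base_pages, extra_page, position):
    """Return a new list with `extra_page` placed at index `position`."""
    ordered = list(base_pages)  # copy, so the caller's list is untouched
    ordered.insert(position, extra_page)
    return ordered


def add_pages(pdf_writer, pages):
    """Append the pages to the writer in the given order."""
    for page in pages:
        pdf_writer.addPage(page)
```

For example, with pages = [pdf1Reader.getPage(n) for n in range(pdf1Reader.numPages)], calling add_pages(pdfWriter, arrange_pages(pages, pdf2Reader.getPage(0), 1)) would put the first page of the second file right after the first page of the first file.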

3) Rotating pages

PyPDF2 comes with two methods for rotating PDF pages. Both take an angle that must be a multiple of 90 degrees.

rotateCounterClockwise() : Rotates a page counter-clockwise in increments of 90 degrees.

rotateClockwise() : Rotates a page clockwise in increments of 90 degrees.

import PyPDF2

pdfFile = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)

# rotating first page
#  of 'example.pdf' only
page = pdfReader.getPage(0)

# rotating clockwise by 90
page.rotateClockwise(90)

# equivalent alternative:
#  rotating counter-clockwise by 270
# page.rotateCounterClockwise(270)

# creating object 'pdfWriter'
#  to add rotated page
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(page)

# create new pdf
pdfOutputFile = open('rotated-example.pdf', 'wb')
pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdfFile.close()
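
The commented-out line in the example works because a counter-clockwise turn of 270 degrees ends in the same orientation as a clockwise turn of 90 degrees. The small helper below makes that arithmetic explicit. It is an illustrative sketch, not part of PyPDF2; the name to_clockwise is an assumption.

```python
def to_clockwise(counter_clockwise_angle):
    """Convert a counter-clockwise angle to the equivalent clockwise one."""
    if counter_clockwise_angle % 90 != 0:
        # the rotate methods only accept multiples of 90
        raise ValueError("angle must be a multiple of 90 degrees")
    return (-counter_clockwise_angle) % 360


print(to_clockwise(270))  # 90
print(to_clockwise(90))   # 270
```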

4) Encrypting pdf

To prevent a PDF file from being opened by just anyone, PyPDF2 lets us encrypt it with a password.

import PyPDF2

pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfWriter = PyPDF2.PdfFileWriter()

for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))

pdfWriter.encrypt('abc')
resultPdf = open('encrypted-example.pdf', 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()
pdfFileObj.close()

Now a new PDF file named ‘encrypted-example.pdf’ has been created in the working directory. Since we set its password to “abc”, a PDF viewer will ask for that password whenever we try to open the file.
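
The encrypted file can also be opened from Python. In the pre-3.0 PyPDF2 API used here, a PdfFileReader has an isEncrypted attribute and a decrypt() method that returns 0 when the password does not match. The helper below is a minimal sketch built on those two members; the name unlock is made up for this example.

```python
def unlock(pdf_reader, password):
    """Try to unlock a PdfFileReader-like object.

    Returns True if the document is readable afterwards, i.e. it was
    not encrypted or the password was accepted.
    """
    if not pdf_reader.isEncrypted:
        return True
    # decrypt() returns 0 on a wrong password, non-zero on success
    return pdf_reader.decrypt(password) != 0
```

With pdfReader created by PyPDF2.PdfFileReader() for ‘encrypted-example.pdf’, a call such as unlock(pdfReader, 'abc') should return True, after which getPage() can be used as before.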

References

https://automatetheboringstuff.com/chapter13/

https://pythonhosted.org/PyPDF2/

Happy Learning 🙂
