Automating the PDF file content validation — Selenium TestNG

Aug 27, 2021

We often come across scenarios where PDF files need to be validated through automation scripts. These files come in various forms, such as receipts, invoices, bank statements and so on.

This can be achieved with a third-party Java library, Apache PDFBox.

It provides the capability to extract the text content from PDF documents.

There are two possible ways to locate the file:

The file is stored on the local machine.
The file opens in a new tab after clicking on a hyperlink.

Either case can be handled in 7 simple steps.

Step 1: Add the Apache PDFBox Maven dependency (groupId org.apache.pdfbox, artifactId pdfbox) to your pom.xml.

Step 2: Create an object of the URL class.

1. Pass the local path as a file URL if the file is stored on the local machine.

2. Pass the complete URL if the file opens in the browser.
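Both cases of Step 2 can be sketched with the standard library alone. The class and method names here (PdfLocations, fromLocalPath, fromBrowserAddress) are made up for illustration. In a Selenium test, the browser address would typically come from driver.getCurrentUrl() after switching to the tab that shows the PDF.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical helper: builds the URL object for either location of the PDF.
final class PdfLocations {

    // Case 1: the PDF is on the local machine. Going through toUri()
    // produces a well-formed "file:" URL on any operating system.
    static URL fromLocalPath(String path) throws MalformedURLException {
        return Paths.get(path).toAbsolutePath().toUri().toURL();
    }

    // Case 2: the PDF opened in a browser tab. The address is the full URL
    // shown by the browser, for example the value of driver.getCurrentUrl().
    static URL fromBrowserAddress(String address) throws MalformedURLException {
        return new URL(address);
    }
}
```

Converting the local path through a URI avoids hand-writing the file:/// prefix and escaping spaces manually.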

Step 3: Open a stream to the PDF file using the URL object's openStream() method and store it in an InputStream object.

Step 4: Create a BufferedInputStream object and pass the InputStream object reference to it.
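Steps 3 and 4 might look like the following minimal sketch. The class name PdfStreams and the method openBuffered are hypothetical. The caller is expected to close the returned stream.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper covering Steps 3 and 4.
final class PdfStreams {

    static BufferedInputStream openBuffered(URL pdfUrl) throws IOException {
        // Step 3: open the raw byte stream behind the URL.
        InputStream raw = pdfUrl.openStream();
        // Step 4: wrap it in a buffer. Closing the buffered stream
        // also closes the underlying raw stream.
        return new BufferedInputStream(raw);
    }
}
```

Using the result inside a try-with-resources statement keeps the stream from leaking if a later step throws.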

Step 5: Create a PDDocument object by calling the PDDocument.load() method and passing the BufferedInputStream object reference to it.

Step 6: Use the getText() method of a PDFTextStripper object to fetch the content of the PDF document and store it in a String.
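A minimal sketch of Steps 5 and 6 follows. It assumes PDFBox 2.x is on the classpath, since that line of releases offers PDDocument.load(InputStream) and newer major versions load documents differently. The class name PdfText and the method extract are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper covering Steps 5 and 6 (assumes PDFBox 2.x).
final class PdfText {

    static String extract(InputStream pdfStream) throws IOException {
        // Step 5: parse the stream into a PDDocument. try-with-resources
        // closes the document even if text extraction fails.
        try (PDDocument document = PDDocument.load(pdfStream)) {
            // Step 6: pull the text of all pages into a single String.
            return new PDFTextStripper().getText(document);
        }
    }
}
```

The BufferedInputStream from Step 4 can be passed directly, because it is an InputStream.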

Step 7: The real work is done by step 6: the content of the PDF document is now in a String variable. In this last step we only need to write the assertions.
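One possible way to structure the assertions is sketched below in plain Java. The helper names (PdfAssertions, normalize, missingPhrases) and the sample phrases are illustrative, not part of any library. Whitespace is collapsed first, on the assumption that the extracted text may contain line breaks in places where the visible document shows a single line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for Step 7: checks expected phrases against PDF text.
final class PdfAssertions {

    // Collapse every run of whitespace (including line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Return the expected phrases that do NOT occur in the extracted text.
    static List<String> missingPhrases(String pdfText, String... expected) {
        String haystack = normalize(pdfText);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

In a TestNG test the result could then be checked with Assert.assertTrue(missing.isEmpty(), "Missing from PDF: " + missing), so that a failure lists every phrase that was not found instead of stopping at the first one.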

Well, this is it. It looks pretty simple, doesn’t it? But not everything is hunky-dory with PDF file validations. First things first, not all PDF files can be validated this way. Some PDF files are created from image files, such as scans. They contain no text layer, so text extraction returns nothing useful and they cannot be validated with this approach. Also, if the file is very large, downloading and parsing it might add some lag to your test.
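To keep an image-based PDF from producing a misleading assertion failure, a test could first check whether any text was extracted at all. This is a small sketch, and the names PdfTextChecks and hasExtractableText are hypothetical. It assumes that such a file yields empty or whitespace-only text.

```java
// Hypothetical guard: detects PDFs from which no text could be extracted.
final class PdfTextChecks {

    // Returns false for null, empty, or whitespace-only text.
    static boolean hasExtractableText(String pdfText) {
        return pdfText != null && !pdfText.trim().isEmpty();
    }
}
```

Failing early on this check makes it clear that the problem lies with the file type and not with the expected content.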

Thanks for reading.


Written by HARSHVARDHAN SINGH CHAUHAN
