A Regex-Based Algorithm for Splitting PDFs

This is the first part of a two-part article in which we implement a PDF splitter. Splitters are common PDF applications that let you select pages from a source file and extract them into a new document.

They come in a number of flavors, from tools that generate multiple destination PDF files to simple apps that extract pages into a single destination file.

After this brief introduction let’s move on to other ideas and start talking about domain logic.

What’s the Domain Logic?

In the previous post we mentioned the term “domain logic”, a software engineering term that is also known as business logic. Sound overwhelming?

Don’t worry. You’ll find plenty of articles on the Internet trying to explain what domain logic is, as well as many discussion threads still trying to decipher this esoteric issue.

In a nutshell, the domain logic describes the core stuff of your app. That’s it! Well, the problem is that as the complexity of your app grows it’s often necessary to code more and more stuff, so you may end up wondering: Where is my app’s core stuff now?

Anyway, today’s PDF splitter is simple and will help you better understand what the domain logic is.

Let’s Have a Look at the GUI

In this first part we focus on the Java algorithm that works out which pages the user wants to extract from the original source file. Because this algorithm is a core part of the app, we say it is part of the domain logic.

In our example, the pages to be extracted from the original document will be typed in a text field.

Figure 1. The user is asked to pick a PDF file and type the page numbers to be extracted

Notice that there’s a friendly message indicating what is expected to be typed: “Either select your pages one by one separated by commas (,) or select a range of pages by using a hyphen (-)”

Here are some examples of valid inputs:

  • 1,2,3
  • 25-45
  • 2-7, 17-21
  • 2-7, 17-21, 23,24,25
  • 2-7, 17-21, 35-43

The important step here is translating the input string into a list of integers so that we can comfortably work with the data from within our app. We are not reinventing the wheel; in fact this is a common idea in the world of programming: encapsulate data into programmatic structures such as arrays, lists, dictionaries, trees, etc.

Building a Java ArrayList of Integer values

Ok, now that we are clear on how this program behaves from the user’s perspective, let’s turn to the Java code.

This Java algorithm stores the page numbers selected into an ArrayList of Integer values. Once again, the goal is to encapsulate “what the user wants” into a Java data structure.
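As a minimal sketch of how such an algorithm could look, the class below validates the input against the pattern "([0-9]+[,-]{0,1})+" quoted in this article and then expands it into an ArrayList of Integer values. The names PageSelectionParser and parsePages are hypothetical. Stripping whitespace before matching and rejecting tokens such as "1-2-3" are assumptions made for this sketch, not necessarily what the original listing does.

```java
import java.util.ArrayList;
import java.util.regex.Pattern;

public class PageSelectionParser {
    // Pattern quoted in the article: digits optionally followed by one ',' or '-', repeated.
    private static final Pattern PAGE_SET = Pattern.compile("([0-9]+[,-]{0,1})+");
    private static final String ERROR = "The set of pages is not correct, please try again.";

    // Hypothetical helper: returns the selected pages or throws IllegalArgumentException.
    public static ArrayList<Integer> parsePages(String input) {
        // Assumption: spaces are dropped so that "2-7, 17-21" is accepted.
        String cleaned = input.replaceAll("\\s+", "");
        if (!PAGE_SET.matcher(cleaned).matches()) {
            throw new IllegalArgumentException(ERROR);
        }
        ArrayList<Integer> pages = new ArrayList<>();
        for (String token : cleaned.split(",")) {
            String[] bounds = token.split("-");
            // Assumption: chained ranges such as "1-2-3" are rejected.
            if (bounds.length > 2) {
                throw new IllegalArgumentException(ERROR);
            }
            int first = Integer.parseInt(bounds[0]);
            int last = Integer.parseInt(bounds[bounds.length - 1]);
            // A single page is treated as a range of length one.
            for (int page = first; page <= last; page++) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(parsePages("2-7, 17-21"));
        // prints [2, 3, 4, 5, 6, 7, 17, 18, 19, 20, 21]
    }
}
```

Note that a reversed range such as "7-2" passes the regex check but contributes no pages in this sketch; a production app would probably want to report it to the user.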

Regular expressions

At the heart of this code is the regular expression "([0-9]+[,-]{0,1})+".

In case you don’t know, regular expressions are a highly specialized language for describing multiple text strings with one single text pattern. They are available in most programming languages, either built in or through a library: Java, C, JavaScript, PHP, etc.
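To make the idea of “one pattern, many strings” concrete, here is a small sketch that tries the individual pieces of the pattern "([0-9]+[,-]{0,1})+" with Java’s standard Pattern.matches method. The class name RegexPieces and the helper matchesWhole are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class RegexPieces {
    // Hypothetical helper: true only if the whole text matches the regex.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // [0-9]+ means "one or more digits".
        System.out.println(matchesWhole("[0-9]+", "25"));    // true
        System.out.println(matchesWhole("[0-9]+", ""));      // false

        // [,-]{0,1} means "at most one comma or hyphen".
        System.out.println(matchesWhole("[,-]{0,1}", ","));  // true
        System.out.println(matchesWhole("[,-]{0,1}", ",,")); // false

        // The outer (...)+ repeats "number plus optional separator".
        System.out.println(matchesWhole("([0-9]+[,-]{0,1})+", "2-7,17-21")); // true
    }
}
```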

Let’s recognize, by the way, that programmers may get stuck with regular expressions because they are hard to master. Some even try to avoid them.

Since we are still in the development stage, our code snippet is not run in the context of any production app. We are just testing “the core stuff” and displaying some results.

For example:

1,2,3

2-7, 17-21

2-7, 17-21, 23,24,25

2-7, 17-21, 35-43

The four examples above match the pattern "([0-9]+[,-]{0,1})+" (once spaces are stripped, since the pattern itself does not allow whitespace), but what happens with the inputs listed below?

23/45/67

Hello world

10,,20,,30

This time, since the strings "23/45/67", "Hello world" and "10,,20,,30" don’t match the regex pattern, the output displayed is "The set of pages is not correct, please try again."
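A quick way to reproduce this kind of check is a small test driver like the sketch below. The class name PageSetCheck and the method isValid are hypothetical, and removing spaces before matching is an assumption; only the pattern and the error message come from this article.

```java
import java.util.regex.Pattern;

public class PageSetCheck {
    private static final Pattern PAGE_SET = Pattern.compile("([0-9]+[,-]{0,1})+");
    static final String ERROR = "The set of pages is not correct, please try again.";

    // Hypothetical helper: true if the whole input (ignoring spaces) matches the pattern.
    static boolean isValid(String input) {
        return PAGE_SET.matcher(input.replaceAll("\\s+", "")).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"1,2,3", "2-7, 17-21", "23/45/67", "Hello world", "10,,20,,30"};
        for (String input : inputs) {
            // Accepted inputs are echoed; rejected ones show the error message.
            System.out.println(input + " -> " + (isValid(input) ? "accepted" : ERROR));
        }
    }
}
```

Run as is, the first two inputs are reported as accepted and the last three produce the error message.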

So that’s all for now! The next post will discuss how to combine this algorithm with PlugPDF’s built-in PDFDocument object in order to create a practical PDF Splitter. It will only take a few lines of code to implement!
