Regular Expressions for Document Engineering

Gregory M. Kapfhammer

November 10, 2025

Course learning objectives

Learning Objectives for Document Engineering

  • CS-104-1: Explain processes such as software installation or design for a variety of technical and non-technical audiences ranging from inexperienced to expert.
  • CS-104-2: Use professional-grade integrated development environments (IDEs), command-line tools, and version control systems to compose, edit, and deploy well-structured, web-ready documents and industry-standard documentation tools.
  • CS-104-3: Build automated publishing pipelines to format, check, and ensure both the uniformity and quality of digital documents.
  • CS-104-4: Identify and apply appropriate conventions of a variety of technical communities, tools, and computer languages to produce industry-consistent diagrams, summaries, and descriptions of technical topics or processes.
  • This week’s content aids attainment of CS-104-2, CS-104-3, and CS-104-4!

Creating cool regular expressions in Python

  • Define a regular expression pattern
  • Use the re module for regex operations
  • Compile patterns with re.compile()
  • Raw strings (r'') prevent escape confusion

Basic regular expression steps

  • Import the module: start with import re
  • Define pattern: use raw string like r'pattern'
  • Compile pattern: create regex object with re.compile(pattern)
  • Apply pattern: run match(), search(), or findall()
  • Extract results: process match objects or lists of matches
  • Test and debug: verify with various input cases of strings
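
The six steps above can be traced in one short script. This is a minimal sketch: the pattern, the sample text, and the `word_regex` name are illustrative assumptions rather than code from the slides.

```python
# Step 1: import the module
import re

# Step 2: define the pattern as a raw string
# (assumed example: words that start with a capital letter)
pattern = r'\b[A-Z][a-z]+\b'

# Step 3: compile the pattern into a reusable regex object
word_regex = re.compile(pattern)

# Step 4: apply the pattern with search() and findall()
text = 'Document Engineering uses Python and regex'
first_match = word_regex.search(text)
all_matches = word_regex.findall(text)

# Step 5: extract results from the match object and the list
print(first_match.group())   # text of the first match
print(first_match.span())    # (start, end) position of the first match
print(all_matches)           # list of every matching word

# Step 6: test with another input, here one that should not match at all
assert word_regex.findall('no capitals here') == []
```

Note that `search()` returns a match object (or `None`), while `findall()` returns a plain list of strings, so the two results need different handling in step five.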

  • How do we create effective patterns?
  • How do we test that patterns work correctly?
  • How do we debug when patterns fail?
  • How do we optimize patterns for performance?
  • How do we reliably use them in programs?

Creating your first regular expression

  • \d matches any digit (0-9)
  • {3} means exactly 3 occurrences
  • The pattern requires hyphens at specific positions
  • Does this work for a wide variety of phone numbers? Well, try it out!
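
The bullets above describe a phone number pattern whose code is not reproduced here. The following is a minimal sketch, assuming the pattern has the common shape `\d{3}-\d{3}-\d{4}`; the sample numbers are made up for illustration.

```python
import re

# Assumed pattern: three digits, hyphen, three digits, hyphen, four digits
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

# A few candidate strings, only the first of which uses hyphens as required
samples = ['555-123-4567', '555.123.4567', '5551234567', '55-123-4567']

# fullmatch() requires the entire string to fit the pattern
results = {s: bool(phone_regex.fullmatch(s)) for s in samples}
for sample, matched in results.items():
    print(f'{sample!r}: {"match" if matched else "no match"}')
```

Only the hyphenated ten-digit form is accepted, which suggests that this pattern does not cover a wide variety of phone numbers.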

Testing regular expressions in Python

Simple pattern matching for email

Revisit the phone regular expression

  • Enhance: can hyphens be optional? Allow dots or spaces as separators?
  • Question: What are the benefits and drawbacks of regular expressions?
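
One possible answer to the enhancement question is sketched below, assuming the starting point is a strict pattern of the form `\d{3}-\d{3}-\d{4}`. The names `flexible_phone` and `consistent_phone` are hypothetical.

```python
import re

# [-. ]? allows one optional hyphen, dot, or space between digit groups
flexible_phone = re.compile(r'\d{3}[-. ]?\d{3}[-. ]?\d{4}')

samples = ['555-123-4567', '555.123.4567', '555 123 4567',
           '5551234567', '555-1234']
accepted = [s for s in samples if flexible_phone.fullmatch(s)]
rejected = [s for s in samples if not flexible_phone.fullmatch(s)]
print('accepted:', accepted)
print('rejected:', rejected)

# Drawback: mixed separators are accepted too. Capturing the first separator
# and reusing it with the backreference \1 requires both to be the same.
consistent_phone = re.compile(r'\d{3}([-. ]?)\d{3}\1\d{4}')
mixed = '555-123.4567'
print(bool(flexible_phone.fullmatch(mixed)))    # flexible pattern accepts it
print(bool(consistent_phone.fullmatch(mixed)))  # consistent pattern rejects it
```

The second pattern shows one of the drawbacks raised in the question above: each new requirement makes the expression harder to read.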

Key components of a regular expression

  • Literal characters: match exact text
  • Metacharacters: special meaning symbols
  • Character classes: sets of characters
  • Quantifiers: specify amount of repetition

Regular expression notation

  • Pattern matching: describe sets of strings concisely
  • Practical extensions of “basic” regular expressions:
    • . means any char
    • + means one or more
    • [...] is a character class
    • [a-z] matches any character in range
    • [^abc] matches any character except those listed
    • These are all “syntactic sugar” for convenience
  • Can you write an improved regex for email addresses?
  • How do you test a regular expression’s correctness?
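
The "syntactic sugar" point above can be checked directly: each shorthand has a longer equivalent built from literals, alternation, and `*`. The sketch below compares both forms on a handful of arbitrary test strings.

```python
import re

# Each pair holds a shorthand pattern and its expanded equivalent
pairs = [
    (r'a+', r'aa*'),               # one or more == one, then zero or more
    (r'[abc]', r'(?:a|b|c)'),      # character class == alternation
    (r'[a-c]', r'(?:a|b|c)'),      # range == alternation over the range
]

inputs = ['', 'a', 'aaa', 'b', 'c', 'd', 'ab']
for sugar, plain in pairs:
    for s in inputs:
        # both forms must agree on whether the whole string matches
        same = bool(re.fullmatch(sugar, s)) == bool(re.fullmatch(plain, s))
        assert same, (sugar, plain, s)
print('every shorthand agrees with its expanded form on these inputs')
```

Agreement on a few inputs is evidence, not proof, of equivalence, which is one reason why testing a pattern needs both matching and non-matching examples.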

Understanding regex metacharacters

  • . matches any single character except newline
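  • ^ matches start of string (or start of each line with re.MULTILINE)
  • $ matches end of string or just before its final newline (or end of each line with re.MULTILINE)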
  • * matches zero or more repetitions
  • + matches one or more repetitions
  • ? matches zero or one repetition
  • {n} matches exactly n repetitions
  • {n,m} matches between n and m repetitions
  • \ escapes special characters like \. to match a literal dot
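
Several of the metacharacters above can be seen in action with short calls to the `re` module. The sample strings in this sketch are invented for illustration.

```python
import re

# . matches any single character except a newline
dot = re.findall(r'a.b', 'a-b a\nb axb')

# \. matches only a literal dot
literal_dot = re.findall(r'a\.b', 'a.b a-b')

# ^ anchors to the start and $ to the end of the string
starts = [s for s in ('cat nap', 'the cat') if re.search(r'^cat', s)]
ends = [s for s in ('cat nap', 'nap time') if re.search(r'nap$', s)]

# ? makes the preceding element optional
optional_u = re.findall(r'colou?r', 'color colour colouur')

# {n,m} bounds the number of repetitions
bounded = [s for s in ('o', 'oo', 'ooo', 'oooo') if re.fullmatch(r'o{2,3}', s)]

print(dot, literal_dot, starts, ends, optional_u, bounded, sep='\n')
```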

Character classes in regex

  • [abc] matches any single character a, b, or c
  • [a-z] matches any lowercase letter
  • [A-Z] matches any uppercase letter
  • [0-9] matches any digit
  • [^abc] matches any character except a, b, or c
  • \d matches any digit (the same as [0-9] for ASCII text; Python 3 str patterns also match other Unicode digits unless re.ASCII is set)
  • \w matches word characters (i.e., letters, digits, and underscore)
  • \s matches whitespace (i.e., spaces, tabs, and newlines)
  • \D, \W, \S are negations of the above three classes
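
The character classes above can be compared side by side on a single string. This is a small sketch with an invented sample line; the last two lines illustrate how `\d` treats non-ASCII digits in Python 3.

```python
import re

text = 'Room B-42, floor_3\tEast wing'

digits = re.findall(r'\d', text)              # individual digits
words = re.findall(r'\w+', text)              # runs of letters, digits, underscore
uppercase = re.findall(r'[A-Z]', text)        # uppercase ASCII letters
punctuation = re.findall(r'[^\w\s]', text)    # neither word character nor whitespace
whitespace = re.findall(r'\s', text)          # spaces and the tab

# \u0663 is ARABIC-INDIC DIGIT THREE: matched by \d unless re.ASCII is set
unicode_digits = re.findall(r'\d', '\u06637')
ascii_digits = re.findall(r'\d', '\u06637', flags=re.ASCII)

print(digits, words, uppercase, punctuation, whitespace, sep='\n')
```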

Explore quantifiers like * and +

Further exploration of quantifiers

  • Experiment: try changing the pattern to \d+ or \d* to see how matching behavior changes! What did you discover and learn?
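
The experiment above can be run as follows. The starting pattern is not shown on the slide, so this sketch assumes `\d{3}` as the baseline and uses a made-up sentence.

```python
import re

text = 'Order 7 shipped 12345 items in 2024'

# {3}: fixed-size chunks of exactly three digits
exactly_three = re.findall(r'\d{3}', text)

# +: each whole run of one or more digits
one_or_more = re.findall(r'\d+', text)

# *: zero or more digits, so the empty string also counts as a match
zero_or_more = re.findall(r'\d*', text)
non_empty = [m for m in zero_or_more if m]

print(exactly_three)
print(one_or_more)
print(f'{len(zero_or_more)} matches with *, of which {len(non_empty)} are non-empty')
```

The `*` version reports an empty match at every position where no digit starts, which is usually a sign that `+` was the intended quantifier.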

Use regular expressions for pattern matching

  • Search: find pattern anywhere in string
  • Match: check if pattern starts string
  • Find all: extract all matches for pattern
  • Replace: substitute matched patterns

Key regex methods in Python

  • re.match(pattern, string): checks if pattern matches at start of string
  • re.search(pattern, string): finds first occurrence of pattern anywhere
  • re.findall(pattern, string): returns list of non-overlapping matches
  • re.finditer(pattern, string): returns iterator of match objects
  • re.sub(pattern, repl, string): replaces matches with new string
  • re.split(pattern, string): splits string by pattern occurrences
  • re.fullmatch(pattern, string): checks if entire string matches pattern
  • Explore: How can pattern matching aid the implementation of your document engineering project? What are new features that you could add? How would you test them to ensure system correctness?
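
The methods listed above that return something other than a single match are sketched here on one sample string. The version-number text and the `version_regex` name are assumptions made for illustration.

```python
import re

text = 'v1.2.0 released; v1.3.1 pending, v2.0.0 planned'
version_regex = re.compile(r'v(\d+)\.(\d+)\.(\d+)')

# finditer yields match objects, which expose both text and position
positions = [(m.group(), m.start()) for m in version_regex.finditer(text)]

# findall returns tuples of groups when the pattern contains groups
group_tuples = version_regex.findall(text)

# sub replaces every match; \1 refers back to the first captured group
majors_only = version_regex.sub(r'v\1.x', text)

# split breaks the string wherever the separator pattern matches
parts = re.split(r'[;,]\s*', text)

print(positions, group_tuples, majors_only, parts, sep='\n')
```

When positions or named groups are needed, `finditer` tends to be a better fit than `findall`, since `findall` discards the match objects.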

Search versus match methods

  • Search is most flexible for finding patterns
  • Match checks only at the start of the string
  • Fullmatch requires exact pattern conformity
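
The difference between the three methods shows up clearly when one pattern is applied to inputs that place the target text in different positions. This is a minimal sketch using an ISO date pattern and invented inputs.

```python
import re

date_regex = re.compile(r'\d{4}-\d{2}-\d{2}')

inputs = ['2025-11-10', '2025-11-10 draft', 'draft 2025-11-10']

# For each input, record which of the three methods finds a match
table = {}
for s in inputs:
    table[s] = {
        'match': date_regex.match(s) is not None,        # start of string only
        'search': date_regex.search(s) is not None,      # anywhere in string
        'fullmatch': date_regex.fullmatch(s) is not None # entire string
    }
    print(f'{s!r}: {table[s]}')
```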

Finding all matches with findall

import re

text = 'Contact us at support@example.com or sales@company.org'
email_pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

# find all email addresses
emails = re.findall(email_pattern, text)
print(f"Found {len(emails)} email addresses:")
for email in emails:
    print(f"  - {email}")

Found 2 email addresses:
  - support@example.com
  - sales@company.org

  • \b represents word boundaries
  • + matches one or more repetitions of the preceding element
  • Character classes [...] define valid characters

Try (again) to extract emails!

  • Modification: change the text to include your own suitable emails
  • Exploration: identify some email addresses that are not detected
  • Extension: try to make pattern detection for emails more robust

Regular expressions for document engineering

  • Extract metadata: parse dates, versions, identifiers
  • Validate input: confirm format compliance
  • Clean text: remove unwanted patterns
  • Analyze content: find patterns in documents
  • Confirm correctness: write tests that exercise each pattern
  • Ensure understanding: be able to explain what each part of the regex does

Extracting dates from documents

  • Prosegrammers extract structured data from unstructured text
  • Different date formats (e.g., ISO versus US) require different patterns
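
The point about different formats needing different patterns can be sketched with two compiled expressions. This example assumes ISO means `YYYY-MM-DD` and US means `MM/DD/YYYY`; the function name `extract_all_dates` and the sample sentence are hypothetical.

```python
import re

# Assumed formats: ISO is YYYY-MM-DD and US is MM/DD/YYYY
ISO_DATE = re.compile(r'\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b')
US_DATE = re.compile(r'\b(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})\b')

def extract_all_dates(text: str) -> list[tuple[int, int, int]]:
    """Return (year, month, day) tuples for every ISO or US date in text."""
    found = []
    for regex in (ISO_DATE, US_DATE):
        for m in regex.finditer(text):
            # named groups give the same field names for both formats
            found.append((int(m['year']), int(m['month']), int(m['day'])))
    return found

sample = 'Drafted 2025-11-10, reviewed 11/12/2025, published 2025-11-14.'
dates = extract_all_dates(sample)
print(dates)
```

Results are grouped by pattern rather than by position in the text, and neither pattern rejects impossible values such as month 13; both are limitations worth covering with tests.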

Cleaning markdown formatting
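
A minimal sketch of Markdown cleaning with `re.sub` follows. It assumes that only links, bold, italic, inline code, and heading markers need to be removed; real Markdown has many more cases, and the function name `strip_markdown` is hypothetical.

```python
import re

def strip_markdown(text: str) -> str:
    """Remove a few common Markdown markers, keeping the visible text."""
    # [label](url) -> label
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)
    # **bold** -> bold (handled before single asterisks)
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)
    # *italic* -> italic
    text = re.sub(r'\*(.+?)\*', r'\1', text)
    # `code` -> code
    text = re.sub(r'`([^`]+)`', r'\1', text)
    # leading '#' heading markers at the start of any line
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)
    return text

markdown = '# Intro\nUse **bold**, *italic*, `code`, and [links](https://example.com).'
cleaned = strip_markdown(markdown)
print(cleaned)
```

The order of substitutions matters here: removing single asterisks first would leave stray markers behind for bold text. Nested or escaped markers are not handled, which is where a dedicated Markdown parser becomes the safer choice.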

Validating document structure

  • Document engineering tools verify structure correctness
  • Regex helps enforce formatting conventions
  • Make sure that your patterns are tested and reliable
  • Aim to avoid false positives and false negatives
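
As one illustration of enforcing a formatting convention, the sketch below flags Markdown heading lines that lack a space after the `#` markers or use more than six of them. The rule itself, the function name `find_heading_problems`, and the sample document are assumptions made for this example.

```python
import re

# A well-formed heading: 1-6 '#' characters, one space, then visible text
GOOD_HEADING = re.compile(r'#{1,6} \S.*')
# Any line that starts with '#' is treated as an attempted heading
HEADING_LIKE = re.compile(r'#')

def find_heading_problems(document: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for heading lines that break the rule."""
    problems = []
    for number, line in enumerate(document.splitlines(), start=1):
        # match() checks the start of the line; fullmatch() checks all of it
        if HEADING_LIKE.match(line) and not GOOD_HEADING.fullmatch(line):
            problems.append((number, line))
    return problems

doc = '# Title\n\n##Missing space\n\n### Fine heading\n####### Too deep\n'
problems = find_heading_problems(doc)
print(problems)
```

This check would also flag a `#` comment inside a code block, which is a false positive of exactly the kind the last bullet warns about.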

How can we test regular expressions?

  • Unit tests: verify pattern correctness
  • Test cases: positive and negative examples
  • Edge cases: empty strings or special characters
  • Frameworks: use unittest or pytest
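
For comparison with `unittest`, here is a sketch of the same ideas in the style that pytest collects: plain functions whose names start with `test_` and that use bare `assert` statements. The pattern for objective identifiers such as CS-104-2 and the helper `is_objective_id` are hypothetical, and the file is assumed to be run with the `pytest` command.

```python
import re

# Hypothetical pattern for objective identifiers such as CS-104-2
OBJECTIVE_ID = re.compile(r'CS-\d{3}-\d+')

def is_objective_id(text: str) -> bool:
    """Return True when the whole string is an objective identifier."""
    return OBJECTIVE_ID.fullmatch(text) is not None

def test_accepts_valid_ids():
    """Positive examples: strings the pattern should accept."""
    assert is_objective_id('CS-104-1')
    assert is_objective_id('CS-104-12')

def test_rejects_invalid_ids():
    """Negative examples: strings the pattern should reject."""
    assert not is_objective_id('CS-104')
    assert not is_objective_id('cs-104-1')
    assert not is_objective_id('CS-1040-1')

def test_edge_cases():
    """Edge cases: empty input and stray whitespace."""
    assert not is_objective_id('')
    assert not is_objective_id(' CS-104-1')
    assert not is_objective_id('CS-104-1\n')
```

Each function covers one of the categories listed above, so a failure points directly at the kind of input that the pattern mishandles.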

Testing regex with unittest

import unittest
import re

class TestEmailRegex(unittest.TestCase):
    def setUp(self):
        """Set up test fixtures."""
        self.email_pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
        self.regex = re.compile(self.email_pattern)
    
    def test_valid_email(self):
        """Test that valid emails match."""
        # fullmatch returns a match object on success
        self.assertIsNotNone(self.regex.fullmatch('user@example.com'))
        self.assertIsNotNone(self.regex.fullmatch('test.user@domain.co.uk'))

    def test_invalid_email(self):
        """Test that invalid emails do not match."""
        # fullmatch returns None when the whole string does not match
        self.assertIsNone(self.regex.fullmatch('invalid@'))
        self.assertIsNone(self.regex.fullmatch('@example.com'))
        self.assertIsNone(self.regex.fullmatch('no-at-sign.com'))

unittest.main(argv=['ignored'], verbosity=2, exit=False)

Testing regex patterns systematically

  • Positive tests: verify pattern matches valid inputs
  • Negative tests: ensure pattern rejects invalid inputs
  • Edge cases: test empty strings, very long inputs, special characters
  • Boundary tests: check minimum and maximum length requirements
  • Real-world data: use actual document samples for testing
  • Test coverage: ensure all pattern components are tested
  • Refactoring: change patterns confidently with comprehensive tests

Well-tested regex patterns make document engineering tools reliable and maintainable! Please test all methods that use regular expressions!

Testing date extraction function

import unittest
import re

def extract_dates(text: str) -> list:
    """Extract all ISO format dates from text."""
    pattern = r'\d{4}-\d{2}-\d{2}'
    return re.findall(pattern, text)

class TestDateExtraction(unittest.TestCase):
    def test_single_date(self):
        """Test extraction of single date."""
        result = extract_dates("Meeting on 2024-03-15")
        self.assertEqual(result, ['2024-03-15'])
    
    def test_multiple_dates(self):
        """Test extraction of multiple dates."""
        text = "From 2024-01-01 to 2024-12-31"
        result = extract_dates(text)
        self.assertEqual(result, ['2024-01-01', '2024-12-31'])
    
    def test_no_dates(self):
        """Test text with no dates."""
        result = extract_dates("No dates here")
        self.assertEqual(result, [])

unittest.main(argv=['ignored'], verbosity=2, exit=False)

Practical regex testing strategies

  • Start simple: test basic cases before complex ones
  • Use online tools: regex101.com for pattern debugging
  • Document patterns: add comments explaining regex logic
  • Version patterns: track changes to regex as requirements evolve
  • Benchmark performance: test speed with large documents
  • Handle errors: catch re.error for malformed patterns and check for None when no match is found
  • Share test data: maintain test document collection for validation

Testing regex is essential because patterns can be complex and hard to understand — subtle bugs can hide in tricky edge cases!
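
Two of the strategies above, documenting patterns and handling errors, can be sketched together. The example assumes the ISO date pattern used elsewhere in these slides; the helper name `safe_compile` is hypothetical.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows '#' comments
ISO_DATE = re.compile(r'''
    \d{4}   # four-digit year
    -       # literal hyphen
    \d{2}   # two-digit month
    -       # literal hyphen
    \d{2}   # two-digit day
''', re.VERBOSE)

def safe_compile(pattern: str):
    """Compile a pattern, returning None if the pattern itself is malformed."""
    try:
        return re.compile(pattern)
    except re.error as error:
        print(f'invalid pattern {pattern!r}: {error}')
        return None

match = ISO_DATE.fullmatch('2024-03-15')
broken = safe_compile(r'(\d+')     # unbalanced parenthesis
working = safe_compile(r'(\d+)')
```

A helper like `safe_compile` is mainly useful when patterns come from users or configuration files; patterns written in source code are usually better left to fail loudly.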

Benefits and limitations of regular expressions

  • Benefits:
    • Powerful pattern matching in compact syntax
    • Built-in support across programming languages
    • Fast for many text processing tasks
    • Great for validation and extraction
  • Limitations:
    • Can become complex and hard to read
    • Not suitable for parsing nested structures
    • Performance issues with catastrophic backtracking
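
Catastrophic backtracking can be observed with a classic nested-quantifier pattern. The sketch below times a search that is guaranteed to fail; the input sizes are kept small on purpose, and actual timings will vary by machine, so no specific numbers are claimed.

```python
import re
import time

# Nested quantifiers: (a+)+ can split a run of 'a' characters in many ways
risky = re.compile(r'(a+)+$')
# Accepts the same strings, but with only one way to match a run of 'a'
safer = re.compile(r'a+$')

def time_failed_match(regex, n: int) -> float:
    """Time a search that must fail: n 'a' characters followed by 'b'."""
    text = 'a' * n + 'b'
    start = time.perf_counter()
    result = regex.search(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed

for n in (8, 12, 16):
    print(n,
          f'risky={time_failed_match(risky, n):.5f}s',
          f'safer={time_failed_match(safer, n):.5f}s')
```

With the risky pattern, each extra character roughly doubles the work on a failing input, so lengths much beyond those shown can take a very long time.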

When to use regex for documents

  • Good use cases:
    • Extracting emails, URLs, dates from text
    • Validating input formats (e.g., simple phone numbers or IDs)
    • Simple text cleaning and normalization
    • Finding keywords or patterns in documents
    • Basic markdown or syntax highlighting
  • Consider alternatives for:
    • Parsing HTML or XML (use BeautifulSoup or lxml)
    • Complex nested structures (use a dedicated parser, such as a Markdown parser)
    • Full language parsing (use AST-based tools)
    • Large-scale text analysis (use NLP libraries)

Regex best practices

  • Use raw strings with r'' to avoid escape character confusion
  • Compile patterns once and reuse for better performance
  • Add comments to explain complex patterns
  • Test patterns thoroughly with diverse inputs
  • Use named groups for readability: (?P<name>...)
  • Keep patterns simple; split complex logic into multiple patterns
  • Use online tools like regex101 for development and testing
  • Document assumptions about input format
  • Handle edge cases gracefully in production code
  • Again, write tests for all functions using regex!
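
Several of these practices fit into a few lines: a raw string, a pattern compiled once at module level, named groups, and a documented assumption about the input. The sketch assumes metadata arrives as simple `key: value` lines; the names `METADATA_LINE` and `parse_metadata` are hypothetical.

```python
import re

# Assumption: each metadata entry is a single 'key: value' line.
# Compiled once at module level, then reused by every call.
METADATA_LINE = re.compile(
    r'^(?P<key>[A-Za-z_]+):[ \t]*(?P<value>.+)$',
    re.MULTILINE,
)

def parse_metadata(header: str) -> dict[str, str]:
    """Collect 'key: value' lines into a dictionary."""
    # named groups make the extraction readable: m['key'], m['value']
    return {m['key']: m['value'].strip() for m in METADATA_LINE.finditer(header)}

header = (
    'title: Regular Expressions\n'
    'author: Gregory M. Kapfhammer\n'
    'date: 2025-11-10\n'
)
metadata = parse_metadata(header)
print(metadata)
```

Lines that do not fit the assumed shape are silently skipped, which may or may not be the desired behavior; a test with malformed input makes that choice explicit.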

Find an open-source Python project that contains a regular expression!

  • What did you find? How does it work?
  • What are the benefits and limitations?
  • Share the link and a code segment
  • Here is an example from GatorGrader:
MULTILINECOMMENT_RE_JAVA = r"""/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/"""
SINGLELINECOMMENT_RE_JAVA = r"""^(?:[^"/\\]|\"(?:[^\"\\]|\\.)*\"|/(?:[^/"\\]|\\.)|/\"(?:[^\"\\]|\\.)*\"|\\.)*//(.*)$"""
SINGLELINECOMMENT_RE_PYTHON = r"""^(?:[^"#\\]|\"(?:[^\"\\]|\\.)*\"|/(?:[^#"\\]|\\.)|/\"(?:[^\"\\]|\\.)*\"|\\.)*#(.*)$"""
MULTILINECOMMENT_RE_PYTHON = r'^[ \t]*"""(.*?)"""[ \t]*$'

Course goals reminder

  • Document Creation:
    • Design and implement document generation workflows
    • Test all aspects of documents to ensure quality and accuracy
    • Create frameworks for automated document production
  • Document Analysis:
    • Collect and analyze data about document usage and quality
    • Visualize insights to improve documentation strategies
  • Document Processing:
    • Use regex for pattern matching and text extraction
    • Build validation tools for document structure
    • Clean and normalize document content programmatically
  • Check syllabus for details about Document Engineering course!

Regular expressions aid prosegramming

  • Next steps with regular expression techniques:
    • Find locations in your tool where regex could add value:
      • Could pattern matching help validate input formats?
      • Would text extraction improve document processing?
      • Could regex speed up an automated content analysis?
    • Combine multiple patterns for powerful document tools
    • How would regex make your document tools more intelligent?

Key takeaways for prosegrammers

  • Understand regular expression basics
    • Use raw strings (r'') and re.compile() for clarity and reuse
    • Know metacharacters, character classes, and quantifiers
  • Apply regular expressions thoughtfully
    • Use search, match, findall, and sub appropriately
    • Prefer specialized parsers for complex formats (e.g., HTML or Markdown)
  • Test and validate patterns
    • Write unit tests with positive, negative, and edge cases
    • Benchmark patterns to avoid “catastrophic backtracking”
  • Practical prosegrammer tips
    • Document and comment complex patterns and use named groups
    • Compile once and reuse patterns for performance
    • Use real-world sample data when testing