Using Data Containers for Document Engineering

Gregory M. Kapfhammer

October 13, 2025

Data containers

  • What is document engineering?
    • Creating documents using code
    • Manipulating and analyzing text data
    • Building documentation systems
    • “Prosegrammers” combine prose and programming
  • What are this week’s highlights?
    • Explore Python data containers for document engineering
      • Lists for organizing document collections and sections
      • Tuples for storing document metadata immutably
      • Sets for managing unique keywords and tags

Key insights for prosegrammers

  • Document engineering means blending code and prose to build resources for both humans and machines
  • Python containers organize document data efficiently: lists for sequences, tuples for records, sets for uniqueness
  • Data containers can store multiple documents, in different formats, with different data and metadata

Python collections overview

  • Lists: mutable sequences for document sections
    • Store chapters, paragraphs, or document versions
    • Perfect for ordered content that may change
    • Support appending, removing, and modifying elements
  • Tuples: immutable records for document metadata
    • Store title, author, date information safely
    • Guaranteed not to change accidentally
    • Efficient for fixed document properties
  • Sets: unique collections for document keywords
    • Eliminate duplicate tags automatically
    • Fast membership testing and set operations
    • Perfect for managing document categories

Using lists in Python

  • Creating document collections
    • Store related files in ordered sequences
    • Build documentation hierarchies
  • Modifying document structures
    • Add, remove, and reorganize content
    • Update documentation dynamically
  • Accessing document elements
    • Find specific documents by position
    • Process collections systematically

Basic list operations

  • Create document list, iterate with for loop, and then display details
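
A minimal sketch of these steps, assuming a small collection of hypothetical file names (the names are illustrative and do not come from a real project):

```python
# Hypothetical file names, used only for illustration.
documents = ["introduction.md", "methods.md", "results.md"]

# Iterate with a for loop and display each document's position and name.
for position, name in enumerate(documents):
    print(f"{position}: {name}")

# Index-based access and the size of the collection.
first_document = documents[0]
document_count = len(documents)
print(f"First: {first_document}, total: {document_count}")
```

Using `enumerate` keeps the position and the name together, which avoids maintaining a separate counter variable.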

Two-dimensional lists

  • Create a list of lists, iterate with for loop, and then display details
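
One way this could look, assuming each inner list holds the section titles of one hypothetical document:

```python
# Each inner list holds the section titles of one hypothetical document.
document_sections = [
    ["Title", "Abstract", "Introduction"],
    ["Overview", "Installation", "Usage", "License"],
]

# The outer loop visits documents; the inner loop visits their sections.
section_counts = []
for number, sections in enumerate(document_sections, start=1):
    print(f"Document {number}:")
    for section in sections:
        print(f"  {section}")
    section_counts.append(len(sections))

# Two indexes select a document first and then a section inside it.
second_doc_third_section = document_sections[1][2]
print(second_doc_third_section)
```

The inner lists may have different lengths, so a list of lists does not have to be a rectangular grid.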

Modifying lists dynamically
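
A short sketch of in-place list modification, assuming a hypothetical list of section titles; each comment shows the list after that step:

```python
# Hypothetical section titles for one document.
sections = ["Introduction", "Methods"]

sections.append("Conclusion")    # ["Introduction", "Methods", "Conclusion"]
sections.insert(2, "Results")    # ["Introduction", "Methods", "Results", "Conclusion"]
sections.remove("Methods")       # ["Introduction", "Results", "Conclusion"]
sections[0] = "Overview"         # ["Overview", "Results", "Conclusion"]

print(sections)
```

All four operations change the existing list object instead of creating a new one, which is what makes lists suitable for content that evolves.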

Lists for document engineering

  • Document collections: store related files in order
  • Dynamic modification: add, insert, and remove content
  • Flexible organization: restructure documents as needed
  • Index-based access: retrieve specific sections efficiently
  • Next steps for understanding how to use lists:
    • Find a location in your document engineering tool where you used lists
      • Is it working correctly?
      • How did you test and debug it?
      • How can you refactor the code?
    • If you did not find a list being used, how could you add one to your project?

Using lists in Python for document engineering

  • Input one or more documents from the file system
  • Parse each document to a data structure instance
  • Store all data structures for each document in a list
  • Iterate through the list to process all data structures
  • Output the results of the analysis to the console
  • How could you extend your tool to handle multiple files?
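
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the functions `parse_document` and `analyze_directory`, the dictionary record format, and the sample Markdown files are all hypothetical, and the demo writes its own files to a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def parse_document(path: Path) -> dict:
    """Parse one file into a simple record (a hypothetical format)."""
    text = path.read_text(encoding="utf-8")
    return {
        "name": path.name,
        "lines": len(text.splitlines()),
        "words": len(text.split()),
    }


def analyze_directory(directory: Path) -> list[dict]:
    """Parse every Markdown file in a directory and collect the records in a list."""
    records = []
    for path in sorted(directory.glob("*.md")):
        records.append(parse_document(path))
    return records


# Demo: create two sample files so the example is self-contained.
with tempfile.TemporaryDirectory() as temp_name:
    temp_dir = Path(temp_name)
    (temp_dir / "a.md").write_text("# Title\n\nHello world\n", encoding="utf-8")
    (temp_dir / "b.md").write_text("One two three\n", encoding="utf-8")
    records = analyze_directory(temp_dir)

# Iterate through the list and output the analysis to the console.
for record in records:
    print(f"{record['name']}: {record['lines']} lines, {record['words']} words")
```

Because every parsed document lands in the same list, handling more files only requires that more paths match the pattern; the processing loop stays unchanged.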

Tuples for storing immutable metadata

  • Creating immutable records
    • Store document properties safely
    • Prevent accidental data changes
  • Organizing metadata collections
    • Build consistent document catalogs
    • Maintain data integrity
  • Analyzing document metrics
    • Extract statistics from records
    • Process structured data efficiently

Basic tuple operations

  • Create metadata tuple, iterate with for loop, and then display details
  • How is a tuple different from a list? How do we use it differently?
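
A minimal sketch of a metadata tuple, assuming a three-field layout of title, author, and year (the layout is an illustrative choice; the values are taken from this presentation's title slide):

```python
# A metadata record with an assumed layout of (title, author, year).
metadata = (
    "Using Data Containers for Document Engineering",
    "Gregory M. Kapfhammer",
    2025,
)

# Iterate with a for loop and display each field.
for value in metadata:
    print(value)

# Tuple unpacking assigns each field to its own name in one step.
title, author, year = metadata
print(f"{title} by {author} ({year})")
```

Indexing and iteration work the same way as with a list; the difference is that a tuple offers no `append`, `insert`, or `remove`, so the record keeps its shape after creation.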

Document analysis with tuples

  • Create list of tuples, iterate with for loop, and then display details
  • Many combinations of data structures (e.g., lists and tuples) are possible!
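
One possible sketch, assuming each tuple stores a file name, a word count, and a reading level (the records are invented for illustration):

```python
# Each record has an assumed layout of (file name, word count, reading level).
catalog = [
    ("introduction.md", 450, "beginner"),
    ("methods.md", 1200, "advanced"),
    ("results.md", 750, "intermediate"),
]

# Unpack each tuple in the loop header and accumulate a total.
total_words = 0
for name, word_count, level in catalog:
    print(f"{name}: {word_count} words ({level})")
    total_words += word_count

average_words = total_words / len(catalog)
longest = max(catalog, key=lambda record: record[1])
print(f"Average: {average_words}, longest: {longest[0]}")
```

The list can grow or shrink as documents are added and removed, while each individual record stays fixed.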

Immutability versus mutable contents

  • The TypeError shows that tuple elements cannot be reassigned
  • Yet an append call modifies the list in place, changing meta’s contents
  • This shows that tuples are immutable, but their contents may be mutable
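
A minimal sketch of this behavior, assuming a tuple named `meta` whose second element is a list of tags (the file name and tags are illustrative):

```python
# A tuple that contains an immutable string and a mutable list.
meta = ("report.md", ["draft", "python"])

# Reassigning an element of the tuple raises a TypeError.
caught = False
try:
    meta[1] = ["final"]
except TypeError as error:
    caught = True
    print(f"TypeError: {error}")

# The list inside the tuple is still mutable, so append succeeds.
meta[1].append("reviewed")
print(meta)
```

The tuple always refers to the same list object; only what that list holds has changed. Storing the tags as a nested tuple or a `frozenset` would make the whole record immutable.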

Tuples for document engineering

  • Immutable records: metadata cannot be accidentally changed
  • Structured data: consistent format for document properties
  • Tuple unpacking: easy extraction of individual values
  • Statistical analysis: compute metrics across documents
  • Next steps for understanding how to use tuples:
    • Find a location in your document engineering tool where you used tuples
      • Is it working correctly?
      • How did you test and debug it?
      • How can you refactor the code?
    • If you did not find a tuple being used, how could you add one to your project?

Sets for storing document keywords

  • Managing unique keywords
    • Eliminate duplicate tags automatically
    • Build clean tag collections
  • Performing set operations
    • Find common and unique tags
    • Analyze document relationships
  • Categorizing documents
    • Organize content by complexity
    • Create document taxonomies

Basic set operations for documents

  • Create list of tags for multiple documents under analysis
  • Find those tags that are unique and those that are common
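
A short sketch of these two steps, assuming three hypothetical documents with hand-written tag lists:

```python
# Tag lists for three hypothetical documents; the first repeats "python".
doc_tags = [
    ["python", "tutorial", "lists", "python"],
    ["python", "reference", "tuples"],
    ["python", "tutorial", "sets"],
]

# Converting each list to a set removes duplicate tags.
tag_sets = [set(tags) for tags in doc_tags]

# Union: every distinct tag used by any document.
all_tags = set.union(*tag_sets)

# Intersection: tags shared by all of the documents.
common_tags = set.intersection(*tag_sets)

# Difference: tags that appear only in the first document.
only_in_first = tag_sets[0] - tag_sets[1] - tag_sets[2]

print(sorted(all_tags))
print(sorted(common_tags))
print(sorted(only_in_first))
```

Sets are unordered, so the sketch sorts them before printing to get a stable display.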

Sets for document engineering

  • Automatic uniqueness: no duplicate tags allowed
  • Set operations: union, intersection, difference for analysis
  • Tag management: organize and categorize document content
  • Membership testing: quickly check whether a tag or document is already present
  • Next steps for understanding how to use sets:
    • Find a location in your document engineering tool where you used sets
      • Is it working correctly?
      • How did you test and debug it?
      • How can you refactor the code?
    • If you did not find a set being used, how could you add one to your project?

Container review

Container  Mutable  Ordered  Duplicates   Best For
List       Yes      Yes      Allowed      Document sections, chapters, file collections
Tuple      No       Yes      Allowed      Document metadata, fixed records, coordinates
Set        Yes      No       Not Allowed  Keywords, tags, unique identifiers

Summary of data container choices

  • Lists: when you need to modify and maintain order
  • Tuples: when data should never change
  • Sets: when uniqueness matters most
  • Customize your own data containers to meet your tool’s needs!
  • Next steps for understanding how to use containers:
    • Think of a feature for your document engineering tool needing a container:
      • How would your tool’s feature use a container?
      • If you could use a container, what would be the benefit?
      • What type of container would you pick for this feature?
      • How would you test to confirm that the container works correctly?

Enhanced document analysis

  • Create list of sample documents that contain simple text
  • Determine the number of unique words across all of the documents
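
A minimal sketch of this analysis, assuming three short sample sentences and a deliberately simple tokenizer that lowercases the text, splits on whitespace, and strips common punctuation:

```python
# Sample documents containing simple text (invented for illustration).
documents = [
    "Python lists store document sections in order.",
    "Python tuples store document metadata.",
    "Python sets store unique document tags.",
]

# Collect every distinct word in a single set.
unique_words = set()
for text in documents:
    for word in text.lower().split():
        cleaned = word.strip(".,;:!?\"'()")
        if cleaned:
            unique_words.add(cleaned)

print(f"Unique words: {len(unique_words)}")
print(sorted(unique_words))
```

Adding a word that is already in the set has no effect, so no explicit duplicate check is needed. Real documents would likely need a more careful tokenizer than this one.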

Key takeaways for prosegrammers

  • Choose the right container
    • Lists for document sequences and mutable collections
    • Tuples for immutable metadata and structured records
    • Sets for unique keywords, tags, and categories
  • Master container operations
    • Create, access, modify, and analyze document data
    • Use indexing, slicing, and iteration effectively
    • Apply set operations for document categorization
  • Think and act like a prosegrammer
    • Combine containers to solve complex document analysis challenges
    • Use type hints to make your Python code clear and maintainable
    • Apply containers to handle real-world document engineering challenges
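
As one illustration of combining containers with type hints, the sketch below stores a list of tuples that each contain a set. The `DocumentRecord` alias, the `documents_with_tag` function, and the sample records are hypothetical, and the built-in generic syntax assumes Python 3.9 or later.

```python
# Assumed record layout: (file name, word count, set of tags).
DocumentRecord = tuple[str, int, set[str]]


def documents_with_tag(records: list[DocumentRecord], tag: str) -> list[str]:
    """Return the names of the documents whose tag set contains the tag."""
    return [name for name, _, tags in records if tag in tags]


# A list for ordering, tuples for records, and sets for unique tags.
records: list[DocumentRecord] = [
    ("introduction.md", 450, {"python", "tutorial"}),
    ("reference.md", 1200, {"python", "api"}),
]

tutorial_documents = documents_with_tag(records, "tutorial")
print(tutorial_documents)
```

The type alias documents the shape of each record in one place, so a reader or a type checker can see what the function expects without inspecting the data.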