Splitting XML Documents at Milestone Elements

Using the XSLT Upward Projection Method

Gerrit Imsieke (@gimsieke), le-tex publishing services (@letexml)

Problems addressed / Applications

  • Splitting TEI documents at page breaks
  • Putting two-column regions into separate XSL-FO blocks
  • Splitting at line breaks, excluding footnotes, list items, etc.

Also:

  • Performance analysis
  • Streamability?
  • Support arbitrary XML vocabularies (dynamic XPath evaluation)

Eliot Kimber’s FO Block Splitting Woes

DITA OT Day 2018 video (~2:30–4:25), xsl-list post

Eliot Kimber’s DITA OT Day 2018 presentation

Geert Borman’s Split-at-Page-Break Woes

“breaking up XML on page break element”

(Geert’s post to the XSL mailing list, 2014-07-04)

An “overlapping markup” problem (page division vs. document hierarchy)

Test Document for Splitting at Page Breaks

Martin Luther’s translation of the New Testament into German (1522), TEI P5 XML from Deutsches Textarchiv

Luther’s New Testament as TEI XML

452 pb milestone elements at varying depths

<pbs>
  <pb path="/TEI/text/body/div/div/p/pb" count="238"/>
  <pb path="/TEI/text/body/div/div/pb" count="91"/>
  <pb path="/TEI/text/body/pb" count="52"/>
  <pb path="/TEI/text/body/div/pb" count="47"/>
  <pb path="/TEI/text/front/pb" count="11"/>
  <pb path="/TEI/text/body/div/p/pb" count="10"/>
  <pb path="/TEI/text/front/div/p/pb" count="3"/>
</pbs>
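A summary like the one above can be produced with a small grouping stylesheet. The following is a minimal sketch, not necessarily how the listing was actually generated; it assumes XSLT 2.0 and that the input uses the TEI P5 namespace.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <pbs>
      <!-- Group all pb elements by their element path, e.g. /TEI/text/body/pb -->
      <xsl:for-each-group select="//pb"
          group-by="string-join(('', ancestor-or-self::*/name()), '/')">
        <!-- Most frequent path first -->
        <xsl:sort select="count(current-group())" data-type="number"
                  order="descending"/>
        <pb path="{current-grouping-key()}" count="{count(current-group())}"/>
      </xsl:for-each-group>
    </pbs>
  </xsl:template>

</xsl:stylesheet>
```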

Simplified/Abridged Tree View

The XSLT Upward Projection Method

Process /TEI in split mode

with a tunneled $restricted-to parameter

(omitting xsl:result-document here for brevity)

The “Conditional Identity Template”
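The two pieces named on these slides, the grouping template that processes /TEI in split mode and the conditional identity template, can be sketched together as one small stylesheet. This is an illustrative reconstruction under assumptions, not the original slide code: it assumes XSLT 2.0, the TEI P5 namespace, and that "leaf" means any node without children below the TEI text element. As on the slide, xsl:result-document is omitted, so all chunks end up in one result.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Group the leaves into pages, then project the whole tree once per page -->
  <xsl:template match="/TEI">
    <xsl:variable name="root" select="."/>
    <xsl:for-each-group select="text/descendant::node()[empty(node())]"
                        group-starting-with="pb">
      <xsl:apply-templates select="$root" mode="split">
        <!-- The leaves of this page plus everything above them -->
        <xsl:with-param name="restricted-to" tunnel="yes"
                        select="current-group()/ancestor-or-self::node()"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Conditional identity template: copy a node only if it belongs
       to the current projection -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Because the parameter is tunneled, it reaches every level of the tree without being passed on explicitly in the apply-templates call of the identity template.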

[Figures: the projected tree for each of Groups 1 to 5]

Splitting Result

The teiHeader would be missing…

…if it weren’t for this template:
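Such a template might look like the following fragment. It is a guess at the intent rather than the original code, and it assumes that the leaves are selected only below the TEI text element, that the projection mode is called split, and that the input uses the TEI P5 namespace. It copies the header into every chunk regardless of $restricted-to.

```xml
<!-- Fragment: place inside a stylesheet that defines the split mode -->
<xsl:template match="teiHeader" mode="split"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
  <!-- The header contains no selected leaves, so copy it unconditionally -->
  <xsl:copy-of select="."/>
</xsl:template>
```

The name test gives this rule a higher default priority than a generic node() rule in the same mode, so it wins without an explicit priority attribute.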

The other applications

FO block splitting

Nested grouping (group-starting-with for <two-col-start>, group-ending-with for <two-col-end>)
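The nested grouping on its own can be sketched over a flat sequence of children. This is a simplified illustration, not the FO-specific solution: the region wrapper element and its columns attribute are hypothetical, and in the upward projection setting the grouped population would be the leaf nodes rather than direct children.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transformation for everything else -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*[two-col-start]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Outer grouping: a new group begins at each start marker -->
      <xsl:for-each-group select="node()" group-starting-with="two-col-start">
        <!-- Inner grouping: cut each outer group after the end marker -->
        <xsl:for-each-group select="current-group()"
                            group-ending-with="two-col-end">
          <!-- 'region' is a hypothetical wrapper; a group that begins
               with the start marker is a two-column region -->
          <region columns="{if (current-group()[1]/self::two-col-start)
                            then 2 else 1}">
            <xsl:apply-templates
                select="current-group()[not(self::two-col-start
                                            or self::two-col-end)]"/>
          </region>
        </xsl:for-each-group>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```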

Split at line breaks

Avoid splitting at line breaks in embedded list items or footnotes by

  • modifying the leaf selection XPath to treat lists and footnotes as leaf-like
  • switching back from split to #default mode when processing list item / footnote content
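Both measures can be sketched as follows. This is a minimal illustration under assumptions, not the original stylesheet: it assumes XSLT 2.0 and the TEI element names p, lb, list and note, and it splits only paragraphs that are not themselves inside a list or note.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Default mode: plain identity transformation -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Split paragraphs at lb, but not paragraphs inside lists or notes -->
  <xsl:template match="p[.//lb][not(ancestor::list | ancestor::note)]">
    <xsl:variable name="context" select="."/>
    <!-- Leaf selection: childless nodes outside lists and notes,
         plus each outermost list or note treated as a single leaf -->
    <xsl:for-each-group
        select="descendant::node()[not(ancestor::list | ancestor::note)]
                                  [empty(node()) or self::list or self::note]"
        group-starting-with="lb">
      <xsl:apply-templates select="$context" mode="split">
        <xsl:with-param name="restricted-to" tunnel="yes"
                        select="current-group()/ancestor-or-self::node()"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Conditional identity template -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Leaf-like elements: leave split mode so that their content
       is copied whole, including any lb inside -->
  <xsl:template match="list | note" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:apply-templates select="." mode="#default"/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```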

Performance Measurements

Luther’s 1522 New Testament translation:

  • Number of pb elements: 452
  • Number of leaf nodes: 77,649

Hypothesis:

  • Performance depends linearly on doc size (node count)
  • and maybe on number of splitting points

Approximately Linear Dependence on Doc Size

Dependence on Number of Splitting Points?

  • Created docs with the first 10, 50, 100, 200, 300, 375 pages
  • using a modified upward projection XSLT
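One way such a modification could look is to project the tree once, restricted to all leaves that precede a cutoff pb. The sketch below is an assumption about the approach, not the stylesheet actually used; the pages parameter is a hypothetical name, and any content before the first pb is included in the output.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: number of pages to keep -->
  <xsl:param name="pages" as="xs:integer" select="10"/>

  <xsl:template match="/TEI">
    <!-- The pb that starts the first page to be dropped, if there is one -->
    <xsl:variable name="cutoff" as="element()?"
                  select="(text//pb)[$pages + 1]"/>
    <xsl:apply-templates select="." mode="split">
      <!-- Keep the header and all leaves before the cutoff, plus ancestors -->
      <xsl:with-param name="restricted-to" tunnel="yes"
          select="(teiHeader | text)/descendant::node()[empty(node())]
                    [empty($cutoff) or . &lt;&lt; $cutoff]
                  /ancestor-or-self::node()"/>
    </xsl:apply-templates>
  </xsl:template>

  <!-- Conditional identity template -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```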

Surprise: the first 10 pages take milliseconds, the first 375 pages take minutes

⇒ Need to measure dependence on chunk size at constant doc length

Dependence on Number of Chunks at Constant Doc Length

Repeatedly removing every other pb results in fewer chunks
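A thinning step of this kind can be sketched as an identity transformation that drops every second pb; running it repeatedly on its own output roughly halves the number of chunks each time while the rest of the document stays the same. This is a minimal sketch assuming the TEI P5 namespace, not the stylesheet actually used for the measurements.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Identity transformation -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop the 2nd, 4th, 6th, ... pb in document order -->
  <xsl:template match="pb[count(preceding::pb) mod 2 = 1]"/>

</xsl:stylesheet>
```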

Dependence on Chunk Size at Constant Doc Length

Profiling

Culprit: Conditional Identity Template

When chunk length grows…

  • …the conditional identity template is called slightly fewer times
  • …but each invocation takes significantly longer
  • For chunk sizes above 5000 leaves, running time therefore depends linearly on chunk size
  • For smaller chunks, running time still depends to some degree on the number of splitting points

Remedies

The number of Conditional Identity Template invocations cannot be reduced

Pass generated IDs instead of nodes: test generate-id() = $restricted-to instead of exists(. intersect $restricted-to)

⇒ 20-fold acceleration for large chunks
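Applied to a page-splitting stylesheet, the remedy might look like the following sketch. It assumes XSLT 2.0, the TEI P5 namespace, and leaves selected below the TEI text element; only the type and content of the tunneled parameter and the test in the conditional identity template differ from the node-based variant.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs">

  <xsl:template match="/TEI">
    <xsl:variable name="root" select="."/>
    <xsl:for-each-group select="text/descendant::node()[empty(node())]"
                        group-starting-with="pb">
      <xsl:apply-templates select="$root" mode="split">
        <!-- Pass strings (generated IDs), not nodes -->
        <xsl:with-param name="restricted-to" as="xs:string*" tunnel="yes"
            select="current-group()/ancestor-or-self::node()/generate-id()"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Conditional identity template with a string comparison -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="xs:string*" tunnel="yes"/>
    <xsl:if test="generate-id() = $restricted-to">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```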

Streaming

No. Michael Kay wrote in 2014:

“... the real problem is that the logic is going down to descendants, then up to their ancestors, and then down again, and that's intrinsically not processing nodes in document order, which is a precondition for streaming.”

Even if streaming were feasible, the unfavorable scaling with chunk size would remain a problem.

Dynamic XPath Evaluation

Can we have a configurable splitter for JATS, DocBook, TEI, HTML etc.?

  • Yes, works well!
  • No prohibitive performance penalty (runs 1.5 to 2 times as long as bespoke stylesheets)
  • Best delivered as XSLT 3.0 packages with a final entry template and private internal templates
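A configurable splitter packaged this way could be sketched as follows. Everything named here is hypothetical (the package name, the sp namespace, the template and parameter names), and the sketch assumes an XSLT 3.0 processor that supports packages and xsl:evaluate. The splitting points and the leaf selection are passed in as XPath strings and evaluated against the root element. The splitting-point nodes must also be selected as leaves, and the leaf expression is expected to return nodes in document order.

```xml
<xsl:package name="http://example.org/ns/splitter" package-version="1.0"
    version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:sp="http://example.org/ns/splitter"
    exclude-result-prefixes="xs sp">

  <!-- Internal mode, not visible to using stylesheets -->
  <xsl:mode name="sp:project" visibility="private"/>

  <!-- Entry point: final, so users can call it but not override it -->
  <xsl:template name="sp:split" visibility="final">
    <xsl:param name="root" as="element()" required="yes"/>
    <!-- XPath strings, evaluated with $root as context item,
         e.g. 'descendant::*:pb' for TEI page breaks -->
    <xsl:param name="split-at" as="xs:string" required="yes"/>
    <xsl:param name="leaves" as="xs:string"
               select="'descendant::node()[empty(node())]'"/>

    <xsl:variable name="split-points" as="node()*">
      <xsl:evaluate xpath="$split-at" context-item="$root"/>
    </xsl:variable>
    <xsl:variable name="split-ids" as="xs:string*"
                  select="$split-points/generate-id()"/>
    <xsl:variable name="leaf-nodes" as="node()*">
      <xsl:evaluate xpath="$leaves" context-item="$root"/>
    </xsl:variable>

    <xsl:for-each-group select="$leaf-nodes"
        group-starting-with="node()[generate-id() = $split-ids]">
      <xsl:apply-templates select="$root" mode="sp:project">
        <xsl:with-param name="restricted-to" as="xs:string*" tunnel="yes"
            select="current-group()/ancestor-or-self::node()/generate-id()"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Conditional identity template -->
  <xsl:template match="node()" mode="sp:project">
    <xsl:param name="restricted-to" as="xs:string*" tunnel="yes"/>
    <xsl:if test="generate-id() = $restricted-to">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:package>
```

A using stylesheet would reference the package with xsl:use-package and invoke sp:split through xsl:call-template, supplying vocabulary-specific XPath strings. Unprefixed element names in those strings are resolved in the namespace context of the xsl:evaluate instruction, so callers of this sketch would write *:pb or extend the package with a namespace-context parameter.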

XSLT Pays Off!