Splitting XML Documents at Milestone Elements

Using the XSLT Upward Projection Method

Gerrit Imsieke (@gimsieke), le-tex publishing services (@letexml)

Problems addressed / Applications

Splitting TEI documents at page breaks
Put Two-Column Regions into Separate XSL-FO Blocks
Split at Line Breaks, Excluding Footnotes, List Items, etc.

Also:

Performance analysis
Streamability?
Support arbitrary XML vocabularies (dynamic XPath evaluation)

Eliot Kimber’s FO Block Splitting Woes

DITA OT Day 2018 video (~2:30–4:25), xsl-list post

Eliot Kimber’s DITA OT Day 2018 presentation

Geert Borman’s Split-at-Page-Break Woes

“breaking up XML on page break element”

(Geert’s 2014-07-04 XSL Mailing list post)

An “overlapping markup” problem (page division vs. document hierarchy)

Test Document for Splitting at Page Breaks

Martin Luther’s translation of the New Testament into German (1522), TEI P5 XML from Deutsches Textarchiv

Luther’s New Testament as TEI XML

452 pb milestone elements at varying depths

<pbs>
  <pb path="/TEI/text/body/div/div/p/pb" count="238"/>
  <pb path="/TEI/text/body/div/div/pb" count="91"/>
  <pb path="/TEI/text/body/pb" count="52"/>
  <pb path="/TEI/text/body/div/pb" count="47"/>
  <pb path="/TEI/text/front/pb" count="11"/>
  <pb path="/TEI/text/body/div/p/pb" count="10"/>
  <pb path="/TEI/text/front/div/p/pb" count="3"/>
</pbs>
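
A listing like the one above can be produced with a few lines of XSLT 2.0. The following is only a sketch of one way to do it, not necessarily how the original numbers were obtained; it assumes the input is in the TEI namespace and builds each path from local element names.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <pbs>
      <!-- Group all pb elements by their path of local element names -->
      <xsl:for-each-group select="//pb"
        group-by="string-join(('', ancestor-or-self::*/local-name()), '/')">
        <!-- Most frequent path first -->
        <xsl:sort select="count(current-group())" order="descending"/>
        <pb path="{current-grouping-key()}" count="{count(current-group())}"/>
      </xsl:for-each-group>
    </pbs>
  </xsl:template>

</xsl:stylesheet>
```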

Luther’s New Testament as TEI XML

Simplified/Abridged Tree View

The XSLT Upward Projection Method

Process `/TEI` in `split` mode

with a tunneled `$restricted-to` parameter

(omitting `xsl:result-document` here for brevity)
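
The slide's code is not preserved in this text. A minimal sketch of the driving template, under the following assumptions: the leaf nodes below `/TEI` are grouped with `group-starting-with="pb"`, and for each group the `TEI` element is processed again in `split` mode, with `$restricted-to` holding the group's leaves and all their ancestors (the "upward projection"). The `chunks` wrapper and the variable name `$root` are hypothetical and only keep the output well-formed in place of `xsl:result-document`.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <xsl:template match="/TEI">
    <xsl:variable name="root" as="element()" select="."/>
    <!-- Hypothetical wrapper; real use would write one result document per group -->
    <chunks>
      <!-- Leaves: nodes without children (text nodes, empty elements such as pb, ...) -->
      <xsl:for-each-group select="descendant::node()[empty(node())]"
        group-starting-with="pb">
        <!-- Process the whole tree once per chunk, restricted to the
             current group's leaves and their ancestors -->
        <xsl:apply-templates select="$root" mode="split">
          <xsl:with-param name="restricted-to" as="node()*" tunnel="yes"
            select="current-group()/ancestor-or-self::node()"/>
        </xsl:apply-templates>
      </xsl:for-each-group>
    </chunks>
  </xsl:template>

</xsl:stylesheet>
```

On its own, this stylesheet falls back to the built-in template rules in `split` mode, which output text only; the conditional identity template is what turns each pass into a pruned copy of the tree.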

The “Conditional Identity Template”
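
A sketch of what such a template could look like, assuming the `exists(. intersect $restricted-to)` test mentioned under Remedies: a node is copied, and its children are processed, only if it belongs to the tunneled node set.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template that only fires for nodes in $restricted-to -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <!-- Attributes are never leaves of the grouping, so copy them as-is -->
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Because ancestors of the current group's leaves are part of `$restricted-to`, each chunk keeps the element hierarchy above its content, while siblings that belong to other chunks are dropped.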

Group 1

Group 2

Group 3

Group 4

Group 5

Splitting Result

The `teiHeader` would be missing…

…if it weren’t for this template:
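
The template itself is not preserved in this text. A plausible reconstruction, offered as an assumption: since the `teiHeader` leaves precede the first `pb`, they fall into the first group only, so a more specific template in `split` mode can copy the header unconditionally into every chunk.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Takes precedence over a match="node()" template because of its
       higher default priority; ignores $restricted-to entirely -->
  <xsl:template match="teiHeader" mode="split">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```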

The other applications

FO block splitting

Nested grouping (group-starting-with for <two-col-start>, group-ending-with for <two-col-end>)
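
A sketch of the nested grouping, under stated assumptions: the container element `body`, the output wrapper `region`, and its `columns` attribute are hypothetical names chosen for illustration; only `two-col-start` and `two-col-end` come from the slide. The outer grouping starts a new group at each start marker, the inner grouping closes it at the end marker, and whatever follows the end marker becomes a single-column group.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

  <xsl:template match="body">
    <xsl:variable name="root" as="element()" select="."/>
    <xsl:copy>
      <xsl:for-each-group select="descendant::node()[empty(node())]"
        group-starting-with="two-col-start">
        <xsl:for-each-group select="current-group()"
          group-ending-with="two-col-end">
          <!-- A group is a two-column region if it contains a start marker -->
          <xsl:variable name="two-col" as="xs:boolean"
            select="exists(current-group()/self::two-col-start)"/>
          <region columns="{if ($two-col) then 2 else 1}">
            <xsl:apply-templates select="$root/node()" mode="split">
              <xsl:with-param name="restricted-to" as="node()*" tunnel="yes"
                select="current-group()/ancestor-or-self::node()"/>
            </xsl:apply-templates>
          </region>
        </xsl:for-each-group>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Conditional identity template -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- The markers have done their job; drop them from the output -->
  <xsl:template match="two-col-start | two-col-end" mode="split"/>

</xsl:stylesheet>
```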

Split at line breaks

Avoid splitting at line breaks in embedded list items or footnotes by

modifying the leaf selection XPath to treat lists and footnotes as leaf-like
switching back from split to #default mode when processing list item / footnote content
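
Both measures can be sketched as follows. The choice of `p` as splitting context, the `line` wrapper, and the use of TEI `list` and `note` for lists and footnotes are assumptions made for illustration. Lists and notes are added to the leaf selection, their descendants are excluded from it, and in `split` mode they hand over to the default (identity) mode so that their content is copied whole.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Default mode: plain identity -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Split paragraphs at lb, but not paragraphs inside lists or notes -->
  <xsl:template match="p[empty(ancestor::list | ancestor::note)]">
    <xsl:variable name="context" as="element()" select="."/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Leaf-like nodes: real leaves plus list and note, minus anything inside them -->
      <xsl:for-each-group group-starting-with="lb"
        select="descendant::node()[empty(node()) or self::list or self::note]
                                  [empty(ancestor::list | ancestor::note)]">
        <line>
          <xsl:apply-templates select="$context/node()" mode="split">
            <xsl:with-param name="restricted-to" as="node()*" tunnel="yes"
              select="current-group()/ancestor-or-self::node()"/>
          </xsl:apply-templates>
        </line>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Conditional identity template -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Lists and notes: if they belong to the chunk, leave split mode -->
  <xsl:template match="list | note" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:apply-templates select="." mode="#default"/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```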

Performance Measurements

Luther’s 1522 New Testament translation:

Number of pb elements: 452
Number of leaf nodes: 77,649

Hypothesis:

Performance depends linearly on doc size (node count)
and maybe on number of splitting points

~ Linear Dependence on Doc Size

Dependence on Number of Splitting Points?

Created docs with the first 10, 50, 100, 200, 300, 375 pages
using a modified upward projection XSLT

Surprise: the first 10 pages take milliseconds, the first 375 pages take minutes

⇒ Need to measure dependence on chunk size at constant doc length

Dependence on Number of Chunks at Constant Doc Length

Repeatedly removing every other pb results in fewer chunks

Dependence on Chunk Size at Constant Doc Length

Profiling

Culprit: Conditional Identity Template

When chunk length grows…

…the conditional identity template is called slightly fewer times
…however, each invocation takes significantly longer
resulting in (for chunk sizes > 5000 leaves) linear dependence of running time on chunk size
For smaller chunks, the running time still depends somewhat on the number of splitting points

Remedies

The number of Conditional Identity Template invocations cannot be reduced

Pass generated IDs instead of nodes, compare `generate-id() = $restricted-to` instead of `exists(. intersect $restricted-to)`

⇒ 20-fold acceleration for large chunks
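
A sketch of this remedy, assuming a driver template on `/TEI` that groups leaves with `group-starting-with="pb"`; the `chunks` wrapper and the variable name `$root` are hypothetical. The tunneled parameter now carries strings, and the membership test becomes a general comparison of one string against that sequence.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="xs">

  <xsl:template match="/TEI">
    <xsl:variable name="root" as="element()" select="."/>
    <chunks>
      <xsl:for-each-group select="descendant::node()[empty(node())]"
        group-starting-with="pb">
        <xsl:apply-templates select="$root" mode="split">
          <!-- Pass IDs of the group's leaves and their ancestors, not the nodes -->
          <xsl:with-param name="restricted-to" as="xs:string*" tunnel="yes"
            select="current-group()/ancestor-or-self::node()/generate-id()"/>
        </xsl:apply-templates>
      </xsl:for-each-group>
    </chunks>
  </xsl:template>

  <!-- Conditional identity template with a string comparison -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="xs:string*" tunnel="yes"/>
    <xsl:if test="generate-id() = $restricted-to">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```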

Streaming

No. Michael Kay wrote in 2014:

“... the real problem is that the logic is going down to descendants, then up to their ancestors, and then down again, and that's intrinsically not processing nodes in document order, which is a precondition for streaming.”

Even if it were feasible, the scaling with chunk size would be detrimental.

Dynamic XPath Evaluation

Can we have a configurable splitter for JATS, DocBook, TEI, HTML etc.?

Yes, works well!
No prohibitive performance penalty (runs 1.5 to 2 times as long as bespoke stylesheets)
Best delivered as XSLT 3.0 packages with a final entry template and private internal templates
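
A minimal sketch of a configurable splitter using the XSLT 3.0 `xsl:evaluate` instruction. The parameter names `context-xpath` and `split-at-xpath`, their defaults, and the `chunks` wrapper are hypothetical, and the sketch assumes that the splitting points are empty (milestone) elements. It is written as a plain stylesheet rather than a package; whether `xsl:evaluate` is available depends on the processor and its edition.

```xml
<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

  <!-- Hypothetical configuration: which element to split, and where -->
  <xsl:param name="context-xpath" as="xs:string" select="'/*'"/>
  <xsl:param name="split-at-xpath" as="xs:string" select="'descendant::*:pb'"/>

  <xsl:template match="/">
    <xsl:variable name="context" as="element()">
      <xsl:evaluate xpath="$context-xpath" context-item="."/>
    </xsl:variable>
    <xsl:variable name="split-points" as="node()*">
      <xsl:evaluate xpath="$split-at-xpath" context-item="$context"/>
    </xsl:variable>
    <chunks>
      <!-- A new group starts at every leaf that is one of the split points -->
      <xsl:for-each-group select="$context/descendant::node()[empty(node())]"
        group-starting-with="node()[exists(. intersect $split-points)]">
        <xsl:apply-templates select="$context" mode="split">
          <xsl:with-param name="restricted-to" as="xs:string*" tunnel="yes"
            select="current-group()/ancestor-or-self::node()/generate-id()"/>
        </xsl:apply-templates>
      </xsl:for-each-group>
    </chunks>
  </xsl:template>

  <!-- Conditional identity template, comparing generated IDs -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="xs:string*" tunnel="yes"/>
    <xsl:if test="generate-id() = $restricted-to">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```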

XSLT Pays Off!
