epub:html-splitter html-splitter

epubtools/modules/html-splitter/xpl/html-splitter.xpl

Import URI: http://transpect.le-tex.de/epubtools/modules/html-splitter/xpl/html-splitter.xpl

Sample invocation (for debugging purposes):

calabash/calabash.sh \
    -i source=file:/$(cygpath -ma ../content/output/debug/epubtools/create-ops/pre-split.html) \
    -i conf=file:/$(cygpath -ma adaptions/publisher/series/epubtools/heading-conf.xml) \
    -o result=tmp.html -o report=report.xml -o files=files.xml \
    file:/$(cygpath -ma epubtools/modules/html-splitter/xpl/html-splitter.xpl) \
    base-uri=file:/$(cygpath -ma ../content/output/debug/epubtools/create-ops/pre-split.html) \
    debug=yes \
    debug-dir-uri=file:/$(cygpath -ma ../content/output/debug)

Calabash seems to suppress some XSLT errors, for instance when a stylesheet is looping. In that case it may be necessary to replace collection()[…] with document(…) in the XSLT (alternative variable declarations are already included in the XSL file, commented out) and to run Saxon from the command line, for example like this:

PRE_SPLIT=file:/$(cygpath -ma ../content/le-tex/whitepaper/de/output/output/debug/epubtools/create-ops/pre-split.html)
saxon -xsl:epubtools/modules/html-splitter/xsl/html-splitter.xsl -s:$PRE_SPLIT -it:main \
    debug-dir-uri=file:/$(cygpath -ma debug) \
    debug=yes \
    final-pub-type=EPUB2 \
    heading-conf-uri=file:/$(cygpath -ma adaptions/common/epubtools/heading-conf.xml) \
    meta-uri=file:/$(cygpath -ma ../content/le-tex/whitepaper/de/output/output/debug/epubtools/epub-config.xml) \
    datadir=file:/$(cygpath -ma debug/datadir)

Input Ports

Name     Documentation                                  Connections
source
conf     /hierarchy – may be included in /epub-config
meta     /epub-config
css-xml  XML representation of the parsed CSS

Output Ports

Name  Documentation  Connections

result

files

report

Options

Name           Documentation  Default
base-uri
target                        'EPUB2'
debug                         'no'
debug-dir-uri                 'debug'

Subpipeline

If there is an error in the splitter XSLT, you might need to comment out this p:try/p:catch and move name="html-splitter-group" to the following p:group in order to facilitate debugging.

In extreme cases, it might be necessary to invoke the XSLT directly. For instructions, see the comments after the xsl:param instructions in html-splitter.xsl.

Step  Inputs  Outputs  Options

p:variable css-handling

meta on html-splitter

(/epub-config/@css-handling, 'regenerated-per-split')[1]

p:identity strip-leading-non-elements

source

source on html-splitter

result

p:group html-splitter-group

p:variable workdir

result on strip-leading-non-elements

replace($base-uri, '^(.*[/])+(.*)', '$1')

p:variable basename

result on strip-leading-non-elements

replace($base-uri, '^(.*[/])+(.*?)(\.[\w.]+)$', '$2')
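
The two replace() expressions for workdir and basename can be tried outside the pipeline. The following is a minimal Python sketch that mirrors them with the same regular expressions; the function names and the sample URI are assumptions made for illustration, and Python's re module is only an approximation of the XPath regex dialect (the two agree for plain ASCII file names like the one shown).

```python
import re


def workdir(base_uri: str) -> str:
    # Mirrors replace($base-uri, '^(.*[/])+(.*)', '$1'):
    # keeps everything up to and including the last slash.
    return re.sub(r'^(.*[/])+(.*)', r'\1', base_uri)


def basename(base_uri: str) -> str:
    # Mirrors replace($base-uri, '^(.*[/])+(.*?)(\.[\w.]+)$', '$2'):
    # drops the directory part and the (possibly multi-part) extension.
    return re.sub(r'^(.*[/])+(.*?)(\.[\w.]+)$', r'\2', base_uri)


# Hypothetical base URI, modelled on the sample invocation.
uri = "file:/C:/work/debug/epubtools/create-ops/pre-split.html"
print(workdir(uri))   # file:/C:/work/debug/epubtools/create-ops/
print(basename(uri))  # pre-split
```

If the URI has no file extension, the second pattern does not match and the input is returned unchanged, so basename would then contain the full URI.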

p:variable indent

meta on html-splitter

(/epub-config/@indent, 'true')[1]
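
Both css-handling and indent use the XPath idiom (attribute, 'default')[1]: the attribute value is taken if the attribute is present on /epub-config, otherwise the literal default. A small Python sketch of the same lookup, assuming the meta document is available as a string; config_value is a hypothetical helper name, not part of epubtools.

```python
import xml.etree.ElementTree as ET


def config_value(meta_xml: str, attribute: str, default: str) -> str:
    """Return /epub-config/@attribute, or the default if it is absent."""
    root = ET.fromstring(meta_xml)
    # The XPath only looks at a root element named epub-config.
    if root.tag != "epub-config":
        return default
    return root.get(attribute, default)


print(config_value('<epub-config/>', 'css-handling', 'regenerated-per-split'))
print(config_value('<epub-config indent="false"/>', 'indent', 'true'))
```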

letex:store-debug d94e102

source

result on strip-leading-non-elements

result

pipeline-step = concat('epubtools/html-splitter/', $basename, '/splitter-input')

active = $debug

base-uri = $debug-dir-uri

p:xslt split

source

result on strip-leading-non-elements

conf on html-splitter

meta on html-splitter

stylesheet

p:document ../xsl/html-splitter.xsl

result

template-name = 'main'

letex:store-debug d94e141

source

result on split

result

pipeline-step = concat('epubtools/html-splitter/', $basename, '/chunks')

active = $debug

base-uri = $debug-dir-uri

p:sink d94e150

source

result on d94e141

p:choose per-split-css

$css-handling = 'regenerated-per-split'

p:xslt per-split-css-xml-representations

Primary output: the new, reduced common CSS. Secondary port: individual CSS files if applicable.

parameters

p:empty

stylesheet

p:document ../xsl/per-split-css.xsl

source

css-xml on html-splitter

secondary on split

result

p:sink d94e192

source

result on per-split-css-xml-representations

p:xslt insert-individual-css-link

parameters

p:empty

stylesheet

p:document ../xsl/insert-individual-css-link.xsl

source

secondary on split

secondary on per-split-css-xml-representations

template-name = 'main'

p:for-each generate-css

result on per-split-css-xml-representations

secondary on per-split-css-xml-representations

css:generate gen

source

current on generate-css

result

prepend-resource-path = '../'

$css-handling = 'unchanged'

p:identity d94e231

source

secondary on split

result

p:otherwise

regenerated

css:generate gen

source

css-xml on html-splitter

result

prepend-resource-path = '../'

p:sink d94e255

source

result on gen

p:identity d94e257

source

secondary on split

result on gen

result

p:for-each store-chunks

result on per-split-css

p:variable chunk-file-uri

Uses base-uri(/*) instead of base-uri() because the base URI of the primary CSS is set by adding an xml:base attribute.

replace(base-uri(/*), 'chunks/', 'epub/OEBPS/')
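
As an illustration of the mapping above, this Python sketch applies the same substitution to a chunk URI. The function name and the sample paths are hypothetical; like fn:replace, re.sub replaces every occurrence of the pattern, not only the first.

```python
import re


def chunk_file_uri(base_uri: str) -> str:
    # Mirrors replace(base-uri(/*), 'chunks/', 'epub/OEBPS/').
    return re.sub(r'chunks/', 'epub/OEBPS/', base_uri)


# Hypothetical debug location of a chunk.
print(chunk_file_uri("file:/C:/work/debug/epubtools/chunks/chapter01.xhtml"))
# file:/C:/work/debug/epubtools/epub/OEBPS/chapter01.xhtml
```

Because the pattern is not anchored, any other path segment that ends in "chunks/" would be rewritten as well.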

p:choose d94e290

matches($chunk-file-uri, '\.ncx$' )

p:store store-chunk

source

source on html-splitter

result

include-content-type = 'true'

omit-xml-declaration = 'false'

href = $chunk-file-uri

doctype-public = if($target eq 'EPUB3') then '' else '-//NISO//DTD ncx 2005-1//EN'

doctype-system = if($target eq 'EPUB3') then '' else 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'

matches($chunk-file-uri, '\.(txt|css)$')

p:store d94e310

source

source on html-splitter

result

method = 'text'

encoding = 'UTF-8'

href = $chunk-file-uri

$target eq 'EPUB3'

p:store store-chunk

source

source on html-splitter

result

include-content-type = 'false'

omit-xml-declaration = 'false'

method = 'xhtml'

indent = if ($indent = 'true') then 'true' else 'false'

href = $chunk-file-uri

$target = 'EPUB2' and matches(base-uri(), 'nav\.xhtml$')

p:sink d94e333

source

source on html-splitter

p:otherwise

p:delete d94e340

source

source on html-splitter

result

match = "@epub:type | html:nav[@epub:type = 'landmarks']"

p:store store-chunk

source

result on d94e340

result

include-content-type = 'true'

omit-xml-declaration = 'false'

method = 'xhtml'

doctype-public = '-//W3C//DTD XHTML 1.1//EN'

doctype-system = 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'

indent = if ($indent = 'true') then 'true' else 'false'

href = $chunk-file-uri
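
The branches of the p:choose above can be read as a decision table that is evaluated top to bottom. The following Python sketch restates them under simplifying assumptions: store_options and the returned dictionaries are hypothetical, the nav.xhtml test is applied to the chunk file URI rather than to base-uri(), and the "delete" key merely records the p:delete match of the last branch.

```python
import re
from typing import Dict, Optional

NCX_PUBLIC = "-//NISO//DTD ncx 2005-1//EN"
NCX_SYSTEM = "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd"
XHTML11_PUBLIC = "-//W3C//DTD XHTML 1.1//EN"
XHTML11_SYSTEM = "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"


def store_options(uri: str, target: str, indent: str = "true") -> Optional[Dict[str, str]]:
    """Return the p:store options for one chunk, or None if the chunk is sunk."""
    indent_value = "true" if indent == "true" else "false"
    if re.search(r"\.ncx$", uri):
        # NCX: the doctype is left empty for EPUB3.
        epub3 = target == "EPUB3"
        return {
            "include-content-type": "true",
            "omit-xml-declaration": "false",
            "doctype-public": "" if epub3 else NCX_PUBLIC,
            "doctype-system": "" if epub3 else NCX_SYSTEM,
        }
    if re.search(r"\.(txt|css)$", uri):
        return {"method": "text", "encoding": "UTF-8"}
    if target == "EPUB3":
        return {
            "include-content-type": "false",
            "omit-xml-declaration": "false",
            "method": "xhtml",
            "indent": indent_value,
        }
    if target == "EPUB2" and re.search(r"nav\.xhtml$", uri):
        return None  # p:sink: the navigation document is not stored for EPUB2
    return {
        "delete": "@epub:type | html:nav[@epub:type = 'landmarks']",
        "include-content-type": "true",
        "omit-xml-declaration": "false",
        "method": "xhtml",
        "doctype-public": XHTML11_PUBLIC,
        "doctype-system": XHTML11_SYSTEM,
        "indent": indent_value,
    }
```

Order matters: an .ncx or .css file is handled by the first two branches regardless of target, and only the remaining chunks reach the EPUB3/EPUB2 distinction.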

p:xslt collect-file-uri

source

current on store-chunks

stylesheet

p:document ../xsl/collect-file-uri.xsl

result

p:sink d94e367

source

p:for-each signal-splitting-error

The presence of an orig.txt is an indicator that the split text differs from the original text. We’ll raise an error. We don’t do it immediately within the split step because we want to store the results first so that you can do forensics.

secondary on split

p:add-attribute orig-txt-url

source

 <p>The after-split text differs from the pre-split text. This typically
 occurs when there is text content immediately below the HTML body element.
 Please check your HTML input and/or its generation process. If debugging is 
 switched on, you’ll find two files, <a>orig.txt</a> and <a>chunks.txt</a>, that you may diff
 line by line.</p>

result

match = '/html:p/html:a[1]'

attribute-name = 'href'

attribute-value = base-uri()

p:add-attribute chunks-txt-url

source

result on orig-txt-url

result

match = '/html:p/html:a[2]'

attribute-name = 'href'

attribute-value = replace(base-uri(), 'orig\.txt$', 'chunks.txt')

p:error splitting-error

source

result on chunks-txt-url

result

code = 'epub:SPLT01'
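
When epub:SPLT01 is raised, the two text files named in the message can be compared line by line. A minimal Python sketch using the standard difflib module follows; the helper names are hypothetical, and where the files end up depends on your debug-dir-uri, so no path is assumed here.

```python
import difflib
import re
from typing import List


def chunks_txt_uri(orig_txt_uri: str) -> str:
    # Mirrors replace(base-uri(), 'orig\.txt$', 'chunks.txt').
    return re.sub(r"orig\.txt$", "chunks.txt", orig_txt_uri)


def diff_split_text(orig: str, chunks: str) -> List[str]:
    """Return a unified diff of the pre-split and after-split text."""
    return list(
        difflib.unified_diff(
            orig.splitlines(),
            chunks.splitlines(),
            fromfile="orig.txt",
            tofile="chunks.txt",
            lineterm="",
        )
    )


# Toy input: the line "stray text" was lost during splitting.
for line in diff_split_text("Heading\nstray text\nParagraph", "Heading\nParagraph"):
    print(line)
```

For real files, read both with pathlib.Path(...).read_text(encoding="utf-8") and pass the contents to diff_split_text; an empty list means the texts are identical.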

p:wrap-sequence wrap-chunks

source

result on per-split-css

result

wrapper = 'document'

wrapper-namespace = 'http://xmlcalabash.com/ns/extensions'

wrapper-prefix = 'cx'

p:wrap-sequence wrap-chunk-uris

source

files on store-chunks

result

wrapper = 'document'

wrapper-namespace = 'http://xmlcalabash.com/ns/extensions'

wrapper-prefix = 'cx'

p:add-attribute d94e432

source

report on html-splitter-group

result

match = '/*'

attribute-name = 'transpect:step-name'

attribute-value = 'html-splitter'

p:add-attribute report

source

result on d94e432

result

match = '/*'

attribute-name = 'transpect:rule-family'

attribute-value = 'html-splitter'