epub:html-splitter html-splitter

epubtools/modules/html-splitter/xpl/html-splitter.xpl

Import URI: http://transpect.le-tex.de/epubtools/modules/html-splitter/xpl/html-splitter.xpl

Sample invocation (for debugging purposes):

calabash/calabash.sh \
    -i source=file:/$(cygpath -ma ../content/output/debug/epubtools/create-ops/pre-split.html) \
    -i conf=file:/$(cygpath -ma adaptions/publisher/series/epubtools/heading-conf.xml) \
    -o result=tmp.html -o report=report.xml -o files=files.xml \
    file:/$(cygpath -ma epubtools/modules/html-splitter/xpl/html-splitter.xpl) \
    base-uri=file:/$(cygpath -ma ../content/output/debug/epubtools/create-ops/pre-split.html) \
    debug=yes \
    debug-dir-uri=file:/$(cygpath -ma ../content/output/debug)

Calabash seems to suppress some XSLT errors, for instance when a stylesheet gets stuck in a loop. In that case it may be necessary to replace collection()[…] with document(…) in the XSL (alternative variable declarations are already included in the XSL file, commented out) and to run Saxon from the command line, for example like this:

PRE_SPLIT=file:/$(cygpath -ma ../content/le-tex/whitepaper/de/output/output/debug/epubtools/create-ops/pre-split.html)
saxon -xsl:epubtools/modules/html-splitter/xsl/html-splitter.xsl -s:$PRE_SPLIT -it:main \
    debug-dir-uri=file:/$(cygpath -ma debug) \
    debug=yes \
    final-pub-type=EPUB2 \
    heading-conf-uri=file:/$(cygpath -ma adaptions/common/epubtools/heading-conf.xml) \
    meta-uri=file:/$(cygpath -ma ../content/le-tex/whitepaper/de/output/output/debug/epubtools/epub-config.xml) \
    datadir=file:/$(cygpath -ma debug/datadir)

Visualisation

Pre-generating this SVG image requires Graphviz to be installed. Please inform your project maintainer.

Output Ports

Name | Documentation | Connections
result | |
files | |
report | |

Options

Name | Documentation | Default
base-uri | |
target | | 'EPUB2'
debug | | 'no'
debug-dir-uri | | 'debug'

Subpipeline

Step | Inputs | Outputs | Options

p:identity strip-leading-non-elements

source

source on html-splitter

result

p:try html-splitter-group

p:group d468e56

p:variable workdir

result on strip-leading-non-elements

replace($base-uri, '^(.*[/])+(.*)', '$1')

p:variable basename

result on strip-leading-non-elements

replace($base-uri, '^(.*[/])+(.*?)(\.[\w.]+)$', '$2')

p:variable indent

meta on html-splitter

(/epub-config/@indent, 'true')[1]
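The three variables above can be tried out away from the pipeline. The following is a minimal sketch in Python, not part of the pipeline itself: it assumes that Python's re module behaves like the XPath regex flavor for these particular patterns, and the function names split_base_uri and resolve_indent are hypothetical.

```python
import re
from typing import Optional, Tuple


def split_base_uri(base_uri: str) -> Tuple[str, str]:
    """Approximate the workdir and basename variables (hypothetical helper)."""
    # workdir: everything up to and including the last slash
    workdir = re.sub(r'^(.*[/])+(.*)', r'\1', base_uri)
    # basename: file name without its directory and without its extension(s);
    # the lazy (.*?) stops at the first dot of the file name
    basename = re.sub(r'^(.*[/])+(.*?)(\.[\w.]+)$', r'\2', base_uri)
    return workdir, basename


def resolve_indent(config_indent: Optional[str]) -> str:
    """Mirror (/epub-config/@indent, 'true')[1]: use the attribute if present,
    otherwise fall back to 'true'."""
    return config_indent if config_indent is not None else 'true'


if __name__ == '__main__':
    # The URI below is an invented example, not a path the pipeline requires.
    print(split_base_uri('file:/C:/out/debug/pre-split.html'))
    print(resolve_indent(None))
```

Note that a file name with several dots, such as chapter.01.xhtml, yields the basename chapter under this reading, because everything from the first dot onwards is treated as the extension.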

letex:store-debug d468e92

source

result on strip-leading-non-elements

result

pipeline-step = concat('epubtools/html-splitter/', $basename, '/splitter-input')

active = $debug

base-uri = $debug-dir-uri

p:xslt split

source

result on strip-leading-non-elements

conf on html-splitter

meta on html-splitter

stylesheet

p:document ../xsl/html-splitter.xsl

result

template-name = 'main'

letex:store-debug d468e131

source

result on split

result

pipeline-step = concat('epubtools/html-splitter/', $basename, '/chunks')

active = $debug

base-uri = $debug-dir-uri

p:sink d468e140

source

result on d468e131

p:for-each store-chunks

result on split

secondary on split

p:variable chunk-file-uri

replace(base-uri(), 'chunks/', 'epub/OEBPS/')

p:choose d468e160

matches($chunk-file-uri, '\.ncx$')

p:store store-chunk

source

source on html-splitter

result

include-content-type = 'true'

method = 'xhtml'

omit-xml-declaration = 'false'

indent = if ($indent = 'true') then 'true' else 'false'

href = $chunk-file-uri

doctype-public = if($target eq 'EPUB3') then '' else '-//NISO//DTD ncx 2005-1//EN'

doctype-system = if($target eq 'EPUB3') then '' else 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'

matches($chunk-file-uri, '\.txt$')

p:store d468e178

source

source on html-splitter

result

method = 'text'

href = $chunk-file-uri

$target eq 'EPUB3'

p:store store-chunk

source

source on html-splitter

result

include-content-type = 'false'

omit-xml-declaration = 'false'

method = 'xhtml'

indent = if ($indent = 'true') then 'true' else 'false'

href = $chunk-file-uri

p:otherwise

p:store store-chunk

source

source on html-splitter

result

include-content-type = 'true'

omit-xml-declaration = 'false'

method = 'xhtml'

doctype-public = '-//W3C//DTD XHTML 1.1//EN'

doctype-system = 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'

indent = if ($indent = 'true') then 'true' else 'false'

href = $chunk-file-uri

p:xslt collect-file-uri

source

current on store-chunks

stylesheet

p:document ../xsl/collect-file-uri.xsl

result

p:sink d468e223

source
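The branching inside store-chunks can be summarised as a small decision function. The sketch below is written in Python for illustration only. It assumes that the branches are tested in the order listed above and that the first matching one wins. The function name chunk_store_options and the dictionary layout are hypothetical; the option names and values are copied from the listing.

```python
import re
from typing import Dict


def chunk_store_options(chunk_base_uri: str,
                        target: str = 'EPUB2',
                        indent: str = 'true') -> Dict[str, str]:
    """Return the p:store options a chunk would get (hypothetical helper)."""
    # chunk-file-uri: redirect from the chunks directory to epub/OEBPS/
    href = re.sub(r'chunks/', 'epub/OEBPS/', chunk_base_uri)
    indent_value = 'true' if indent == 'true' else 'false'

    if re.search(r'\.ncx$', href):
        # NCX file: the doctype is emptied when the target is EPUB3
        epub3 = target == 'EPUB3'
        return {
            'include-content-type': 'true',
            'method': 'xhtml',
            'omit-xml-declaration': 'false',
            'indent': indent_value,
            'href': href,
            'doctype-public': '' if epub3 else '-//NISO//DTD ncx 2005-1//EN',
            'doctype-system': '' if epub3
            else 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd',
        }
    if re.search(r'\.txt$', href):
        # plain-text files (such as orig.txt and chunks.txt)
        return {'method': 'text', 'href': href}
    if target == 'EPUB3':
        # EPUB3 content documents: no doctype, no content-type meta element
        return {
            'include-content-type': 'false',
            'omit-xml-declaration': 'false',
            'method': 'xhtml',
            'indent': indent_value,
            'href': href,
        }
    # otherwise: content documents with the XHTML 1.1 doctype
    return {
        'include-content-type': 'true',
        'omit-xml-declaration': 'false',
        'method': 'xhtml',
        'doctype-public': '-//W3C//DTD XHTML 1.1//EN',
        'doctype-system': 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd',
        'indent': indent_value,
        'href': href,
    }
```

Because the .ncx and .txt tests come first, the target option only affects the doctype of an NCX file and has no effect on text files.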

p:for-each signal-splitting-error

The presence of an orig.txt is an indicator that the split text differs from the original text. We’ll raise an error. We don’t do it immediately within the split step because we want to store the results first so that you can do forensics.

secondary on split

p:add-attribute orig-txt-url

source

 <p>The after-split text differs from the pre-split text. This typically
 occurs when there is text content immediately below the HTML body element.
 Please check your HTML input and/or its generation process. If debugging is 
 switched on, you’ll find two files, <a>orig.txt</a> and <a>chunks.txt</a>, that you may diff
 line by line.</p>

result

match = '/xhtml:p/xhtml:a[1]'

attribute-name = 'href'

attribute-value = base-uri()

p:add-attribute chunks-txt-url

source

result on orig-txt-url

result

match = '/xhtml:p/xhtml:a[2]'

attribute-name = 'href'

attribute-value = replace(base-uri(), 'orig\.txt$', 'chunks.txt')

p:error splitting-error

source

result on chunks-txt-url

result

code = 'epub:SPLT01'
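When epub:SPLT01 is raised, the two text files named in the message can be compared line by line. The following Python sketch shows one possible way to do this. It is not part of the pipeline, and the helper names chunks_txt_uri and diff_split_text are hypothetical. The first helper mirrors the replace() expression used for the second link; the second uses difflib from the Python standard library.

```python
import difflib
import re
from typing import List


def chunks_txt_uri(orig_txt_uri: str) -> str:
    """Mirror replace(base-uri(), 'orig\\.txt$', 'chunks.txt')."""
    return re.sub(r'orig\.txt$', 'chunks.txt', orig_txt_uri)


def diff_split_text(orig_text: str, chunks_text: str) -> List[str]:
    """Return a unified diff of the pre-split and after-split text.

    An empty list means that the two texts are identical line by line.
    """
    return list(difflib.unified_diff(
        orig_text.splitlines(),
        chunks_text.splitlines(),
        fromfile='orig.txt',
        tofile='chunks.txt',
        lineterm='',
    ))


if __name__ == '__main__':
    # Invented sample data: a stray text node directly below the body
    # element is present in the original text but missing from the chunks.
    original = 'stray text below body\nChapter 1\n'
    chunks = 'Chapter 1\n'
    print('\n'.join(diff_split_text(original, chunks)))
```

In practice the two strings would be read from the orig.txt and chunks.txt files that the error message links to. Lines prefixed with a minus sign are text that was lost during splitting.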

p:wrap-sequence wrap-chunks

source

secondary on split

result

wrapper = 'document'

wrapper-namespace = 'http://xmlcalabash.com/ns/extensions'

wrapper-prefix = 'cx'

p:wrap-sequence wrap-chunk-uris

source

files on store-chunks

result

wrapper = 'document'

wrapper-namespace = 'http://xmlcalabash.com/ns/extensions'

wrapper-prefix = 'cx'

p:catch split-failed

error

letex:propagate-caught-error propagate

source

error on split-failed

result

msg-file = 'splitter-error.txt'

code = 'epub:SPLT01'

status-dir-uri = concat($debug-dir-uri, '/status')

p:identity errors

source

result on propagate

result

p:sink d468e319

source

result on errors

p:add-attribute d468e322

source

report on html-splitter-group

result

match = '/*'

attribute-name = 'transpect:step-name'

attribute-value = 'html-splitter'

p:add-attribute report

source

result on d468e322

result

match = '/*'

attribute-name = 'transpect:rule-family'

attribute-value = 'html-splitter'