Secure document access in XSLT

“Why can’t my XSLT load a document?” a colleague asked me recently. We looked at the problem closely and discovered that the file referenced in the document function had a broken DTD link. The XSLT processor logged a warning, but it was all too easy to miss.

This story gave me food for thought, and I decided to reflect on a practice we take for granted: accessing XML files in XSLT.

The same old song

If, for some reason, XSLT 1.0 is your production technology, the document function is the only choice.

The weakness of the function is its undefined fallback behaviour. The spec leaves it to the implementor to decide whether the processor should recover from errors or not. In practice, when a missing file is referenced, my Saxon 9 HE processor logs an error but does not exit immediately. Why is that bad? Well, if you have a template call chain that ends in a silenced document load error, it can be a real challenge to track down the problem.

If you’re running XSLT 2.0 or higher, prefer the doc function, which raises an error when the document cannot be retrieved. If you have to stay with document, be restrictive about types. The as attribute is itself XSLT 2.0 syntax, so this needs a 2.0 processor even if the rest of the stylesheet is written in the 1.0 style. The following line terminates the transform if the document is missing.

<xsl:variable name="configuration" select="document('configuration.xml')" as="document-node()"/>

Caused by: net.sf.saxon.trans.XPathException: An empty sequence is not allowed as the value of variable $configuration
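
For comparison, here is a minimal sketch of the same declaration using doc, assuming an XSLT 2.0 processor. The file name is carried over from the example above. Since doc already fails on a missing file, the type declaration mostly documents intent. Processors may evaluate variables lazily, so the variable is referenced in a template to make sure the load is actually attempted.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- doc() raises an error if configuration.xml cannot be retrieved -->
    <xsl:variable name="configuration" select="doc('configuration.xml')" as="document-node()"/>

    <xsl:template match="/">
        <!-- Referencing the variable forces the document to be loaded -->
        <xsl:copy-of select="$configuration"/>
    </xsl:template>

</xsl:stylesheet>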

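document can still be handy when you want to fall back to a reasonable default if the input file does not exist. This is a good pattern for user-supplied configuration files. Note that the comma operator in the expression below is XPath 2.0 syntax, so it also requires a 2.0 processor.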

<xsl:variable name="fallback-configuration">
    <!-- ... -->
</xsl:variable>

<xsl:copy-of select="(document('configuration.xml'), $fallback-configuration)[1]"/>
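
On a strictly XSLT 1.0 processor, a similar fallback could be sketched with xsl:choose. This relies on the processor recovering from the missing file by returning an empty node-set, which the 1.0 specification permits but does not require. The variable name loaded is made up for this illustration.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:variable name="fallback-configuration">
        <!-- ... -->
    </xsl:variable>

    <!-- Empty node-set if the processor recovers from a missing file -->
    <xsl:variable name="loaded" select="document('configuration.xml')"/>

    <xsl:template match="/">
        <xsl:choose>
            <!-- A non-empty node-set is true, so the loaded file wins -->
            <xsl:when test="$loaded">
                <xsl:copy-of select="$loaded"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:copy-of select="$fallback-configuration"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

</xsl:stylesheet>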

Second shot

Thanks to XSLT 2.0, the example above can have much richer semantics. First, there’s doc-available to check that the document can be loaded. Wrapping the check in a function with a meaningful name makes the intent even clearer.

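<xsl:function name="xr:load-configuration" as="document-node()">
    <!-- URI of the configuration file to load -->
    <xsl:param name="uri"/>

    <!-- Default configuration, used when the file cannot be loaded -->
    <xsl:variable name="fallback-configuration">
        <!-- ... -->
    </xsl:variable>

    <!-- doc() is only called once doc-available() has confirmed it will succeed -->
    <xsl:sequence select="if (doc-available($uri))
                          then doc($uri)
                          else $fallback-configuration"/>
</xsl:function>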
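
A call site might look like the sketch below. The xr namespace URI and the configuration.xml path are placeholders chosen for illustration, and the function from above is repeated in condensed form only so that the stylesheet runs on its own.

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xr="http://example.com/xr"
                exclude-result-prefixes="xr">

    <!-- Same function as above; the namespace URI is a placeholder -->
    <xsl:function name="xr:load-configuration" as="document-node()">
        <xsl:param name="uri"/>
        <xsl:variable name="fallback-configuration">
            <!-- ... -->
        </xsl:variable>
        <xsl:sequence select="if (doc-available($uri))
                              then doc($uri)
                              else $fallback-configuration"/>
    </xsl:function>

    <!-- Either the loaded file or the fallback, never an empty sequence -->
    <xsl:variable name="configuration" select="xr:load-configuration('configuration.xml')"/>

    <xsl:template match="/">
        <xsl:copy-of select="$configuration"/>
    </xsl:template>

</xsl:stylesheet>

The caller no longer needs to know whether the file existed, which keeps the fallback logic in one place.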

Another thing worth mentioning is the collection function, a powerful way to deal with sets of documents. The query parameters shown here are specific to Saxon. I would consider the following:

Avoid recursing into hidden folders with recurse=no.
Halt the transform when a parse error happens with on-error=fail.
Make the search pattern as strict as possible. select=*.xml is much better than select=*, but select=configuration-*.xml is better still.

<xsl:sequence select="collection('conf/?select=configuration-*.xml;recurse=no;on-error=fail')"/>
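
To see which files the pattern actually matches, the collection can be iterated and each document’s location logged. This is only a sketch, assuming Saxon’s directory collection syntax. The conf/ directory comes from the example above, the configurations wrapper element is invented for illustration, and on-error=fail is used because fail, warning and ignore are the values Saxon documents for this parameter.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/">
        <!-- Hypothetical wrapper so the output has a single root element -->
        <configurations>
            <xsl:for-each select="collection('conf/?select=configuration-*.xml;recurse=no;on-error=fail')">
                <!-- Log the location of every matched document -->
                <xsl:message select="concat('Loaded ', base-uri(.))"/>
                <xsl:copy-of select="."/>
            </xsl:for-each>
        </configurations>
    </xsl:template>

</xsl:stylesheet>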

If you’re confident enough about your end users, or you analyse your log files diligently, on-error=warning and directory recursion are still fine.

Conclusions

Different programming paradigms bring different kinds of issues. One can have a hard time trying to understand a null reference error in multithreaded code. XSLT has no nulls, but it is no exception. Missing output data caused by incomplete, distorted or invalid inputs is a common thing. We cannot anticipate every potential weakness of our products, but we can save time by making stylesheets stricter, especially with the recent versions of XSLT.

P.S. Please share your experience. Cheers!
