Really, XML catalogs matter
Back in the early 2000s, I had a weblog. I had started it before the term “blog” had been coined.
This is a post, originally from January 16, 2004, about XML Catalogs, and figuring out what the right thing to test is. It was popular enough to be linked from PHP documentation for the DOMXML extension.
I’ve updated links and corrected typos. I just realized that Marc’s last name had been misspelled for many years!
Some four years after this was originally posted, the W3C noticed all the validation request traffic coming from XML applications that were not using catalogs or caches, and did something about it.
This post is a snapshot of an era. Remember when we were going to do all the things in XML?
This week I learned that XML Catalogs are important.
This started when I updated Marc Liyanage’s PHP binary for Mac OS X on my development machine.
Pages went from taking milliseconds to over a minute to render. To say I was puzzled would be an understatement. I rolled back to an earlier version.
Looking for Clues
Some initial testing on another machine determined that the slowdown was in the DOMXML extension to PHP. The extension exposes the Gnome XML and XSLT libraries as functions and objects to PHP.
After searching Google, php.net, and xmlsoft.com, I sent an email to Christian Stocker in Zurich. Christian works on the DOMXML extensions, and he might know of a bug.
I had gotten it in my head that the problem lay in nesting XInclude statements. XInclude is a specification for including one XML document inside another. We use XInclude to keep content for one of our sites isolated to a well-formed, valid XHTML document that can be edited in BBEdit.
A section of the intranet is described as an Atom feed, and each article’s contents are included in the feed. The Atom feed is in turn included in an envelope document that contains the rest of the XML needed to render any page in the section.
I had jumped to the conclusion that somehow LibXML2 had changed and had become inefficient at resolving nested XIncludes.
Christian wrote back that there weren’t any issues he knew of, but asked me to send a test case.
The Wrong Test
I had devised the test case:
foo.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <foo xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="bar.xml"/>
    </foo>

bar.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <bar xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="baz.xml"/>
    </bar>

baz.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <baz>Content!</baz>

When run with:

    <?php
    $dom = domxml_open_file("foo.xml");

    // Time only the XInclude processing step.
    $start1 = gettimeofday();
    $dom->xinclude();
    $end1 = gettimeofday();

    // Elapsed seconds, combining the 'sec' and 'usec' fields.
    $totaltime1 = (float)($end1['sec'] - $start1['sec'])
                + ((float)($end1['usec'] - $start1['usec']) / 1000000);

    echo "Time to handle includes: $totaltime1<br>";
    echo $dom->dump_mem();
    ?>

That should return:

    <?xml version="1.0" encoding="utf-8"?>
    <foo xmlns:xi="http://www.w3.org/2001/XInclude">
      <bar xmlns:xi="http://www.w3.org/2001/XInclude">
        <baz>Content!</baz>
      </bar>
    </foo>

Which it did, but faster than I expected. It timed at less than a second instead of over a minute.
I changed bar.xml to:

    <?xml version="1.0" encoding="utf-8"?>
    <bar xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="baz.html"/>
    </bar>

and baz.html was:

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Untitled</title>
      </head>
      <body>
        <p>New document</p>
      </body>
    </html>

Which did take several seconds as I thought it would.
The Right Test
That’s when it dawned on me: somewhere between the library versions used by the old and new PHP builds, XInclude had started validating by default.
The XHTML DTD URL, http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, is a busy place. Try loading it and see. And I bet that URL is so heavily loaded because a lot of people don’t know their tools are calling over there every time they need to load or validate something.
Commenting out the DTD declaration in baz.html and re-running the test brings back the earlier level of performance. However, I don’t want to comment out the DTD references in my documents.
Going to Catalogs
I wrote back to Christian asking if LibXML, as built for PHP, honored XML Catalog files.
With a catalog file, I can tell my validating processor to resolve any reference to “http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd” as a local file. Catalogs can do more than that, but the local resolution of DTD files is important.
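As a rough sketch of what such a mapping looks like in the OASIS XML Catalog format: the `system` entry below redirects that one DTD URL to a local copy, and the `rewriteSystem` entry is one example of the “more” that catalogs can do, redirecting every system identifier that starts with a given URL prefix. The local directory file:///etc/xml/xhtml/DTD/ is an assumption made for illustration; substitute wherever your DTD copies actually live.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the file:///etc/xml/xhtml/DTD/ paths are assumed, not taken from a real setup. -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- Resolve this exact system identifier (the DTD URL) to a local file. -->
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///etc/xml/xhtml/DTD/xhtml1-strict.dtd"/>

  <!-- Resolve any system identifier under this URL prefix to the local directory. -->
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/"
                 rewritePrefix="file:///etc/xml/xhtml/DTD/"/>

</catalog>
```

An exact `system` match takes precedence over a `rewriteSystem` match, so listing both is harmless; in practice either one alone may be enough.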
Christian replied that by default, LibXML looks for a catalog at /etc/xml/catalog. So I created a catalog there.

    <?xml version="1.0"?>
    <!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
      "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
              uri="file:///etc/xml/xhtml/DTD/xhtml1-transitional.dtd" />
    </catalog>

I pointed “http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd” to a local directory Apache could read from, put a copy of the DTD files there, and tried the tests again. Not as fast as without validation, but certainly faster since it didn’t have to go over the Internet to validate the included file.
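The catalog entry shown here maps the Transitional public identifier, while the baz.html test document declares the Strict one. A matching entry for the Strict DTD might look like the sketch below; the local path is assumed by analogy with the Transitional entry and is not taken from the original setup.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the uri value is a hypothetical local path. -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Matches the public identifier in baz.html's DOCTYPE declaration. -->
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///etc/xml/xhtml/DTD/xhtml1-strict.dtd"/>
</catalog>
```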
So there you go, catalog files, really important. I am chastened.
Thanks to Christian for getting me pointed in the right direction on this.
Originally published on January 16, 2004 on whump.com (but nobody goes there anymore) and updated April 6, 2020, by ECH.