Thursday, August 12, 2010

DOM for Dummies

The Extensible Markup Language, or XML, is an HTML-like text format that encloses data values (the “content”) inside tags (the “markup”) that identify the data.  XML files can be read and manipulated through a standard programming interface called the Document Object Model, or DOM.

MATLAB supports the DOM through its xmlread function.  Suppose you have an XML document that reads in part:

<channellist>
  <channel normalize="true" shape="gaussian" name="Channel #1" >
    <center>0.3734</center>
    <width>0.0099</width>
    <polarizer type="none" />
    <dataoutput type="total" />
  </channel>
  <channel normalize="true" shape="gaussian" name="Channel #2" >
    <center>0.3829</center>
    <width>0.0098</width>
    <polarizer type="none" />
    <dataoutput type="total" />
  </channel>

etc.  You wish to pull out the list of all the values – 0.3734, 0.3829, etc. – between the <center> tags in the file “example.xml”.  Here are the MATLAB commands you would use:

xExample = xmlread('example.xml');
xCenter = xExample.getElementsByTagName('center');

Unfortunately, the variable xCenter is a DOM node list, not the values you want.  To get the values themselves into an array called values, a little more processing is in order:

n = xCenter.getLength;
values = zeros(1, n);              % preallocate the output array
for i = 0:n-1                      % DOM node lists are zero-based
    values(i+1) = str2num(xCenter.item(i).getFirstChild.getData);
end

Notice first that the index passed to xCenter.item begins at zero; thus, if there are ten channels with values under the <center> tags, the nodes are retrieved with xCenter.item(0) through xCenter.item(9).  This departs from the usual MATLAB rule that indices start at 1, because the DOM interface defines its node lists as zero-based.
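The same zero-based rule applies to any DOM node list, not only the list of <center> nodes.  As a minimal sketch, the lines below write a two-channel sample to a temporary file, then walk the <channel> elements and read each name attribute with the standard DOM method getAttribute.  The temporary file and the variable names (xmlFile, channels, names, centers) are assumptions made for illustration, not part of the original example:

```matlab
% Write a small sample document to a temporary file (illustrative only).
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'w');
fprintf(fid, '%s\n', '<channellist>', ...
    '<channel name="Channel #1"><center>0.3734</center></channel>', ...
    '<channel name="Channel #2"><center>0.3829</center></channel>', ...
    '</channellist>');
fclose(fid);

doc = xmlread(xmlFile);            % parse the file into a DOM document
delete(xmlFile);                   % the temporary file is no longer needed

channels = doc.getElementsByTagName('channel');
n = channels.getLength;
names   = cell(1, n);
centers = zeros(1, n);
for k = 0:n-1                      % DOM index runs 0 .. n-1
    node = channels.item(k);
    % MATLAB index runs 1 .. n, hence the k+1 on the left-hand side
    names{k+1} = char(node.getAttribute('name'));
    centerNode = node.getElementsByTagName('center').item(0);
    centers(k+1) = str2num(char(centerNode.getFirstChild.getData));
end
```

Calling getElementsByTagName on an individual channel node, rather than on the whole document, keeps each center value paired with the channel it belongs to.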

Notice also that the values returned by the method chain getFirstChild.getData are text strings, not numeric values.  To use them as numbers, we must apply a conversion function such as MATLAB’s str2num().
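For single numbers like these, str2double() is a common alternative: str2num() evaluates its input as a MATLAB expression, which is more than this job needs.  A brief sketch of the difference, using literal strings in place of the DOM text (the variable names and the 'n/a' value are hypothetical):

```matlab
txt = '0.3734';
a = str2num(txt);      % evaluates the text as a MATLAB expression
b = str2double(txt);   % parses a single number, with no evaluation

bad = 'n/a';           % hypothetical non-numeric content
c = str2double(bad);   % yields NaN instead of raising an error
```

Checking the result with isnan() is then one way to spot entries in the file that did not contain a number.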

XML files can get pretty complex, and the DOM is therefore also complex.  As a novice to XML parsing in MATLAB, I needed a couple of hours to figure out the syntax of the lines above.
