Monday, June 27, 2016

Parsing XMLs with DOM Parser

DOM is the simpler of the two common XML parsing approaches, the other being SAX. It is programmatically less complicated but also less efficient than SAX. A DOM parser loads the whole document into main memory and builds a tree from it before any of it can be processed, whereas a SAX parser handles each element as it is encountered. The obvious drawback of loading the full file into memory is that memory use and parsing time grow with the size of the document. Documents that do not fit in memory cannot be parsed at all.
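To make the contrast concrete, here is a minimal sketch of the SAX style, where a callback fires for each element as the parser encounters it and no tree is kept in memory. The class name SaxContrastDemo and the method countElements() are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxContrastDemo {

    // Hypothetical helper: counts elements as the SAX parser reports them.
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Called once per opening tag, while the input is still being read
                    count[0]++;
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // prints 3
        System.out.println(countElements("<first><second/><second/></first>"));
    }
}
```

Nothing is stored between callbacks here, which is why SAX copes with large files, but also why walking back to a parent or sibling is not possible the way it is with a DOM tree.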

To understand the DOM parser, we take an example XML file and parse it using DOM. Let's consider the following XML -

testXML.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<first>
    <second atName="one">
        <number id="one">1</number>
        <number id="two">2</number>
    </second>
    
    <second atName="two"> 
        <number id="one">1</number>
        <number id="two">2</number>
    </second>
</first>

Our goal is to parse this whole document and print its contents using the DOM parser. Before beginning with the example, let's look at some helper classes and interfaces -

DocumentBuilder - It defines the API to generate the DOM document tree from an XML document. It is usually obtained by calling DocumentBuilderFactory.newInstance().newDocumentBuilder().

Node - It is an interface which represents a node in the DOM tree.

Attr - This is the interface which represents a single attribute of an element node.

NamedNodeMap - This represents the collection of attributes that a node holds. Its items can be accessed by name or by index, but no particular order is guaranteed.
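The helper types listed above can be seen working together in a small sketch. It parses a one-element XML string (rather than a file, to keep it self-contained) and prints the root node with its attributes. The class name HelperClassesDemo and the method describeRoot() are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HelperClassesDemo {

    // Hypothetical helper: parses an XML string and describes its root element.
    static String describeRoot(String xml) {
        try {
            // DocumentBuilder turns the XML input into a DOM tree
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));

            // Node is the common interface for everything in the tree
            Node root = doc.getDocumentElement();
            StringBuilder out = new StringBuilder(root.getNodeName());

            // NamedNodeMap holds the attributes; each item is an Attr
            NamedNodeMap attrs = root.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr attr = (Attr) attrs.item(i);
                out.append(" ").append(attr.getName()).append("=").append(attr.getValue());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints "second atName=one"
        System.out.println(describeRoot("<second atName=\"one\"/>"));
    }
}
```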

In our example, the main() method first generates the DOM tree and the processNode() method traverses this tree printing the nodes as it encounters them.

DomParser.java


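import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Attr;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomParser {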

    public static void main(String[] args) {
        try {
            File file = new File("src/testXML.xml");
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(file);
            doc.getDocumentElement().normalize();

            String tab = "";

            System.out.println("Staring Parsing...");

            //Process root Node
            Node root = doc.getDocumentElement();
            System.out.println(root.getNodeName());
            processNode(root, "\t" + tab);

            System.out.println("Parsing Complete...");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void processNode(Node node, String tab) {
        try {
            NodeList children = node.getChildNodes();

            for (int i = 0; i < children.getLength(); i++) {
                Node ele = children.item(i);

                //Printing the node name or the text value in case of a Text node
                if (ele.getNodeName().equals("#text")) {
                    System.out.print(" " + ele.getNodeValue());
                } else {
                    System.out.print(tab + ele.getNodeName());
                }

                //Printing attributes of the current node.
                if (ele.hasAttributes()) {
                    NamedNodeMap attrs = ele.getAttributes();
                    for (int j = 0; j < attrs.getLength(); j++) {
                        Attr attribute = (Attr) attrs.item(j);
                        System.out.print(" " + attribute.getName() + "=" + attribute.getValue());
                    }
                }

                //Process children 
                processNode(ele, "\t" + tab);
            }

        } catch (DOMException e) {
            e.printStackTrace();
        }

    }

}

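Text nodes have the node name "#text", so they are handled separately: their value is printed instead of their name. The whitespace between elements in the XML file is also stored as text nodes, which is why the output contains blank lines. The method processNode() is called recursively as it traverses the whole tree. The output for the above program is as follows -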
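As a possible variation, the "#text" name check can be written with getNodeType(), and text nodes that contain only whitespace can be skipped to avoid the blank lines. The sketch below is one way to do that, not part of the original program. The class name TextNodeDemo and the methods collect() and parse() are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeDemo {

    // Hypothetical helper: collects element names and non-blank text values, depth first.
    static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Skip text nodes that hold only indentation or line breaks
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                out.add(child.getNodeName());
                collect(child, out);
            }
        }
    }

    // Hypothetical helper: parses an XML string and walks the children of its root.
    static List<String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints [number, 1]
        System.out.println(parse("<first>\n  <number id=\"one\">1</number>\n</first>"));
    }
}
```

Comparing node types rather than names also makes it easy to ignore other kinds of nodes, such as comments, which this sketch simply skips.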

Output


Staring Parsing...
first
 
     second atName=one 
          number id=one 1 
          number id=two 2 
     
    
     second atName=two  
          number id=one 1 
          number id=two 2 
     
Parsing Complete...


The DOM parser is not a very efficient parser, but for small documents, it can be very useful.
