Lesson 6 - Introduction to XML and writing via the SAX approach

In the previous lesson, Storing objects in CSV format in C# .NET, part 2, we wrote a database using text files or, more accurately, files in the CSV format. In today's tutorial, we're going to focus on the XML format. First, we'll describe it, then we'll introduce the classes which the .NET framework provides for reading and writing these files. We'll try writing today and leave reading for the next lesson.

The XML format

We're about to go over lots of terms. If you don't understand some of them, don't worry, we'll go into as much detail as possible.

XML (eXtensible Markup Language) is a markup language developed by the W3C (the organization responsible for Web standards). XML is very universal and is supported by a large number of languages and applications. The word extensible refers to the ability to define your own languages on top of XML; one of them is XHTML, used for creating websites. XML is a self-describing language, meaning that its structure tells us what each value means. In a CSV file, we can only guess what the third number, say an eight, means, whereas in XML, it'd be immediately clear that it's the number of articles the user has written. The disadvantage is that XML files are larger, but that's not a problem for us in most cases. Personally, I almost always choose the XML format. It's a good choice for saving a program's configuration, high scores for game players, or a small user database. Thanks to XSD schemas, we can also validate XML files and catch malformed data before it causes errors at run time.

XML can be processed in different ways, usually either by reading/writing it continuously (as a stream) or through a DOM object structure. Things have come so far that some tools (including the .NET libraries, through LINQ to XML) allow us to work with XML almost like a database and run SQL-like queries on it. As you can imagine, this saves a lot of work. Another language for querying XML files is XPath.
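To give an idea of what such queries look like, here is a minimal sketch that asks the same question twice, once with LINQ to XML and once with XPath. The class name XmlQueryDemo, its method names, and the small inline document are assumptions made up for this illustration; they aren't part of the lesson's project.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class XmlQueryDemo
{
    // A tiny sample document, inlined so the sketch is self-contained.
    const string Xml =
        "<users>" +
        "<user age=\"22\"><name>John Smith</name></user>" +
        "<user age=\"31\"><name>James Brown</name></user>" +
        "</users>";

    // LINQ to XML: names of all users older than the given age.
    public static string[] NamesOlderThan(int age)
    {
        XDocument doc = XDocument.Parse(Xml);
        return doc.Root.Elements("user")
            .Where(u => (int)u.Attribute("age") > age)
            .Select(u => (string)u.Element("name"))
            .ToArray();
    }

    // XPath: the same question written as a path expression.
    public static string[] NamesOlderThanXPath(int age)
    {
        XDocument doc = XDocument.Parse(Xml);
        return doc.XPathSelectElements("/users/user[@age > " + age + "]/name")
            .Select(n => n.Value)
            .ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", NamesOlderThan(25)));      // prints "James Brown"
        Console.WriteLine(string.Join(", ", NamesOlderThanXPath(25))); // prints "James Brown"
    }
}
```

Both variants load the whole document into memory first, which is the opposite of the continuous approach we'll use in this lesson.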

XML competes with JSON, which is simpler but less popular in business applications. Unlike XML, JSON makes it easy to append entries to the end of a file (for logging, for example) without loading the entire document.

XML is very often used to exchange data between different systems (e.g. desktop applications and web applications on a server). Therefore, as we've already mentioned, there are many libraries for it and practically every tool is able to work with it. This includes web services, SOAP, and so on. However, we won't deal with any of them now.

Last time, we saved a list of users to a CSV file. We saved their name, age, and date of registration. The values were next to each other, separated by semicolons. Each line represented a user. The file's contents looked like this:

John Smith;22;3/21/2000
James Brown;31;10/30/2012

Anyone who isn't directly involved wouldn't know what any of that means, would they? Here is the equivalent to that file in the XML format:

<?xml version="1.0" encoding="UTF-8" ?>
<users>
    <user age="22">
        <name>John Smith</name>
        <registered>3/21/2000</registered>
    </user>
    <user age="31">
        <name>James Brown</name>
        <registered>10/30/2012</registered>
    </user>
</users>

Now everyone can tell what is stored in the file. I saved age as an attribute just to demonstrate that XML is able to do things like that. Otherwise, it'd be saved as an element along with the name and registration date. Individual items are called elements. I'm sure you're all familiar with HTML, which is based on the same fundamentals as XML. The elements are usually paired, meaning that we write the opening tag followed by the value and then the closing tag with a slash. Elements can contain other elements, so it has a tree structure. Furthermore, we're able to save an entire hierarchy of objects into a single XML document.

At the beginning of an XML file, there is a header (the XML declaration). The document has to contain exactly one root element in order to be well-formed. Here, it's the <users> element, which contains the other nested elements. Attribute values are written after the attribute name, in quotation marks.
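The single-root rule is easy to try out. The following sketch (the class name WellFormedDemo and the method IsWellFormed are hypothetical helpers, not part of the lesson's project) lets the .NET parser decide whether a piece of text is acceptable XML:

```csharp
using System;
using System.Xml;

static class WellFormedDemo
{
    // Returns true when the text parses as XML, false when the parser rejects it.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        string oneRoot = "<users><user age=\"22\" /><user age=\"31\" /></users>";
        string twoRoots = "<user age=\"22\" /><user age=\"31\" />";

        Console.WriteLine(IsWellFormed(oneRoot));  // prints "True"
        Console.WriteLine(IsWellFormed(twoRoots)); // prints "False", there are two root elements
    }
}
```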

As you can probably tell, the file got bigger, which is the price paid for it to look pretty. If the user had more than three properties, you'd be able to see just how messy the CSV format can get, and how worthwhile the XML format is. Personally, as I gain more and more experience, I prefer solutions that are clear and simple, even if that means they take up more space. This applies not only to files but to source code as well. There is nothing worse than when a programmer looks at their code after a year and has no idea what the eighth parameter in a CSV file is when there are 100 numbers per line. Or when they find a five-dimensional array which is super fast, but so hard to follow that the whole functionality now has to be rewritten, something a clearer design would have avoided. However, let's get back to today's topic.

XML in .NET

We'll focus on two fundamental approaches to working with XML files: the continuous approach (the SAX parser) and the object-oriented approach (DOM). This lesson and the next one will be dedicated to SAX, after which we'll get to DOM. Again, there are more ways to work with XML files using the .NET framework. Some are old and only present for backward compatibility's sake. I've spent quite a lot of time working with XML files in .NET, so I've only included the most modern approaches and simple constructs.

Parsing XML via SAX

SAX (Simple API for XML) is essentially a simple extension of a text file reader. (Strictly speaking, .NET's XmlReader and XmlWriter form a forward-only streaming API rather than a classic event-driven SAX parser, but the principle of processing the document sequentially is the same.) Writing is relatively simple. We write the elements and attributes sequentially, in the same order in which they appear in the file (we ignore the tree structure in this approach). .NET provides the XmlWriter class, which relieves us of having to deal with the fact that XML is a text file. We only work with elements or, more accurately, nodes (more on that later).

Reading works just like writing. We read the XML much like a text file, from top to bottom. While reading, SAX hands us what are known as nodes (their kinds are listed in the XmlNodeType enumeration). A node can be an element, an attribute, or a value. We receive nodes in a loop, in the same order in which they're written in the file. We use the XmlReader class to read XML files. Both classes are in the System.Xml namespace.
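Reading is the topic of the next lesson, but a short preview may make the idea of nodes more concrete. This is only a sketch: the class ReaderPreview and the method ListNodes are names invented for the illustration, and the XML is read from a string instead of a file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class ReaderPreview
{
    // Collects a short description of each node in the order the reader delivers them.
    public static List<string> ListNodes(string xml)
    {
        List<string> nodes = new List<string>();
        using (XmlReader xr = XmlReader.Create(new StringReader(xml)))
        {
            while (xr.Read()) // moves to the next node, returns false at the end
            {
                if (xr.NodeType == XmlNodeType.Element)
                    nodes.Add("Element " + xr.Name);
                else if (xr.NodeType == XmlNodeType.Text)
                    nodes.Add("Text " + xr.Value);
                else if (xr.NodeType == XmlNodeType.EndElement)
                    nodes.Add("EndElement " + xr.Name);
            }
        }
        return nodes;
    }

    static void Main()
    {
        string xml = "<user age=\"22\"><name>John Smith</name></user>";
        foreach (string node in ListNodes(xml))
            Console.WriteLine(node);
    }
}
```

For this input, the loop reports the user element, the name element, the text John Smith, and then the two closing tags. Note that Read() doesn't stop at attributes on its own; they're read from the element the reader is currently positioned on.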

The advantage to the SAX approach is its high speed and low memory requirements. We'll see the disadvantages once we compare this approach to the DOM object-oriented approach later on.

Writing XML files

Let's create a simple XML file. We'll use the example with the users above for it. We already worked with the User class last time, but just to be sure, I'll show it here once more. Create a new project, a console application, name it XmlSaxWriting, and add a new class to the project:

class User
{
    public string Name { get; private set; }
    public int Age { get; private set; }
    public DateTime Registered { get; private set; }

    public User(string name, int age, DateTime registered)
    {
        Name = name;
        Age = age;
        Registered = registered;
    }

    public override string ToString()
    {
        return Name;
    }
}

For simplicity's sake, we'll write the code right in the Main() method. All we're really doing is testing out SAX's functionality. At this point, you should already know how to design object-oriented applications properly.

Don't forget to add using System.Xml.

We create an XmlWriter using the (static) Create() factory method. There is another way to do it, but this method is the most appropriate. The object will be wrapped in a using block. Of course, we could also store just a single object in XML (e.g. some settings). Here, we'll learn how to store a list of several objects. If you only want to store one object, you'll only need to make very minor changes.

First, let's create a list of some test users:

List<User> users = new List<User>();
users.Add(new User("John Smith", 22, new DateTime(2000, 3, 21)));
users.Add(new User("James Brown", 31, new DateTime(2016, 10, 30)));
users.Add(new User("Tom Hanks", 16, new DateTime(2011, 1, 12)));

Now we have something to write. We'll have the XML output nicely formatted and indented according to its tree structure. Unfortunately, this isn't enabled by default, so we'll have to turn it on by passing an XmlWriterSettings class instance. We'll set its Indent property to true:

XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
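To see what the Indent property actually changes, the following sketch writes the same small document twice and returns it as a string. The class IndentDemo and its Write() method are made up for this illustration, the output goes to a StringBuilder instead of a file, and the XML declaration is left out to keep the result short. The individual writer calls are explained in the following paragraphs.

```csharp
using System;
using System.Text;
using System.Xml;

static class IndentDemo
{
    // Writes a small document into a string, with or without indentation.
    public static string Write(bool indent)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = indent;
        settings.OmitXmlDeclaration = true; // skip the <?xml ... ?> header in this demo

        StringBuilder sb = new StringBuilder();
        using (XmlWriter xw = XmlWriter.Create(sb, settings))
        {
            xw.WriteStartElement("users");
            xw.WriteStartElement("user");
            xw.WriteAttributeString("age", "22");
            xw.WriteElementString("name", "John Smith");
            xw.WriteEndElement(); // closes user
            xw.WriteEndElement(); // closes users
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Write(false)); // everything on a single line
        Console.WriteLine();
        Console.WriteLine(Write(true));  // one element per line, nested elements indented
    }
}
```

Without indentation, the whole document ends up on a single line, which is fine for machines but hard for people to read.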

Next, we'll create an instance of the XmlWriter class using the Create() factory method. We'll work in a using block. We pass the file path and the settings as parameters to the method:

using (XmlWriter xw = XmlWriter.Create(@"file.xml", settings))
{
}

Now, let's get to the actual writing. First, let's add in the document header:

xw.WriteStartDocument();

Then (as you should know by now) the root element has to follow which contains the rest of the XML. We use the WriteStartElement() and WriteEndElement() methods for writing elements. The first method takes the name of the element we're opening as a parameter. The second method determines the element name on its own from the document context and it doesn't have any parameters. Let's open the root element, which is the users element in our case:

xw.WriteStartElement("users");

Next, we'll move on to writing individual users so the code can be placed in a foreach loop.

We write a value into the element using the WriteValue() method, which takes the value as a parameter. Similarly, we can add an attribute to the element using the WriteAttributeString() method, whose parameters are the attribute name and its value. The value is always of the string type, so we have to convert the age to a string in our case. Looping and writing the <user> elements looks like this (without the nested elements):

foreach (User u in users)
{
    xw.WriteStartElement("user");
    xw.WriteAttributeString("age", u.Age.ToString());
    xw.WriteEndElement();
}

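We'll add one more WriteEndElement() call to close the root element and WriteEndDocument() to close the whole document. Like with text files, we empty the buffer using the Flush() method (the using block also flushes the writer when it disposes of it, so the explicit call is just a safeguard here). The entire application code now looks like this: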

// a collection of test users
List<User> users = new List<User>();
users.Add(new User("John Smith", 22, new DateTime(2000, 3, 21)));
users.Add(new User("James Brown", 31, new DateTime(2016, 10, 30)));
users.Add(new User("Tom Hanks", 16, new DateTime(2011, 1, 12)));

// the XmlWriter settings
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;

// writes the users
using (XmlWriter xw = XmlWriter.Create(@"file.xml", settings))
{
    xw.WriteStartDocument(); // the header
    xw.WriteStartElement("users"); // opens the root users element

    // writes individual users
    foreach (User u in users)
    {
        xw.WriteStartElement("user");
        xw.WriteAttributeString("age", u.Age.ToString());
        xw.WriteEndElement();
    }

    xw.WriteEndElement(); // closes the root element
    xw.WriteEndDocument(); // closes the document
    xw.Flush();
}
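If you're curious about the role of the using block, here is a small sketch showing that the output is complete once the block ends, even without calling Flush(). The class FlushDemo and its method are hypothetical, and the XML is written into a StringBuilder (without the declaration) so the result can be inspected directly.

```csharp
using System;
using System.Text;
using System.Xml;

static class FlushDemo
{
    // Writes a document without calling Flush() and returns whatever reached the output.
    public static string WriteWithoutFlush()
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true; // keeps the demo output short

        StringBuilder sb = new StringBuilder();
        using (XmlWriter xw = XmlWriter.Create(sb, settings))
        {
            xw.WriteStartElement("users");
            xw.WriteStartElement("user");
            xw.WriteAttributeString("age", "22");
            xw.WriteEndElement();
            xw.WriteEndElement();
            // no Flush() here: disposing of the writer at the end of the block flushes it
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(WriteWithoutFlush()); // prints <users><user age="22" /></users>
    }
}
```

Calling Flush() explicitly does no harm, so the lesson's code keeps it.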

Let's run the program and make sure that everything works. The output of the program should look like this (program folder/bin/debug/file.xml):

<?xml version="1.0" encoding="utf-8"?>
<users>
  <user age="22" />
  <user age="31" />
  <user age="16" />
</users>

We can see that the writer recognized that the user element contains nothing except an attribute, and generated it as an unpaired (self-closing) element. Now, let's add 2 more elements into the <user> element, namely the user's name and registration date:

xw.WriteStartElement("name");
xw.WriteValue(u.Name);
xw.WriteEndElement();
xw.WriteStartElement("registered");
xw.WriteValue(u.Registered.ToShortDateString());
xw.WriteEndElement();

Neither of these elements contains other elements or attributes. Elements like these (which only hold a text value) can be written using a single WriteElementString() call, whose parameters are the element's name and the value it should contain:

xw.WriteElementString("name", u.Name);
xw.WriteElementString("registered", u.Registered.ToShortDateString());

Both of the examples do the same thing.
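One thing to keep in mind is that ToShortDateString() formats the date according to the current culture of the computer, so the same program may write 3/21/2000 on one machine and 21.03.2000 on another. If the file is meant to be read elsewhere, a fixed format is safer. The following is only a sketch of one possible alternative, not something the lesson's project uses; the class DateFormatDemo and the yyyy-MM-dd format are assumptions chosen for the illustration.

```csharp
using System;
using System.Globalization;

static class DateFormatDemo
{
    const string Format = "yyyy-MM-dd";

    // Converts a date to text the same way on every machine.
    public static string ToXmlText(DateTime date)
    {
        return date.ToString(Format, CultureInfo.InvariantCulture);
    }

    // Converts the text back to a date using the same fixed format.
    public static DateTime FromXmlText(string text)
    {
        return DateTime.ParseExact(text, Format, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        string text = ToXmlText(new DateTime(2000, 3, 21));
        Console.WriteLine(text);                    // prints 2000-03-21
        Console.WriteLine(FromXmlText(text).Year);  // prints 2000
    }
}
```

With such a helper, the call would become xw.WriteElementString("registered", DateFormatDemo.ToXmlText(u.Registered)).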

We'll place the shorter variant, the one using WriteElementString(), in the part where we write the user elements, between the WriteStartElement() and WriteEndElement() method calls. To be completely clear, here's the code for the loop:

foreach (User u in users)
{
    xw.WriteStartElement("user");
    xw.WriteAttributeString("age", u.Age.ToString());

    xw.WriteElementString("name", u.Name);
    xw.WriteElementString("registered", u.Registered.ToShortDateString());

    xw.WriteEndElement();
}
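Earlier, we mentioned that storing a single object needs only minor changes. Here is a sketch of what that could look like: the loop and the wrapping users element are gone, and the user element itself becomes the root. The class SingleObjectDemo and its WriteUser() method are hypothetical, and the output is written to a string without the XML declaration to keep the example short.

```csharp
using System;
using System.Text;
using System.Xml;

static class SingleObjectDemo
{
    // Writes one user as the root element of the document.
    public static string WriteUser(string name, int age)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true; // keeps the demo output short

        StringBuilder sb = new StringBuilder();
        using (XmlWriter xw = XmlWriter.Create(sb, settings))
        {
            xw.WriteStartElement("user"); // the root element is the object itself
            xw.WriteAttributeString("age", age.ToString());
            xw.WriteElementString("name", name);
            xw.WriteEndElement();
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(WriteUser("John Smith", 22));
        // prints <user age="22"><name>John Smith</name></user>
    }
}
```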

That's it! In the next lesson, Reading XML via the SAX approach in C# .NET, we'll read XML files via SAX.


Article has been written for you by David Capka Hartinger

