Lesson 6 - Introduction to XML and writing via the SAX approach

In the previous lesson, Storing objects in CSV format in C# .NET, part 2, we wrote a database using text files, or more accurately, using files of the CSV format. In today's tutorial, we're going to focus on the XML format. First, we'll describe it, then, we'll introduce classes which the .NET framework provides for reading and writing to and from these files. We'll try writing out today and leave reading for the next lesson.

The XML format

We're about to go over lots of terms. Now, if you don't understand any of them, don't worry, we'll go into as much detail as possible :)

XML (eXtensible Markup Language) is a markup language developed by the W3C (the organization responsible for Web standards). XML is very universal and is supported by a large number of languages and applications. The word extensible indicates the ability to create your own language using XML; one such language is XHTML for creating websites. XML is a self-describing language, meaning that its structure tells us what each value means. In a CSV file, we can only guess what the number eight in the third column means, whereas in XML, it'd be immediately clear that it's the number of articles that a user has written. The disadvantage is that XML files are larger, but this isn't a problem in most cases. Personally, I almost always choose the XML format. It's a good choice for saving a program's configuration, high scores for game players, or a small user database. Thanks to XSD schemas, we can also validate XML files and so catch malformed data before it causes errors at run time.

XML can be processed in different ways, usually either by reading/writing it continuously or by using a DOM object structure. Things have come so far that some tools (including the .NET libraries, through LINQ to XML) allow us to work with XML much like a database and run SQL-like queries on it. As you can imagine, this saves a lot of work. Another language for querying XML files is XPath.
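To give a rough idea of what querying XML looks like, here's a minimal sketch that asks the same question twice, once with LINQ to XML and once with XPath. The QuerySketch class, its method names, and the sample data are made up for this illustration; the queries themselves only use XDocument and the XPathSelectElements() extension method from System.Xml.XPath.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class QuerySketch
{
    // Hypothetical sample data for this illustration only.
    public const string Xml =
        "<users>" +
        "<user age=\"22\"><name>John Smith</name></user>" +
        "<user age=\"31\"><name>James Brown</name></user>" +
        "</users>";

    // LINQ to XML: names of all users older than the given age.
    public static string[] NamesOlderThanLinq(int age)
    {
        XDocument doc = XDocument.Parse(Xml);
        return doc.Root.Elements("user")
            .Where(u => (int)u.Attribute("age") > age)
            .Select(u => (string)u.Element("name"))
            .ToArray();
    }

    // XPath: the same question written as a path expression.
    public static string[] NamesOlderThanXPath(int age)
    {
        XDocument doc = XDocument.Parse(Xml);
        return doc.XPathSelectElements("/users/user[@age > " + age + "]/name")
            .Select(e => e.Value)
            .ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", NamesOlderThanLinq(25)));
        Console.WriteLine(string.Join(", ", NamesOlderThanXPath(25)));
    }
}
```

Both approaches load the whole document into memory first, which is the opposite of the continuous approach we'll use for writing in this lesson.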

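XML competes with JSON, which is simpler but less popular in business applications. An XML document has to end with its closing root tag, so new data can't simply be added at the end of the file. JSON records, on the other hand, can be written one per line, which makes it easy to append to the end of a file (e.g. when logging) without loading the entire document.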

XML is very often used to exchange data between different systems (e.g. desktop applications and web applications on a server). Therefore, as we already mentioned, there are many libraries for it and practically every tool is able to work with it. This includes web services, SOAP, and so on. However, we won't deal with any of them now.

Last time, we saved a list of users to a CSV file. We saved their name, age, and date of registration. The values were next to each other, separated by semicolons. Each line represented a user. The file's contents looked like this:

John Smith;22;3/21/2000
James Brown;31;10/30/2012

Anyone who isn't directly involved wouldn't know what any of that means, would they? Here is the equivalent to that file in the XML format:

<?xml version="1.0" encoding="UTF-8" ?>
<users>
        <user age="22">
                <name>John Smith</name>
                <registered>3/21/2000</registered>
        </user>
        <user age="31">
                <name>James Brown</name>
                <registered>10/30/2012</registered>
        </user>
</users>

Now everyone can tell what is stored in the file. I saved age as an attribute just to demonstrate that XML is able to do things like that. Otherwise, it'd be saved as an element along with the name and registration date. Individual items are called elements. I'm sure you're all familiar with HTML, which is based on the same fundamentals as XML. The elements are usually paired, meaning that we write the opening element followed by the value and then the closing element with a slash. Elements can contain other elements, so it has a tree structure. Furthermore, we're able to save an entire hierarchy of objects into a single XML document.

At the beginning of an XML file, there is a header. The document has to contain exactly one root element in order to be well-formed. Here, it's the users element, which contains the other nested elements. Attribute values are written after the attribute name, following an equals sign, in quotation marks.
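We can check these rules quickly in code. The following is only a sketch: the WellFormedSketch class and its IsWellFormed() method are names made up for this example. It relies on the fact that an XmlReader with default settings throws an XmlException when the text isn't a well-formed document.

```csharp
using System;
using System.IO;
using System.Xml;

static class WellFormedSketch
{
    // Returns true when the text can be read from start to end as an XML document.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            using (XmlReader xr = XmlReader.Create(new StringReader(xml)))
            {
                while (xr.Read())
                {
                    // we only care whether reading succeeds
                }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // a single root element with a quoted attribute value
        Console.WriteLine(IsWellFormed("<users><user age=\"22\" /></users>"));
        // two root elements
        Console.WriteLine(IsWellFormed("<user /><user />"));
        // attribute value without quotation marks
        Console.WriteLine(IsWellFormed("<users><user age=22 /></users>"));
    }
}
```

Only the first document passes; the other two break the rules described above.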

As you can probably tell, the file got bigger, which is the price paid for it to look pretty. If the user had more than three properties, you'd be able to see just how messy the CSV format can get, and how worthwhile the XML format is. Personally, as I gain more and more experience, I prefer solutions that are clear and simple, even if that means that they occupy more memory. This applies not only to files but to source code as well. There is nothing worse than when a programmer looks at their code after a year and has no idea what the eighth value in a CSV file means when there are 100 numbers per line. Even worse is a five-dimensional array that is super fast but unreadable, where a well-designed object structure would never have to be rewritten. However, that last part was a bit of a tangent.

XML in .NET

We'll focus on two fundamental approaches to working with XML files: the continuous approach (the SAX parser) and the object-oriented approach (DOM). Today's lesson and the next one will be dedicated to SAX, after which we'll get to DOM. Again, there are more ways to work with XML files using the .NET framework. Some are old and only present for backward compatibility's sake. I've spent quite a lot of time working with XML files in .NET, so I've only included the most modern approaches and simple constructs.

Parsing XML via SAX

SAX (Simple API for XML) is, in essence, a simple extension of a text file reader. (Strictly speaking, the .NET classes we'll use are forward-only stream readers and writers rather than a callback-based SAX parser, but the idea of processing the document sequentially is the same.) Writing is relatively simple. We write the elements and attributes sequentially, in the same order as they appear in the file (we don't build the tree structure in this approach). .NET provides the XmlWriter class, which relieves us from having to deal with the fact that XML is a text file. We only work with the elements or, more accurately, nodes (more on that later).

Reading is performed just like writing. We read the XML like a text file, sequentially, from top to bottom. While reading, SAX gives us what are known as nodes (their kind is described by the XmlNodeType enumeration). A node can be an element, an attribute, or a value. We receive nodes in a loop in the same order that they're written in the file. We use the XmlReader class to read XML files. Both classes are in the System.Xml namespace.
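Reading gets a lesson of its own, but to make the idea of nodes more concrete, here's a small preview sketch. The NodeSketch class and its DescribeNodes() method are made up for this illustration. It reads an XML string node by node and records the kind of each node it meets.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class NodeSketch
{
    // Lists the element, text, and end-element nodes in the order the reader returns them.
    public static List<string> DescribeNodes(string xml)
    {
        List<string> result = new List<string>();
        using (XmlReader xr = XmlReader.Create(new StringReader(xml)))
        {
            while (xr.Read()) // moves to the next node, returns false at the end
            {
                if (xr.NodeType == XmlNodeType.Element)
                    result.Add("Element " + xr.Name);
                else if (xr.NodeType == XmlNodeType.Text)
                    result.Add("Text " + xr.Value);
                else if (xr.NodeType == XmlNodeType.EndElement)
                    result.Add("EndElement " + xr.Name);
            }
        }
        return result;
    }

    static void Main()
    {
        string xml = "<user age=\"22\"><name>John Smith</name></user>";
        foreach (string node in DescribeNodes(xml))
            Console.WriteLine(node);
    }
}
```

Note that in this sketch the loop doesn't list the age attribute as a separate item; with XmlReader, attributes are read from the element the reader is currently positioned on (e.g. using GetAttribute()).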

The advantage to the SAX approach is its high speed and low memory requirements. We'll see the disadvantages once we compare this approach to the DOM object-oriented approach later on.

Writing XML files

Let's create a simple XML file. We'll use the users example from above. We already worked with the User class last time, but just to be sure, I'll show it here once more. Create a new project, a console application, name it XmlSaxWriting, and add a new class to the project:

using System;

class User
{
        // the properties can only be set from inside the class
        public string Name { get; private set; }
        public int Age { get; private set; }
        public DateTime Registered { get; private set; }

        public User(string name, int age, DateTime registered)
        {
                Name = name;
                Age = age;
                Registered = registered;
        }

        // a user is displayed by their name
        public override string ToString()
        {
                return Name;
        }
}

For simplicity's sake, we'll write the code right in the Main() method. All we're really doing is testing out SAX's functionality. At this point, you should already know how to design object-oriented applications properly.

Don't forget to add using System.Xml.

We create an XmlWriter using the (static) factory Create() method. There is another way to do it, but this method is the most appropriate. The object will be wrapped in a using block. Of course, we could also store just a single object in XML (e.g. some settings). Here, we'll learn how to store a list of several objects. If you only want to store one object, you'll only need to make very minor changes :)

First, let's create a list of some test users:

List<User> users = new List<User>();
users.Add(new User("John Smith", 22, new DateTime(2000, 3, 21)));
users.Add(new User("James Brown", 31, new DateTime(2016, 10, 30)));
users.Add(new User("Tom Hanks", 16, new DateTime(2011, 1, 12)));

Now we have something to write. We'll have the XML output be nicely formatted and indented according to its tree structure. Unfortunately, this setting is not default, so we'll have to force it by passing an XmlWriterSettings class instance. We'll set its Indent property to true:

XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;

Done. Next, we'll create an instance of the XmlWriter class using the factory Create() method. We'll work in a using block. We pass the file path and the settings as parameters to the method:

using (XmlWriter xw = XmlWriter.Create(@"file.xml", settings))
{
}
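If you're curious what the Indent setting actually changes, the following sketch writes the same tiny document twice, once with and once without indentation. To make the result easy to inspect, it writes into a StringBuilder instead of a file and omits the XML header; the IndentSketch class and its Write() method are names made up for this demo.

```csharp
using System;
using System.Text;
using System.Xml;

static class IndentSketch
{
    // Writes a small document into a string, with or without indentation.
    public static string Write(bool indent)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = indent;
        settings.OmitXmlDeclaration = true; // no header, to keep the demo output short

        StringBuilder sb = new StringBuilder();
        using (XmlWriter xw = XmlWriter.Create(sb, settings))
        {
            xw.WriteStartElement("users");
            xw.WriteStartElement("user");
            xw.WriteAttributeString("age", "22");
            xw.WriteEndElement(); // closes user
            xw.WriteEndElement(); // closes users
        } // leaving the using block flushes the writer into the StringBuilder
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Write(false));
        Console.WriteLine();
        Console.WriteLine(Write(true));
    }
}
```

Without indentation, everything ends up on a single line, which is fine for machines but hard to read for people.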

Now, let's get to the actual writing. First, let's add in the document header:

xw.WriteStartDocument();

Then (as you should know by now) the root element has to follow which contains the rest of the XML. We use the WriteStartElement() and WriteEndElement() methods for writing elements. The first method takes the name of the element we're opening as a parameter. The second method determines the element name on its own from the document context and it doesn't have any parameters. Let's open the root element, which is the users element in our case:

xw.WriteStartElement("users");

Next, we'll move on to writing individual users so the code can be placed in a foreach loop.

We write a value into an element using the WriteValue() method, which takes the value as a parameter. Similarly, we can add an attribute to an element using the WriteAttributeString() method, whose parameters are the attribute name and its value. This method always takes the value as a string, so we have to convert the age to a string in our case. Looping and writing the user elements looks like this (without the nested elements):

foreach (User u in users)
{
        xw.WriteStartElement("user");
        xw.WriteAttributeString("age", u.Age.ToString());
        xw.WriteEndElement();
}

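We'll add one more WriteEndElement() call to close the root element and a WriteEndDocument() call to close the whole document. Like with text files, we empty the buffer using the Flush() method (leaving the using block would flush the writer as well, so the call mainly makes the intent explicit). The entire application code now looks like this: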

// a collection of test users
List<User> users = new List<User>();
users.Add(new User("John Smith", 22, new DateTime(2000, 3, 21)));
users.Add(new User("James Brown", 31, new DateTime(2016, 10, 30)));
users.Add(new User("Tom Hanks", 16, new DateTime(2011, 1, 12)));

// the XmlWriter settings
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;

// writes users
using (XmlWriter xw = XmlWriter.Create(@"file.xml", settings))
{
        xw.WriteStartDocument(); // the header
        xw.WriteStartElement("users"); // opens the root users element

        // writes individual users
        foreach (User u in users)
        {
                xw.WriteStartElement("user");
                xw.WriteAttributeString("age", u.Age.ToString());
                xw.WriteEndElement();
        }

        xw.WriteEndElement(); // closes the root element
        xw.WriteEndDocument(); // closes the document
        xw.Flush();
}

Let's run the program and make sure that everything works. The output of the program should look like this (program folder/bin/debug/file.xml):

<?xml version="1.0" encoding="utf-8"?>
<users>
  <user age="22" />
  <user age="31" />
  <user age="16" />
</users>

We can see that the writer recognized that the user element contains nothing but an attribute and generated it as an unpaired element. Now, let's add 2 additional elements into the user element, namely the user's name and registration date:

xw.WriteStartElement("name");
xw.WriteValue(u.Name);
xw.WriteEndElement();
xw.WriteStartElement("registered");
xw.WriteValue(u.Registered.ToShortDateString());
xw.WriteEndElement();

Neither of these elements contains additional elements or attributes. This sort of element (one that only holds a text value) can be written using a single WriteElementString() method, whose parameters are the element's name and the value it should contain:

xw.WriteElementString("name", u.Name);
xw.WriteElementString("registered", u.Registered.ToShortDateString());

Both of the examples do the same thing.
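If you'd like to convince yourself, here's a small sketch that writes the name element both ways into a string and compares the results. The EquivalenceSketch class and its methods are made up for this check, and the output goes to a StringBuilder (with the header omitted) only so that the two results can be compared easily.

```csharp
using System;
using System.Text;
using System.Xml;

static class EquivalenceSketch
{
    // Creates a writer that writes into the given StringBuilder without the XML header.
    static XmlWriter CreateWriter(StringBuilder sb)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        return XmlWriter.Create(sb, settings);
    }

    // The longer variant: start element, value, end element.
    public static string WriteLong(string name)
    {
        StringBuilder sb = new StringBuilder();
        using (XmlWriter xw = CreateWriter(sb))
        {
            xw.WriteStartElement("name");
            xw.WriteValue(name);
            xw.WriteEndElement();
        }
        return sb.ToString();
    }

    // The shorter variant: a single WriteElementString() call.
    public static string WriteShort(string name)
    {
        StringBuilder sb = new StringBuilder();
        using (XmlWriter xw = CreateWriter(sb))
        {
            xw.WriteElementString("name", name);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(WriteLong("John Smith"));
        Console.WriteLine(WriteShort("John Smith"));
        Console.WriteLine(WriteLong("John Smith") == WriteShort("John Smith"));
    }
}
```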

We'll place the shorter one in the part where we write the user elements, between the WriteStartElement() and WriteEndElement() method calls. To be completely clear, here's the code for the loop:

foreach (User u in users)
{
        xw.WriteStartElement("user");
        xw.WriteAttributeString("age", u.Age.ToString());

        xw.WriteElementString("name", u.Name);
        xw.WriteElementString("registered", u.Registered.ToShortDateString());

        xw.WriteEndElement();
}
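For reference, here's one way the pieces from this lesson could be put together into a single file. Take it as a sketch rather than the definitive version: the UserXmlSketch class and the WriteUsers() method are names introduced only for this listing, and the User class is repeated (without ToString()) so that the listing is complete on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

class User
{
    public string Name { get; private set; }
    public int Age { get; private set; }
    public DateTime Registered { get; private set; }

    public User(string name, int age, DateTime registered)
    {
        Name = name;
        Age = age;
        Registered = registered;
    }
}

static class UserXmlSketch
{
    // Writes the given users into an XML file at the given path.
    public static void WriteUsers(string path, List<User> users)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;

        using (XmlWriter xw = XmlWriter.Create(path, settings))
        {
            xw.WriteStartDocument();        // the header
            xw.WriteStartElement("users");  // opens the root element

            foreach (User u in users)
            {
                xw.WriteStartElement("user");
                xw.WriteAttributeString("age", u.Age.ToString());
                xw.WriteElementString("name", u.Name);
                xw.WriteElementString("registered", u.Registered.ToShortDateString());
                xw.WriteEndElement();       // closes user
            }

            xw.WriteEndElement();           // closes the root element
            xw.WriteEndDocument();          // closes the document
        } // leaving the using block flushes and closes the file
    }

    static void Main()
    {
        List<User> users = new List<User>();
        users.Add(new User("John Smith", 22, new DateTime(2000, 3, 21)));
        users.Add(new User("James Brown", 31, new DateTime(2016, 10, 30)));
        users.Add(new User("Tom Hanks", 16, new DateTime(2011, 1, 12)));

        WriteUsers("file.xml", users);
        Console.WriteLine(File.ReadAllText("file.xml"));
    }
}
```

One thing to keep in mind is that ToShortDateString() formats the date according to the current culture, so the same program may write 3/21/2000 on one computer and 21.03.2000 on another. If the file is meant to be read on other machines, a fixed format such as u.Registered.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) (from the System.Globalization namespace) is a safer choice.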

That's it! In the next lesson, Reading XML via the SAX approach in C# .NET, we'll read XMLs via SAX.

Article has been written for you by David Capka