Australia Pilot — Zone Integration Server (ZIS) Enterprise Scale Testing — Detailed Results
In this pilot, there were three objectives, each to be met by a separate test or series of tests:
- Scalability and Load Balance: “To demonstrate that SIF can be implemented to accommodate an increasing transaction load through the addition of system resource, to ensure that SIF architecture can handle projected loads of Victoria and Western Australian Departments of Education.”
- Availability: “The purpose of this test is to determine if a SIF implementation can meet the quality of service characteristic of availability. In this instance, the ZIS cluster is configured to provide active/active failover configuration which, in the event of a failure of a node, the remaining nodes will gracefully continue to process messages without the loss of any messages.”
- Recoverability: “The purpose of this test is to determine if a SIF implementation can meet the quality of service characteristic of recoverability. In this case, the ZIS cluster will provide ‘store and forward’ functionality if an agent is unable to process messages. In the event of an agent failure, the ZIS service will store messages. On the recovery of the agent the ZIS service will forward these messages to the agent. It is expected that this will occur without the loss of any messages.”
The Lab Environment
To prepare for these tests, we needed a lab with enough equipment to simulate a typical day's traffic in one of the larger Australian states, covering both base and peak loads and exercising all the ZIS features needed to meet the above requirements. Our initial estimate called for five servers: four dedicated to the ZIS (two database and two web servers) and one for the SIF agents. As the work progressed, we came to realize that although the ZIS allocation was more than enough, we needed to add more power to the agent pool to meet the requirements of the first test.
In setting up this environment, we were faced with many of the same technology choices that Australian IT departments would face when setting up a similar infrastructure. Rather than walking through that selection process again, we have documented those findings in Building a ZIS Farm; below we describe the high-availability solution we chose:
Although there were a few good choices, we chose Microsoft® Windows® Server 2008 R2 Network Load Balancing (NLB) in multicast mode with no affinity. The affinity setting is important: with no affinity, every new incoming message is sent to whichever server is available, so a single particularly active SIF publisher cannot overload one server. By choosing the less powerful of the two most widely used server farming technologies (the other being hardware accelerators), we felt that this study would be more transferable to the untested alternative. Each of the servers has two network cards; one was used for the load-balancing interface and the other was kept separate for server administration.
For the database tier we had many more choices, but the list of acceptable ones was shorter. Some were ruled out because they would not work well with the requirements of the SIF protocol, and others because of the requirements of these tests. For these tests we chose SQL Server® Enterprise Edition with database mirroring, using a third server as a witness. We configured ZIServer to know about both partners, allowing the fastest possible transition if one of them failed. This choice also gives us two copies of each of ZIServer's two databases and lets us take advantage of SQL Server's automatic page repair facility should a page in one of the databases become corrupted.
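For readers who want to see what "knowing about both partners" can look like from a client's perspective, the sketch below shows a connection string that names the mirroring failover partner. This is a minimal illustration only, written in Python with pyodbc; the driver name and keyword support depend on the ODBC driver actually installed, and ZIServer's own configuration mechanism is not shown here. The server and database names (GROUCHO, HARPO, ZIS) are the ones used later in this report.

```python
import pyodbc

# Minimal sketch: a client connection that names the database mirroring failover partner.
# Server and database names follow the lab setup described in this report; the keyword
# support shown here depends on the ODBC driver actually installed.
conn_str = (
    "DRIVER={SQL Server Native Client 10.0};"
    "SERVER=GROUCHO;"               # principal server for the ZIS working database
    "Failover_Partner=HARPO;"       # mirror server, used if the principal is unreachable
    "DATABASE=ZIS;"
    "Trusted_Connection=yes;"
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT DB_NAME()")   # sanity check against whichever partner answered
print(cursor.fetchone()[0])
conn.close()
```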
Test Lab Infrastructure
Since all the servers were in a common location, we used HTTP to communicate between them over a 1 Gb (gigabit) network using unmanaged switches. Although an external firewall was in place and the Windows firewall was active on all servers, the ports through which the servers communicated with each other were opened for these tests.
The Test Lab
As mentioned, the ZIServer™ is implemented using four servers in a high-availability cluster. The seven other servers shown in the diagram below all have copies of ZIAgent™ installed and, depending on the test, run scripts that either issue requests or generate events. These requests ask a provider agent for all copies of StudentPersonal, or one of several other object types, delivered in one of several different message sizes. NOTE: The ZIAgent configurable SIF agent has a built-in scripting language originally designed to let agent owners develop regression test suites for their SIF agents, making it simple to test new agent versions as new versions of their applications or of the SIF specification are released. We found that same facility ideal for creating the different test scenarios used in this study.
Computer naming themes: we started off with the Marx Brothers (Groucho, Harpo, Chico, Gummo, Zeppo, and, sort of, Karl), then moved on to the Three Stooges (Moe, Larry, Curly, Shemp; why three?), and ended up with characters from Ren & Stimpy (Haggis, Sven). We did not, however, give a name to the Internet.
Test 1 — The Scalability and Load Balancing Test
For this test, we created a set of scripts that caused agents to pass messages to each other: a mix of events, requests and responses, in a variety of message sizes ranging from 1K to 128K. We relied on the audits stored in the ZIS to keep track of the messages that flowed through it, counting the messages coming in as well as those going out. We counted both because the two numbers may differ. For example, in a zone where four agents subscribe to the StudentPersonal object, for every one StudentPersonal object received by the ZIS, four copies are distributed. In the same zone, if a single agent makes a SIF request for StudentPersonal objects, one message is distributed for each one received by the ZIS.
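The fan-out arithmetic above is simple enough to express in a few lines. The sketch below is purely illustrative (it is not ZIS code) and uses the four-subscriber example from this paragraph.

```python
# Illustrative sketch of the fan-out arithmetic described above (not ZIS code):
# events are copied to every subscriber, while request/response traffic is one-to-one.

def messages_out(messages_in: int, kind: str, subscribers: int) -> int:
    """Outbound messages distributed by the ZIS for a given inbound count."""
    if kind == "event":
        return messages_in * subscribers   # one copy per subscribing agent
    if kind == "response":
        return messages_in                 # each response goes only to the requester
    raise ValueError(f"unknown message kind: {kind}")

# The zone from the example above: four agents subscribe to StudentPersonal.
print(messages_out(1, "event", subscribers=4))      # 4 messages out for 1 in
print(messages_out(10, "response", subscribers=4))  # 10 messages out for 10 in
```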
Target Message Loads
In this test, the goal was to test that the selected infrastructure was capable of handling the average traffic that would be seen by an Australian state the size of Western Australia or Victoria. This traffic was estimated to be:
- Base level of 25 messages per second for a sustained period of time
- Peak periods of time with up to 75 messages per second
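For perspective, the base rate works out to a little over two million messages per day if sustained around the clock, which is how Test 2.5 later characterizes it. The arithmetic:

```python
# Rough arithmetic behind the target load levels.
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

base_rate = 25    # messages per second, sustained
peak_rate = 75    # messages per second, during peak periods

print(base_rate * SECONDS_PER_DAY)    # 2,160,000 messages/day if sustained all day
print(peak_rate * SECONDS_PER_DAY)    # 6,480,000 messages/day at the peak rate
```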
Sample Data Composition
For these tests, we tried to make the data and the composition of the agents and zones resemble what might be found in a typical school. To do this, we used ZIAgent's scripting capabilities to form:
- 20 separate SIF zones, representing 20 schools
- 180 SIF agents, with 20 providers and 160 subscribers
- A mixture of events and request/response streams that varied according to the test that was being performed
- A mixture of message sizes (from 1K to 128K, as described above)
- A mixture of SIF object types (StudentPersonal and several others)
What is the measured load capacity of SIF infrastructure?
The following chart was derived from a one hour sample taken from the data collected during the test. The black line represents the average message rate, which hovered at about 40 messages/second throughout the test.
This chart also illustrates peak periods in the 60–80 messages/second range and others where the message traffic was lighter. In this simulation, we used a number of normally configured SIF agents drawing from a variety of sources (as opposed to specially built software designed to pump out messages from a single source as fast as possible) because we felt this would more closely reflect a real installation. NOTE: The agent scheduler checks for jobs to be rescheduled once a minute. The periods of light activity near 226, 601, 976, etc. are most likely one set of scripts finishing and those servers sitting idle until the next script instance was scheduled.
Server Utilization — Web Servers
The following chart summarizes how the web servers were utilized during these tests:
In short, either of the two web servers was capable of handling the entire web traffic by itself.
Specifics: The following screen captures show how the servers held up during this "stress test". For the web servers, we measured how they were withstanding the traffic in four areas: CPU utilization, memory consumption, disk performance and network traffic. Samples were taken from both servers and were almost identical (as we expected them to be).
The following screen shot is typical of what we saw throughout the test for the two web servers. The two servers ran at about 20% of their capacity throughout the test.
NOTE: if you look at the total CPU graph, you will see a blue line running at or near the 100% mark and a note to the left of the graph that reads "96% Maximum Frequency". This is not a measure of CPU consumption but rather the result of configuring the servers' power plan so that the processor's "minimum processor state" and "maximum processor state" are both set to 100%. This tells an Intel processor to always run at full speed and use "Turbo Boost" when appropriate. The default power plan for Windows Server 2008 is "Balanced"; while this saves power (less than 2 watts per server in our testing), the default minimum processor state is so low that the newer Xeon chips' "Turbo Boost" capability will never be used unless this setting is changed.
Each of our two web servers has 8GB of RAM. The following screen shot shows how much memory is being used to service web requests and route SIF messages. These web servers are each configured to handle 96 messages concurrently and could be configured to handle more.
On this web server, in the heat of the stress test, about one fourth of the machine's physical memory was in use, leaving about 6GB available.
This screen shot shows a sample of the disk activity during the test.
To put this in perspective, the only writing the ZIServer ZIS does to the trace.txt file (the most active file on the disk subsystem) is three lines per message it receives: a notification that it received the message, the type of that message, and a notification that the acknowledgment was returned. Also note that the scale on the graph runs from 0 to 1MB/second, whereas the disk subsystem is capable of at least 300 times that. NOTE: this performance is indicative of ZIServer and would not be consistent with disk activity readings from a ZIS that uses a file-based database.
In our test lab we used gigabit NICs and switches throughout to connect the servers. Each of the servers had two NICs, but only one of them was used for ZIS message traffic (the other was reserved for ZIS administration). The following screen shot shows the network activity during the test:
As with the other measurements, the chosen environment was sufficient to handle the given load, averaging about 10 Mb/s of traffic during the test (from the seven servers sending input to two web servers and two database servers).
Server Utilization — Database Servers
In these tests, two database servers were used: GROUCHO and HARPO. Each had two roles with respect to the two ZIS databases (the 'ZIS' working database and the 'ZISAudits' auditing database):
- GROUCHO was the principal server for the ZIS working database and the mirror (backup) for the ZISAudits database
- HARPO was the principal server for the ZISAudits database and the mirror for the ZIS database
These servers were more heavily utilized than the web servers, but still well within their capabilities. The utilization for these servers was recorded from within SQL Server and is summarized here:
(GROUCHO (Primary for ZIS)) The following graph illustrates the server utilization for the more active of the two database servers (GROUCHO) during the scalability and load balancing test:
(GROUCHO (Primary for ZIS)) It shows that an average of 1,100 requests/second were handled through about 120 simultaneous connections.
The transaction logs became so large (gigabytes) that we decided to leave them online and grant remote access to any of the AU staff who needed it. At various junctures we made database backups, and when we thought the data might be interesting we restored it back to the server for analysis.
- As mentioned earlier, our biggest difficulty was in getting enough data to the ZIServer to burden it sufficiently for these tests. Since agents have much more work to do than Zone Integration Servers, we ended up needing many times the "agent power" to put a strain on our clustered ZIS.
- Our ZIServer product (unlike others) has built-in message auditing. While we were setting up for this testing, we timed results with auditing on and off and they were essentially the same. The details collected during the other tests confirmed this, showing how little overhead the auditing added to the work the ZIS was already doing.
Optimal Message Size
On the web servers, every SIF push agent being managed requires a thread of a "push agent service" to watch that the agent is receiving its messages correctly. Each of these threads needs memory buffers at least large enough to hold a message or two, and ideally enough to cache several more. When an enterprise-class ZIS is managing hundreds to thousands of agents, the size of these messages begins to matter. In the US, we see some agents passing messages as large as 4MB. If a ZIS is to support 1,000 SIF agents and each thread needs two buffers (it might need more like 5 or 6), the server would need 8GB of memory for the SIF agent thread buffers alone; the server farm would need to grow very quickly. The question arises: does an agent passing 4MB messages have better throughput than one passing multiple 128K messages? While the environment was set up, we ran some additional tests to find out. This is what we found:
These results may vary slightly depending on the hardware and the data being used, but we have found that they don't tend to vary much and that the optimal SIF message size tends to be somewhere in the range of 64K to 128K. So not only are huge SIF messages a poor use of ZIS memory resources, but this test (and earlier tests that showed similar results) shows that they are not very efficient either.
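To make the memory claim concrete, here is a back-of-the-envelope version of the arithmetic from the opening paragraph of this section. The two-buffers-per-thread figure is the assumption stated there; real deployments may need more.

```python
# Back-of-the-envelope version of the push-agent buffer arithmetic.
def buffer_memory_gb(agents: int, buffers_per_agent: int, message_mb: float) -> float:
    """Approximate memory (GB) consumed by push-agent thread buffers alone."""
    return agents * buffers_per_agent * message_mb / 1024

# 1,000 push agents, 2 buffers each, 4MB messages: roughly the 8GB quoted above.
print(round(buffer_memory_gb(1000, 2, 4.0), 2))     # ~7.81
# The same agents exchanging 128K messages need only a small fraction of that.
print(round(buffer_memory_gb(1000, 2, 0.125), 2))   # ~0.24
```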
What was the cost of "high availability" during this test?
From the perspective of the web servers, the cost was minimal to none: since the high availability was built into the load balancing functionality, adding the one feature (NLB) increased the throughput without costing anything more than the second server and its software licenses. Since the additional licenses for the added server(s) cost less than the first server, the cost per message handled by a multi-server farm is substantially less than that of multiple smaller installations. From the point of view of the database, however, there are disk space and performance costs to maintaining multiple copies of the ZIS databases. These can be measured partially by looking at how much time the principal server spends waiting for the mirror server to replicate its write transactions. This graph shows an overview of the things that the GROUCHO server spent time waiting for during the test:
(GROUCHO (Primary for ZIS)) Disk mirroring falls into the category of 'Other'. In order to see how much of the 'Other' category corresponds to disk mirroring, we can look at the numeric detail behind this graph:
| Wait Category | Completed Waits | Wait Time (ms/sec) | % of Total Wait Time |
Adding up the categories that correspond to database mirroring, we find that 19.82% of the SQL Server wait time was spent on mirror-related activities (where DBMIRROR_EVENTS_QUEUE means the mirror process is idle, waiting for work to do). Subtracting DBMIRROR_EVENTS_QUEUE, we are left with 12.5%. Finally, consider that the SQL Server total wait time was itself only a very small fraction of the total time SQL Server spent doing its job, and the mirroring overhead becomes tiny compared to its value.
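As a small reasoning step between the two figures quoted above, the share attributable to the idle DBMIRROR_EVENTS_QUEUE category is implied by their difference:

```python
# Arithmetic implied by the percentages quoted above (shares of total SQL Server wait time).
mirror_related_total = 19.82   # all mirroring-related wait categories combined
excluding_idle_queue = 12.5    # the same total with DBMIRROR_EVENTS_QUEUE removed

idle_queue_share = mirror_related_total - excluding_idle_queue
print(f"DBMIRROR_EVENTS_QUEUE accounted for about {idle_queue_share:.2f}% of the wait time")
# i.e. roughly 7.3% of the wait time was the mirror process simply waiting for work
```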
What did "high-safety" mode cost us?
SQL Server offers two database mirroring modes: synchronous ("high-safety") and asynchronous ("high-performance"). The difference is that in high-safety mode, when a server has finished writing to disk and needs to copy the transaction to the mirror server, it waits until the mirror's write has completed before considering the write committed. In high-performance mode, it asks the mirror server to write it out and carries on without waiting. In our early testing we compared the two modes of operation and were surprised to find that the difference was so small it got "lost in the noise": other background activity (virus checkers, automatic updates, other operating system services, etc.) had a much larger effect than the difference between the two modes, so we could not reliably tell them apart. The short answer: next to nothing in the scheme of things.
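For reference, switching between the two modes is a single statement issued on the principal server. The sketch below (Python with pyodbc, using the ZIS database and GROUCHO server names from our lab) is illustrative only; the same statements can be run directly from SQL Server Management Studio.

```python
import pyodbc

# Sketch only: toggling SQL Server database mirroring between synchronous ("high-safety")
# and asynchronous ("high-performance") operation. The statements run on the principal.
conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 10.0};SERVER=GROUCHO;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,   # ALTER DATABASE cannot run inside a user transaction
)
cursor = conn.cursor()

# High-safety (synchronous): a commit waits until the mirror has hardened the log record.
cursor.execute("ALTER DATABASE ZIS SET PARTNER SAFETY FULL")

# High-performance (asynchronous): a commit does not wait for the mirror.
# cursor.execute("ALTER DATABASE ZIS SET PARTNER SAFETY OFF")

conn.close()
```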
What is the typical configuration required to support the required load levels of Australian jurisdictions at the enterprise level?
The following recommendations are specific to the software tested in this study: the ZIServer Zone Integration Server running on Microsoft Windows Server 2008, using Microsoft SQL Server 2008 as its database server. We mention this because, for example, if you used a file-based database instead of an enterprise-class database, the scalability characteristics would be entirely different because the designs are different (there would likely be more disk activity, perhaps less of something else). Given this, we recommend installing, at a minimum, something similar to the configuration we used to conduct these tests: four servers, two for web services and two for database services. As these test results show, the only one of the four servers that showed any stress under these conditions was the database server (GROUCHO) that was the principal server for the ZIS database. Although its configuration was adequate, its CPU usage was in the 40–50% range during the testing period.
Because our offices are located in a historic building and we have some electrical as well as air conditioning constraints, we decided to use tower servers (as opposed to rack-mount servers). Each of the four servers used for the ZIServer farm was configured identically:
- Dell PowerEdge T410 Server with two Xeon 5500 Series Processors
- 8GB memory
- 2 x 500 GB Disk (RAID 1)
- 2 Standard Gigabit NIC Adapters
The system-level software we chose to install on these servers was:
- Windows Server 2008 R2 Enterprise Edition (although Standard Edition would have met the needs of this pilot and would meet the needs of a smaller ZIServer implementation as well)
- SQL Server 2008 Enterprise Edition (although Standard Edition would have met the needs of this pilot and would meet the needs of a smaller ZIServer implementation as well)
- Microsoft Forefront Client Security
How do large-scale single systems with multiple functionality compare with the connect-up approach using SIF?
The SIF approach and the approaches taken in many other industries share the same basic principles when the technology is viewed from a high level: data is packaged into XML messages and sent from a publisher to the subscribers that need it, when it is needed, without human intervention. It is a huge improvement over what it replaces and is almost always easy to cost-justify. SIF, however, does have a few advantages that many of these lack, some made possible by a few pioneering risk-takers who worked together, and some by government agencies that provided the added incentive to make it work when needed. These advantages include:
- Unlike the large scale systems in other industries, SIF has an industry-wide standard that is being adopted by the supplier (vendor) community. Once that adoption is ubiquitous, a supplier will only need to support a single automated interface to its application, instead of the 20–30 that it may have needed to support in the past.
This will lower the cost of application software, improve its quality (more time spent on the application, less on interfaces), and/or encourage others to enter the marketplace.
- The school organization gains supplier independence because, at least from a data interface point of view, applications of the same type from different suppliers start to become more interchangeable.
Is SIF suitable to support the required load levels of an enterprise deployment in WA/Victoria?
As shown through these tests, a SIF infrastructure should be able to support the required load levels of an enterprise deployment in WA or Victoria in a reliable and robust environment. The Zone Integration Server software that we tested, installed on modest hardware, was able to handle the estimated load requirements for these Australian states with minimal to average machine utilization. As we will see in Test 2.5, even in failover mode the tested software on the tested platform was capable of sustaining a load that exceeds the estimated normal load for an Australian state such as Victoria or Western Australia.
Does the SIF approach seem likely to reduce risks when compared with proprietary systems?
Assuming that both the proprietary and SIF infrastructures were implemented using similar high availability approaches (probably similar to those we used in these tests), both implementations would be very reliable and have similar operational risk profiles. As to long term risks, we believe that a SIF implementation would have a few risk benefits when compared with a proprietary integration:
- Since each supplier is responsible for its own interface, and since there is an independent certifying body monitoring the quality of these interfaces, there is likely to be a high level of interface quality, a test to certify that it works properly, and motivation in the right place to make sure it continues to do so.
- Since the format of data is standard between one state and its neighbors, no conversion is necessary when data needs to be consolidated at the federal level.
- Because SIF provides application supplier independence, an educational institution (EI) becomes less beholden to a particular supplier and if that supplier no longer meets the EI's requirements, the change to another supplier is far less complex because the interface remains constant.
- Because SIF's interface is extensible, it is adaptable enough to meet requirements that go beyond the pre-defined object definitions when necessary.
Test 2 — The Availability Test
Unlike the previous test, we did not conduct this test with all seven servers sending messages at full volume at the same time, but with a single server sending a single stream of data while we watched it closely. As in the previous test, we created a test script using ZIAgent's scripting language. We chose to use a request/response pair for two reasons:
- The operation would last long enough for the entire test (SIF events happen so quickly they would be difficult to document)
- It is one of the (if not the) most difficult things for a ZIS to get right when it does a failover
The test consisted of the following for each of the four servers that made up the ZIS server farm:
- Record how many StudentPersonal records there are in the sample SIF Provider database
- Clean out the StudentPersonal staging table from the Subscriber SIF agent database
- Have the Subscriber SIF agent make a request for all StudentPersonal objects with a small message size, so that the messages carry one object each
- As soon as the messages start transferring, shut down the server we are testing
- Let the remainder of the test complete with the remaining three servers
- When the test has completed, check the subscribing SIF agent's StudentPersonal staging table to make sure that the correct records were received
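The last step in this procedure, verifying that the correct records arrived, can be automated along the lines of the sketch below. The DSNs, table names and key column (RefId) are assumptions made for illustration and are not the actual ZIAgent schema.

```python
import pyodbc

# Illustrative verification of the final step above: compare the provider's StudentPersonal
# records against what arrived in the subscriber's staging table after the failover.
provider = pyodbc.connect("DSN=ProviderAgentDB")
subscriber = pyodbc.connect("DSN=SubscriberAgentDB")

def refids(conn, table):
    """Return the set of SIF RefIds present in the given table."""
    rows = conn.cursor().execute(f"SELECT RefId FROM {table}").fetchall()
    return {row.RefId for row in rows}

sent = refids(provider, "StudentPersonal")
received = refids(subscriber, "StudentPersonalStaging")

print(f"provider: {len(sent)} records, subscriber: {len(received)} records")
print("missing after failover:", sent - received)   # expected to be empty
print("unexpected extras:", received - sent)        # expected to be empty
```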
Reports from Availability Test
After the test, the counts in the subscribing database matched the counts in the provider database, and the objects received matched the objects requested. To illustrate this process more fully, we created a series of videos from screen captures taken during the tests themselves. In this first test, we shut down GROUCHO, the "Principal" database server for the ZIS database (the main working database for the ZIServer) and the mirror for the ZISAudits database. When GROUCHO is shut down, responsibility for its operation shifts to HARPO. At that point, each web server, when it next tries to access GROUCHO, gets an error indicating that GROUCHO has gone offline and begins to access the failover partner HARPO instead.
In this next video, a similar scenario is followed, but this time the roles are reversed: HARPO is shut down and GROUCHO becomes the principal server for both databases.
In this example, one of the two web servers was shut down. In this scenario, some of the responsibilities for delivering records properly belong to the agents and others belong to the ZIS. In terms of the web servers, there are several types of activity they are involved with:
- Receiving incoming events
- Receiving messages that are part of a series of responses
- Duplicating, filtering and distributing event messages
- Filtering and distributing response messages
The first three are much simpler than the last, and the last is much simpler for SIF pull agents than for SIF push agents. This is why we patterned our test around the last: it is the "worst case scenario" and the hardest for a ZIS to handle properly, especially in a farmed configuration.
Did the provider resend the response stream from the beginning and did the application that was receiving the responses ignore the responses it already received and process the new responses?
Yes, they did, as required by the SIF specification. After the failover, there were some messages still in queues that needed to be "flushed out", but that is normal and the SIF agents were programmed to expect this behavior (it is part of what the SIF specification expects of agents).
Did the ZIS fulfill its responsibilities according to section 4.2 of the SIF specification?
Yes, the ZIServer software behaved as required by Section 4.2 of the SIF specification.
Notes on Bringing Servers Back Into Operation
When servers are brought back online, they must explicitly be added back into service:
- Database servers must be restored as the "Principal" server for their normal role. This is a standard SQL Server operation that involves using SQL Server Management Studio to manually fail the database over from the backup server back to its normal server (a scripted equivalent is sketched after this list).
- After a web server is brought back online, the web server farm should be re-balanced through the ZIS management console. This is done by selecting all of the servers that are online and clicking the link to re-balance the farm. Within a few minutes the ZIServer farm will re-balance itself (it needs to gracefully move threads that may be active at that time).
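For the database step above, the manual failover performed in SQL Server Management Studio corresponds to a single T-SQL statement issued on whichever server is currently acting as principal. The sketch below (Python with pyodbc, using the database and server names from our lab) is illustrative only.

```python
import pyodbc

# Sketch of manually failing a mirrored database back to its normal principal server.
# ALTER DATABASE ... SET PARTNER FAILOVER is issued on whichever server is currently
# acting as principal (for example HARPO, after GROUCHO has been brought back online
# and the mirror has resynchronized).
conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 10.0};SERVER=HARPO;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
conn.cursor().execute("ALTER DATABASE ZIS SET PARTNER FAILOVER")
conn.close()
```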
Test 2.5 — Combining the Availability and Scalability Tests
In this test, we wanted to see how the server cluster would perform in "wounded mode"; that is, can it sustain a normal load while running in failover mode? To do this, we combined parts of the first two tests as follows:
- Start with the ZIServer farm running in its normal state (with all four servers running)
- Start feeding it messages from the agent machines (using ZIAgent scripts) until the traffic reaches at least the average "normal" level of 25 messages/second (a little over 2 million messages per day)
- Start the recoverability test (have a request/response started from one machine)
- Shut down one of the web servers and have the other web server failover (receive all the SIF agent push threads)
- Watch the performance of the normal 25 message/second traffic to see if the single server is able to accommodate it
- "Push" the limits of the single server to see if it can handle more traffic (perhaps 40 messages/second)
In summary, the test looked like this:
The three called-out points in this graph are:
- This is where we shut down the web server CHICO
- This is where the re-balancing to GUMMO was complete (norma