SIF architecture choices and performance impacts
Over the past several months, we've heard many people discuss various SIF architecture choices and the relative merits of each, especially in terms of performance. So, over the last several weeks, we set out to compare several of these architectures in a controlled environment with a relatively large data set, with the intent of simulating a heavy load such as would be seen in an Australian state or a UK Regional Broadband Consortium.
The data set
The data set we tested with uses the Australia 1.1 specification and four types of objects from it. The main objectives were to test for large amounts of data (as you would find with a few very large messages) and, more importantly, to investigate the problems you would see with many, many small messages. The data set we used had the following numbers of records in it:
- LEAInfo (16 records) — we used these in an experiment to split the population into 16 sections
- SchoolInfo (100 records) — we used these to split the population into 100 divisions
- StudentPersonal (300,000 records) — our test data had very large schools
- StudentSchoolEnrollment (330,000 records) — 30,000 of the students were enrolled in a second school; no students were enrolled in more than two schools
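The record counts above can be summarized in a short sketch; the object counts are taken from the list, and the per-school average is illustrative arithmetic only:

```python
# Object counts from the test data set described above.
dataset = {
    "LEAInfo": 16,
    "SchoolInfo": 100,
    "StudentPersonal": 300_000,
    "StudentSchoolEnrollment": 330_000,  # 30,000 students enrolled in a second school
}

total_objects = sum(dataset.values())
print(total_objects)  # 630,116 objects in total

# With 300,000 students across 100 schools, the average school is very large.
print(dataset["StudentPersonal"] // dataset["SchoolInfo"])  # 3,000 students per school
```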
The tests performed were the following:
- Single Zone, one provider, one subscriber, all data is passed from one side to the other by sending events
- Single Zone, one provider, one subscriber, all data is passed from one side to the other by having the subscriber make a request for the data
- Using 16 Virtual Zones, each with a provider, having 16 subscribers make requests simultaneously
- Using 100 Virtual Zones, each with a provider, having 100 subscribers make requests simultaneously
- Using a single zone, but using a separate subscribing agent for each SIF object, assuming that the application at the subscribing end could orchestrate the combined inputs of the multiple SIF agents
The test lab setup
The following diagram depicts the layout of the test lab we used for these experiments. All of the servers have RAID disks and run Windows Server 2008 R2, except for the laptop, which runs Windows 7. All databases are located on either the HARPO or GROUCHO machine and are cross-mirrored on the other database server. The ZEPPO machine is a NAS device used to regularly back up the databases during these tests.
The GUMMO server held the ZIS. In the first test it was moderately busy, but the server was never really taxed. In the second and following tests, it acted as one of the subscriber servers. In tests 3 and 4, the CHICO machine was the Envoy server. In order to "digest" all of the output Envoy was producing without queues forming anywhere, we needed to enlist the other servers to keep up with all of the messages being produced. If we had put all of the agent connections on a single machine (as in the fourth test), memory constraints would have prevented us from processing all of them at the same time. In a production environment, the subscriber application would likely be a "farmed" application, split across several machines as well (similar to what we have here).
Test 1 — Single Zone, publisher sends events
In this test, we had the Provider SIF agent register in a single zone with a single subscriber application as described in the following diagram:
The Provider agent ran a script that published events for all LEAInfo, SchoolInfo, StudentPersonal and StudentSchoolEnrollment objects (all 630,116 of them). As expected, this took a very long time to complete. During the middle of the process, this snapshot was taken from the ZIServer's message queue:
If you look at the timestamps, you can see that messages were sent through the ZIS at a rate of about 5–6 per second. Considering that we were using modest computers for these tests, these numbers may look slightly better in a real production environment. Even at the high side of this average, though, the messages took about 30 hours to be transferred from the provider to the subscriber. During the test, no bottleneck occurred in the ZIServer or in either of the two agents. The following screen shows the activity from the ZIServer machine:
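The 30-hour figure follows directly from the observed message rate; a quick back-of-the-envelope check, using the 5–6 messages/second observed in the queue timestamps:

```python
# Transfer time for event-based delivery: one SIF event message per object.
total_messages = 630_116  # one event per object in the data set

for rate in (5, 6):  # messages per second, from the ZIS queue timestamps
    hours = total_messages / rate / 3600
    print(f"{rate}/s -> {hours:.1f} hours")  # roughly 35 and 29 hours
```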
The other two machines' resource monitors looked very similar. Much of the time was spent in the "calls": messages being delivered, acknowledgements being sent back, remote procedure calls, database call overhead, mirroring costs, and the other parts of the protocol that are simply the "cost of doing business".
Test 2 — Single Zone, subscriber requests data
The results of this test will vary more with the actual size of the data (especially the StudentPersonal objects), because the number of objects that fit in a single response message depends on the amount of information in the average object. DISCLAIMER: our StudentPersonal objects did not include very many fields, so we saw a very large improvement in speed; a mix of fuller objects and of Add, Change, and Delete events would be more typical of what normal users would see, but we shouldn't be too far off with this anyway. In this database sample, we see the records after they've been inserted into the database by the subscriber SIF agent. We sampled about 40 minutes' worth of activity and averaged over that period to calculate the numbers below the screen capture:
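The reason request/response is so much faster than events comes down to message counts: events deliver one object per message, while responses pack many objects into each packet. The objects-per-packet figure below is a hypothetical value for illustration only; the real number depends on object size and the requester's SIF_MaxBufferSize:

```python
import math

# Events: one message per object.
total_objects = 630_116

# Request/response: many objects fit in each response packet.
objects_per_packet = 500  # assumption for illustration, not a measured value
response_packets = math.ceil(total_objects / objects_per_packet)

print(response_packets)  # 1,261 response messages instead of 630,116 events
```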
Differences in server loads between Tests 1 and 2
During Test 2, the loads on the servers were characteristically very different from those observed during the first test. Most servers were far less active than during the first test, with the exception of the subscriber computer, which was more active but for a shorter period of time.
The Provider IIS Server
- This machine was more compute-bound
- Performed less network I/O (it sent many fewer messages)
- Performed fewer database INSERT operations (related to the SIF messages being produced)
- Performed fewer SELECT operations (it was able to read the configuration and mapping information once for many generated messages)
The ZIS IIS Server
- This machine was much less active because it received far fewer messages
- Performed far less network I/O
- Performed fewer database operations
The Subscriber IIS Server
- This machine had a different mix: it was far more compute-bound, spending more time processing the objects in messages and less time processing the SIF protocol
- Performed less network I/O (received far fewer messages)
- Performed more database operations/second because it was not waiting for network I/O — getting these database operations completed more quickly allowed it to complete its job faster
The Database Server
- The database server, in general, was far busier for a shorter amount of time
The following resource monitor was from the busiest of the systems (the Subscriber IIS Server):
Even with all the work it had to do, you can see that it was averaging only 4% CPU utilization, with disk and network utilization too low to register, and only a small portion of the server's memory in use.
Test 3 — Zone splitting using 16 zones
In this test and in Test 4, we used the Envoy Virtual zone manager, although similar results could be derived using multi-zone publishing agents. The main differences between using Envoy and the multi-zone agent approach (for the purposes of these tests) are as follows:
- Envoy comes as commercial off-the-shelf software that can add the functionality to existing environments without requiring that any existing agents be modified
- Envoy allows additional flexibility in that Virtual Zones may be created that contain any combination of schools (schools may also appear in more than one zone if needed)
- Envoy allows users to set up data quality checking rules that will prevent data from being forwarded if it doesn't meet certain basic data quality requirements
As far as the subscribing application is concerned:
- It sees either one zone per region or one zone per school and registers as a subscriber in each of them
- It makes requests for any of the SIF objects it needs in any of the zones and expects to receive all of the objects that pertain to that region or school
- It receives data on multiple connections concurrently, up to the limit imposed by the operating system
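The concurrent, per-zone behavior described above can be sketched with a thread pool, one worker per zone. The zone names and the fetch function here are hypothetical stand-ins for a real SIF agent's request/response handling:

```python
from concurrent.futures import ThreadPoolExecutor

# 16 virtual zones, as in Test 3; names are illustrative.
zones = [f"SchoolZone{n:02d}" for n in range(1, 17)]

def request_objects(zone: str) -> tuple[str, int]:
    # A real agent would send a SIF_Request here and page through the
    # SIF_Response packets; we return a placeholder object count instead.
    return zone, 0

# Issue the requests for all zones concurrently, one connection each,
# up to the limits the OS and thread pool impose.
with ThreadPoolExecutor(max_workers=len(zones)) as pool:
    results = dict(pool.map(request_objects, zones))

print(len(results))  # 16 — one result set per zone
```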
The goal of this test is to see how long it takes to transfer the same amount of data between a multi-zone publisher (in this test, Envoy) and a multi-zone subscriber, compared to the time taken in the previous two tests.
In this diagram, the orange circles represent zones configured on a single ZIServer. The AUCZ zone is the consolidated zone where Envoy receives its data from the original test provider (data for all 100 schools). The 100 schools were assigned randomly to the 16 zones.
Change in data set size (increase)
Because of the way the student population was set up, some records needed to be duplicated when the zones were split. Normally, a logical grouping (such as grouping by LEA) would be used to split records, so that most (or all) students would never have their records split between two zones. When a student is enrolled in multiple locations, the StudentPersonal record needs to be published into both zones, whereas each StudentSchoolEnrollment record is published only into the appropriate zone (the one indicated by the SchoolInfoRefId element inside the StudentSchoolEnrollment object). For this test and the next, this means that Envoy needed to publish an extra 20,000 or so "redundant" StudentPersonal objects, so that every zone could retain its independence and no StudentSchoolEnrollment record would be left without a matching StudentPersonal record in the same zone.
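The splitting rule above can be expressed compactly: publish a StudentPersonal into every zone that contains one of the student's schools, and each StudentSchoolEnrollment only into the zone of the school it references. The tiny data set here is illustrative, not the test data:

```python
# Hypothetical school-to-zone assignment and enrollment list.
school_zone = {"S1": "ZoneA", "S2": "ZoneB", "S3": "ZoneA"}
enrollments = [("stu1", "S1"), ("stu1", "S2"),   # dual-enrolled across two zones
               ("stu2", "S1"), ("stu2", "S3")]   # dual-enrolled within one zone

# Each StudentPersonal goes to every zone containing one of the student's schools.
personal_targets: dict[str, set[str]] = {}
for student, school in enrollments:
    personal_targets.setdefault(student, set()).add(school_zone[school])

# Students whose schools span zones need redundant StudentPersonal copies.
extra_copies = sum(len(zones) - 1 for zones in personal_targets.values())
print(extra_copies)  # 1 — only stu1 spans two zones and needs a redundant copy
```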
The Test Activity
This screen capture shows the activity level of the Envoy server (CHICO) (the provider of the data to the 16 zones) during this test. The level of activity stayed fairly constant throughout the test.
Notice that the CPU usage was relatively small (11%), but that the ZIAgent process (the engine for Envoy) was running 142 threads at the time the screen shot was taken. The number of threads varied throughout the test but remained fairly large. Envoy served as a proxy provider, relaying the original consolidated provider's data to each of the 16 zones. The following screen shot illustrates the activity on Envoy's SQL Server. As is typical of SQL Server activity, the number of batch transactions per second varied widely (1,495/second when this capture was taken). The four graphs at the top of the screen, however, show an average over a longer period of time.
As this screen illustrates, the single database (which was being mirrored by another SQL Server) had plenty of power for this test.
Test 4 — Zone splitting using 100 zones
In this test, we knew that a single subscribing machine would not have enough power to handle all 100 zones, so we split the load between several machines that would all operate as subscribers and would all be connected to the same LAN as the other machines. This test was similar to the previous test, except that unbounded request messages were sent from each of the 100 school zones to the Envoy provider all at the same time.
The following screen shows part of the ZIS audit log. As you can see in the log, messages are being distributed to subscribing agents in the various school zones. These messages were Responses to "unbounded" SIF_Requests that asked for all objects for the given school. Earlier requests had asked for all StudentPersonal objects.
Differences in server loads between Tests 3 and 4
As when comparing the first two tests, we saw some marked differences in server loads between these two tests as well.
The Provider IIS Server (Envoy Server)
- This machine was more compute-bound (many more threads executing concurrently)
- Much more memory was required by the additional zones (new threads are required to handle each new request received in each of the new zones)
- More network I/O was generated, but still a relatively small amount compared to generating events
- As with the previous two tests, the amount of traffic sent was not significant
- No messages were queued at the ZIS at any point during the test
The Subscriber IIS Server
- These servers needed to accept, parse, and process the received messages, which required multiple threads and a significant amount of memory so that each thread could process an incoming message concurrently
- We solved this problem by using several servers, but it could also have been handled by fewer machines (or even one) with more memory
The Application Database Server
- This server remained very busy throughout the time period, much more than with the other three tests (the same number of SELECTs and INSERTs would need to be done, but in a shorter period of time)
The following screen capture shows the application database server's performance monitor. Relative to the other monitor shots, it shows most of its memory being used and a much higher level of activity.
Test 5 — Using a Single Zone with multiple subscribers, one for each object
This architecture, although very unusual, is legal under all current SIF specifications.
It uses a single provider (as in the first test), but has multiple subscribers that each subscribe to a single SIF object. In our test, we used the same four SIF objects and the same population sizes as in the other tests:
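The layout in this test can be sketched as one subscriber agent per object type, all registered in the same zone. The agent names here are hypothetical, and a downstream application would have to orchestrate the combined outputs itself:

```python
# The four SIF object types used throughout these tests.
object_types = ["LEAInfo", "SchoolInfo",
                "StudentPersonal", "StudentSchoolEnrollment"]

# One subscriber agent per object type, all in a single zone.
subscribers = {obj: f"{obj}SubscriberAgent" for obj in object_types}

for obj, agent in subscribers.items():
    print(f"{agent} subscribes to {obj} in zone 'TestZone'")
```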
In this test, we created a script that would begin publishing all four objects (roughly) at the same time into the same zone. After the first few seconds, the LEAInfo and SchoolInfo records made it to their respective agents. The test then continued for the next several hours for the other two objects. This screen shot of the ZIServer (Zone Integration Server) audit log was taken a few minutes into the test:
The following screen capture illustrates that, even though a considerable amount of traffic was moving through the network, the bottleneck was still clearly not on the agent machines, the database machine, or the ZIServer machine. Each of the servers showed only a small queue during the test; as we looked more closely at detailed traces, we saw small waits distributed throughout the system in the places where we would have expected them (similar results to Test 1 above).