
The content served by the two webservers is located in AFS space, a distributed global file system. Care has been taken to ensure that each webserver has a minimal environment, obtaining most of the information from the shared AFS space.
Current Failover Scenario
Since all web content is located in shared AFS space,
either webserver can take over for the other. However, this
is a manual process, including, but not limited to, configuring
IP aliases and starting additional webservers on the failover
machine.
This failover works well for scheduled outages. For unscheduled outages, however, failover may be significantly delayed because the appropriate people must be contacted before it can take effect. The delay is most evident during off-hours.
Furthermore, the current failover does not address the problem of an overloaded webserver. In some cases, major announcements have caused the main webserver, www0, to become overloaded, causing some web requests to fail. There is currently no failover or overflow solution to this problem.
Increasing Availability
To increase the availability of the webserver, OSS/DCG will deploy an
Alteon ACEdirector3 Layer 4 load-balancing switch. This
switch will act as one or more virtual webservers in front of the real
webservers. As connection requests come in, the Alteon will
receive each request and direct it to an appropriate
webserver based on the load-balancing or failover criteria for that
virtual webserver.
Before the switch, only one of the two webservers could host a site at a time because of IP address restrictions and similar constraints. The switch removes this limitation. For example, once deployed, the switch is configured to answer for the web site www.sdss.org and to redirect incoming traffic using the leastconns criterion (see note 1). The two real webservers, www0 and expwww0, are both configured to host the contents of www.sdss.org. In DNS, they will be called www0.sdss.org and expwww0.sdss.org.
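The leastconns selection described here and in note 1 can be illustrated with a minimal sketch. This is not the Alteon's implementation; the `RealServer` class, the `dispatch` helper, and the connection counts are hypothetical and exist only to show the idea of sending each new connection to the server with the fewest active connections.

```python
from dataclasses import dataclass


@dataclass
class RealServer:
    """One real webserver behind the virtual server (illustrative model)."""
    name: str
    active_connections: int = 0


def pick_least_connections(servers):
    """Return the server with the fewest active connections.

    Ties go to the first server in the list, because min() keeps the
    first minimal element it encounters.
    """
    if not servers:
        raise ValueError("no real servers configured")
    return min(servers, key=lambda s: s.active_connections)


def dispatch(servers):
    """Choose a server for a new connection and count that connection."""
    chosen = pick_least_connections(servers)
    chosen.active_connections += 1
    return chosen


# Hypothetical starting load; the numbers are made up for the example.
pool = [
    RealServer("www0.sdss.org", active_connections=12),
    RealServer("expwww0.sdss.org", active_connections=7),
]
first = dispatch(pool)  # goes to expwww0.sdss.org, the less loaded server
```

A real switch also decrements the count when a connection closes; that bookkeeping is omitted to keep the sketch short.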

The switch will accept connections for www.sdss.org and determine which webserver is least busy based on the leastconns criterion (see note 1). Once the switch has selected a webserver, the traffic is directed to it. This ensures that no overload condition occurs unless both webservers are overloaded, which maximizes system utilization. The switch also performs health checking. Should one of the two webservers crash, the switch will redirect all new incoming requests to the webserver that is still functional, performing the failover automatically in less than 4 seconds. Likewise, failback is handled automatically by the switch once health checking has determined that the downed webserver is back in service.
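The automatic failover and failback behavior can be sketched as a small state tracker driven by periodic health probes. This is an illustration under stated assumptions, not the switch's actual logic: the `HealthTracker` class and both thresholds are hypothetical, and the switch's real probe interval and retry counts are not given in this document.

```python
class HealthTracker:
    """Track one real server's health from periodic probe results."""

    def __init__(self, fail_threshold=2, recover_threshold=2):
        # Both thresholds are hypothetical values chosen for illustration.
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.in_service = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok):
        """Feed one probe result; return whether the server is in service."""
        if probe_ok == self.in_service:
            # The probe agrees with the current state, so reset the streak.
            self._streak = 0
            return self.in_service
        self._streak += 1
        limit = self.recover_threshold if probe_ok else self.fail_threshold
        if self._streak >= limit:
            # Enough consecutive contrary probes: fail over or fail back.
            self.in_service = probe_ok
            self._streak = 0
        return self.in_service


def eligible(servers, trackers):
    """Names of servers that may receive new connections."""
    return [name for name in servers if trackers[name].in_service]


servers = ["www0", "expwww0"]
trackers = {name: HealthTracker() for name in servers}

for ok in (False, False):          # www0 fails two probes in a row
    trackers["www0"].record(ok)
after_failure = eligible(servers, trackers)

for ok in (True, True):            # www0 answers two probes in a row
    trackers["www0"].record(ok)
after_recovery = eligible(servers, trackers)
```

Requiring several consecutive results before changing state avoids flapping on a single lost probe. With these assumed thresholds, detection time is roughly the probe interval multiplied by the threshold.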
An additional benefit of using the switch is wider maintenance windows. Since failover is now trivial, maintenance downtime for each server can be scheduled at any time, including business hours. This could potentially mean dropping 24x7 vendor support on the webservers, reducing the cost of maintenance (see note 2). System maintenance can now be performed during normal business hours, when system support, both in-house and vendor-supplied, is most available. System patches, which usually require system reboots, can be installed in a timely manner rather than waiting for a scheduled outage.
Deployment and Migration
The following outlines the deployment of the Alteon switch and
the migration of the current webservers over to the new switched fabric.
| Install Alteon switch | The switch will be installed in FRR3 and configured with IP address 131.225.70.254. At this point the switch will be tested for proper connectivity and routing to other networks. The switch will also be configured to allow Direct Client Access to the two webservers. |
| Connect webservers |
|
| Migrate each website |
|
| Backup switch config | The switch configuration will be backed up in accordance with the DCG scheme for switch backups. |
More information on the web servers to be moved can be found here.
Upgradability and Expansion
The Alteon ACEdirector3 Layer 4 load-balancing switch has 8 100Mbps
ports and a single 1000Mbps uplink. An additional 6 servers can be added directly
to the switch to increase web serving capability. Should additional servers
be needed, layer-2 switches can be connected to the Alteon switch. The webservers
would then be connected to the layer-2 switches. Maximum throughput accessing AFS
space over a 100Mbps link is ~21.6Mbps (2.7MB/s) per webserver, so four webservers use about 86.4Mbps. Therefore no more than 4 webservers
should be connected per 100Mbps uplink, for a maximum of 32 webservers (8 ports x 4).
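The capacity estimate above can be checked with a few lines of arithmetic. The figures come from the text; the variable names are illustrative, and the sketch assumes every webserver draws its full measured AFS throughput at the same time, which is a worst case.

```python
import math

PORT_MBPS = 100            # speed of each Alteon server port
UPLINK_MBPS = 1000         # speed of the single uplink
PORTS = 8                  # number of 100Mbps ports on the switch
AFS_MB_PER_S = 2.7         # measured AFS throughput per webserver, in MB/s

# Convert megabytes per second to megabits per second (8 bits per byte).
afs_mbps = AFS_MB_PER_S * 8

# How many webservers fit behind one 100Mbps port without saturating it.
servers_per_port = math.floor(PORT_MBPS / afs_mbps)

# Load on one port and on the uplink when every webserver is at full rate.
port_load_mbps = servers_per_port * afs_mbps
max_webservers = servers_per_port * PORTS
total_load_mbps = max_webservers * afs_mbps

print(f"{afs_mbps:.1f} Mbps per webserver")
print(f"{servers_per_port} webservers per port ({port_load_mbps:.1f} Mbps)")
print(f"{max_webservers} webservers in total ({total_load_mbps:.1f} Mbps)")
```

Under these assumptions the worst-case aggregate of 32 webservers also stays below the 1000Mbps uplink, so the per-port limit is the binding constraint.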
Should the number of webservers grow beyond 32, other options remain available, including using the second Alteon switch or configuring the Alteon in a one-armed balanced mode with proxy-ip.
Maintenance and Service
The Alteon switch will be maintained in a joint effort between
DCG and OSS. 24x7 software support is provided by the vendor,
Nortel Networks. In addition,
there is next-day shipping for defective hardware.
Notes:
| 1 | The switch performs a real-time check to determine which system has the fewest network connections and redirects traffic to that system. This is generally considered the best self-governing algorithm for most network traffic. |
| 2 | More research must be done before concluding that 24x7 vendor support will no longer be necessary. |
| 3 | Using this migration plan, production traffic to each website should remain operational, since direct client access has been allowed. In the example above, requests to www.sdss.org will flow through the switch unbalanced, connecting directly to expwww0. Only after step 7 has completed will traffic to www.sdss.org be balanced. However, clients that have cached the old www.sdss.org address will still be able to access the website, since that access would be through IP and direct client access. |