Implementation Plan for Alteon Layer 4 Switch:
Site WebServers


Introduction

The central webservers are designated as high availability, 24x7 systems by the computing division. The current webservers consist of two Sun Ultra 10 servers, www0 and expwww0. Www0 serves the main Fermilab webserver -- www.fnal.gov. Expwww0 serves as the experiments' webserver. Between the two webservers, there are approximately 42 sites hosted.

The content served by the two webservers is located in AFS space, a distributed global file system. Care has been taken to ensure that each webserver has a minimal environment, obtaining most of the information from the shared AFS space.

Current Failover Scenario

Since all web content is located in shared AFS space, either webserver can take over for the other. However, this is a manual process, including, but not limited to, configuring IP aliases and starting additional webservers on the failover machine.

This failover works well for scheduled outages. However, for unscheduled outages, there may be a significant delay in failing over, as the appropriate people need to be contacted in order for the failover to take effect. This is especially evident during off-hours.

Furthermore, the current failover does not solve the problem of an overloaded webserver. In some cases, major announcements have been known to cause the main webserver, www0, to become overloaded, causing some web requests to fail. There is currently no failover/overflow solution to this problem.

Increasing Availability

To increase the availability of the webservers, OSS/DCG will deploy an Alteon ACEdirector3 Layer 4 load-balancing switch. This switch will act as one or more virtual webservers in front of the real webservers. As connection requests come in, the Alteon will receive each request and direct it to an appropriate webserver based on the load-balancing or failover criteria for that virtual webserver.

Before the switch, only one of the two webservers could host a site at a time due to restrictions with IP addresses, etc. The switch alleviates this problem. For example, once deployed, the switch is configured to answer for the web site www.sdss.org, and redirects incoming traffic using the leastconns (note 1) criterion. The two real webservers, www0 and expwww0, are configured to host the contents of www.sdss.org. In DNS, they will be called www0.sdss.org and expwww0.sdss.org.

The switch will accept connections for www.sdss.org and determine which webserver is least busy based on the leastconns (note 1) criterion. Once the webserver is selected by the switch, the traffic is directed to that webserver. This ensures that there will be no overload condition unless both webservers are overloaded, maximizing system utilization. The switch also performs health checking. Should one of the two webservers crash, the switch will redirect all new incoming requests to the webserver which is still functional, performing the failover operation automatically in less than 4 seconds. Likewise, failback will be handled automatically by the switch once health checking has determined that the downed webserver is back in service.
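The selection and failover behavior described above can be illustrated with a small simulation. This is only a sketch of the general least-connections idea, not the Alteon's actual implementation; the RealServer class and the pick_server and route_request functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RealServer:
    """Hypothetical model of one real webserver behind the switch."""
    name: str
    active_connections: int = 0
    healthy: bool = True


def pick_server(servers):
    """Return the healthy server with the fewest active connections, or None."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        return None
    # min() returns the first server in list order when counts are tied.
    return min(candidates, key=lambda s: s.active_connections)


def route_request(servers):
    """Assign one new connection and return the chosen server's name."""
    server = pick_server(servers)
    if server is None:
        raise RuntimeError("no healthy real servers available")
    server.active_connections += 1
    return server.name


servers = [RealServer("www0"), RealServer("expwww0")]

first = route_request(servers)     # both idle, tie goes to www0
second = route_request(servers)    # expwww0 now has fewer connections

servers[0].healthy = False         # simulate www0 failing its health check
during_outage = [route_request(servers) for _ in range(3)]

servers[0].healthy = True          # health check passes again: failback
after_failback = route_request(servers)
```

During the simulated outage every new request goes to expwww0; once www0 is marked healthy again it is chosen first because it holds fewer connections.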

An additional benefit of the switch is wider maintenance windows. Since failover is now trivial, maintenance downtime for each server can be scheduled at any time, including normal business hours, when both in-house and vendor-supplied system support are most available. This could potentially mean dropping 24x7 vendor support on the webservers, reducing the cost of maintenance (note 2). System patches, which usually require system reboots, can be installed in a timely manner rather than waiting for a scheduled outage.

Deployment and Migration

The following outlines the deployment of the Alteon switch and the migration of the current webservers over to the new switched fabric.

Install Alteon switch
  The switch will be installed in FRR3 and configured with IP address 131.225.70.254. At this point the switch will be tested for proper connectivity and routing to other nets. The switch will also be configured to allow Direct Client Access to the two webservers.
Connect webservers
  1. Schedule a brief outage to move the network connections on www0 and expwww0 to the new switch.
Migrate each website
  1. Submit a Datacomm request for two new IP addresses for the website, along with two new DNS names. For example, for the site www.sdss.org, a request would be made for www0.sdss.org and expwww0.sdss.org.
  2. Configure switch with new-website-address and leastconns (note 1) balancing scheme. In this example, the switch would be configured with expwww0.sdss.org, and would redirect requests to the two real servers www.sdss.org and www0.sdss.org (note 3).
  3. Configure second webserver with the other IP address for the website. In this example, www0 would be configured to host www0.sdss.org.
  4. Test switch configuration using new-website-address. Determine that load balancing is taking place. In this example, requests should be made to expwww0.sdss.org (the address of the switch).
  5. Once load-balancing scheme is deemed to be both functioning properly and acceptable, set up all monitoring/health checking scripts for website (if necessary).
  6. Test failover/failback scenarios via monitoring scripts.
  7. Test website specific utilities: auto configure, auto copy, etc.
  8. Make DNS changes to turn website into the virtual website on the switch. In this example, the DNS entries for expwww0.sdss.org and www.sdss.org will be swapped. Since the switch configuration is based on IP addresses and not system names, no changes should be necessary on the switch.
Backup switch config
  The switch configuration will be backed up in accordance with the DCG scheme for switch backups.
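
The DNS swap at the end of the website migration steps can be modeled as follows. This is a sketch under stated assumptions: the plan does not give the website addresses, so the three address labels are hypothetical placeholders, and resolve_path and swap_dns are helper names introduced here for illustration only.

```python
# Hypothetical placeholder labels; the real addresses are not given in the plan.
EXISTING_ADDR = "addr-existing"    # current www.sdss.org address, hosted on expwww0
NEW_ADDR_REAL = "addr-new-1"       # new address hosted on www0
NEW_ADDR_VIRTUAL = "addr-new-2"    # new address answered by the switch

# DNS before the final migration step.
dns = {
    "www.sdss.org": EXISTING_ADDR,
    "www0.sdss.org": NEW_ADDR_REAL,
    "expwww0.sdss.org": NEW_ADDR_VIRTUAL,
}

# The switch is keyed by address, not by name, so it never changes.
switch_config = {NEW_ADDR_VIRTUAL: [EXISTING_ADDR, NEW_ADDR_REAL]}


def resolve_path(name, dns, switch_config):
    """Describe how a request for `name` reaches the real servers."""
    addr = dns[name]
    if addr in switch_config:
        return ("balanced", switch_config[addr])
    return ("direct", [addr])


def swap_dns(dns, a, b):
    """Return a copy of the DNS table with the entries for a and b exchanged."""
    swapped = dict(dns)
    swapped[a], swapped[b] = dns[b], dns[a]
    return swapped


before = resolve_path("www.sdss.org", dns, switch_config)
dns_after = swap_dns(dns, "www.sdss.org", "expwww0.sdss.org")
after = resolve_path("www.sdss.org", dns_after, switch_config)
real_after = resolve_path("expwww0.sdss.org", dns_after, switch_config)
```

Before the swap, www.sdss.org reaches expwww0 directly; afterwards it resolves to the virtual address and is balanced across both real servers, while the switch configuration is left untouched.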

More information on the web servers to be moved can be found here.

Upgradability and Expansion

The Alteon ACEdirector3 Layer 4 load-balancing switch has 8 100Mbps ports and a single 1000Mbps uplink. An additional 6 servers can be added directly to the switch to increase web serving capability. Should additional servers be needed, layer-2 switches can be connected to the Alteon switch. The webservers would then be connected to the layer-2 switches. Maximum throughput accessing AFS space over a 100Mbps link is ~21.6Mbps (2.7MB/s). Therefore no more than 4 webservers (each an AFS client) should be connected per 100Mbps uplink, for a maximum of 32 webservers across the 8 ports.
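
The capacity figures above follow from simple arithmetic, reproduced here as a check. The constants come from the paragraph above; the variable names are illustrative, and the calculation assumes all 8 ports are given over to layer-2 switches.

```python
import math

AFS_THROUGHPUT_MBYTES_PER_S = 2.7   # measured AFS throughput per webserver
PORT_SPEED_MBPS = 100               # speed of each Alteon port
PORT_COUNT = 8                      # 100Mbps ports on the switch

# Convert megabytes per second to megabits per second.
afs_mbps = AFS_THROUGHPUT_MBYTES_PER_S * 8

# Whole webservers that fit on one 100Mbps link without saturating it.
servers_per_port = math.floor(PORT_SPEED_MBPS / afs_mbps)

# Upper bound when every port feeds a layer-2 switch.
max_webservers = servers_per_port * PORT_COUNT

print(round(afs_mbps, 1), servers_per_port, max_webservers)
```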

Should expansion go beyond 32 webservers, other options remain available, including using the second Alteon switch or configuring the Alteon in a one-armed balanced mode with proxy-ip.

Maintenance and Service

The Alteon switch will be maintained in a joint effort between DCG and OSS. 24x7 software support is provided by the vendor: Nortel Networks. In addition, there is next-day shipping for defective hardware.


Notes:
1 The switch performs a real-time check to determine which system has the fewest network connections and redirects traffic to that system. This is considered the best self-governing algorithm for most network traffic.
2 More research must be done before concluding that 24x7 vendor support will no longer be necessary.
3 Using this migration plan, production traffic to each website should remain operational since direct client access has been allowed. In the example above, requests to www.sdss.org will flow through the switch unbalanced, directly connecting to expwww0. Only after the DNS change in step 8 has completed will traffic to www.sdss.org be balanced. However, clients which may have cached the old www.sdss.org address will still be able to access the website, since that access would be by IP address through direct client access.


For comments, questions, please schedule a meeting =)