Sizing your infrastructure before launch
12 Mar 2008
So you have a web app. How do you decide how many servers to deploy? Even if you are still in development and don’t have a single outside user, you can make an informed decision on how big to build and what your future network infrastructure will look like.
By gathering some data and doing a little load testing you can launch a new application confident that you know how many users it will support.
I will outline the process you can use to size your infrastructure. I’ll be discussing it in the context of a web-based application but these methods can be applied to other types of applications. At my last client, Avvenu, half the network communication was not HTTP based and I used these methods to scale it regardless.
At the end of this process you’ll have a spreadsheet where you’ll be able to
plug in arbitrary numbers and get out the scaling information you need. If
bizdev asks “what happens if we close this deal and double our user base?” or
if engineering finds a way to increase server performance by 100% you’ll be
able to quickly answer what the impact on your network would be.
Understanding your usage
The first step in building our scaling model is to understand how your users use the system. There is a series of questions that you’ll need to answer to get an idea of what that usage looks like.
First you’ll need to know how many active users to expect in the future. This data often comes from your marketing department.
The data is usually presented something like - in one month we’ll have X active users, in two months we’ll have Y, in three months we’ll have Z. You’ll need all these for your scaling spreadsheet.
Next you’ll need to find out how the typical user either uses the site (for existing sites) or is expected to use the site (for new sites). You’ll want this data in a given time period, such as per week. Some examples of what you’ll want to know are:
- How many times a week does he visit?
- When he visits what does he do?
  - Downloads a large file?
  - Looks at pages that require a large amount of processing?
    - How many times and which ones?
  - Looks at images that are dynamically created?
  - Looks at static pages?
  - Uploads data?
How much data do you have to maintain per user? This includes files, database rows, or in some applications constantly open connections. This will also have to be accounted for in your scaling model.
For an existing application you’ll be able to mine your access logs. Always keep and archive these logs when at all possible. They come in handy when you need usage pattern data. Throw together some scripts to extract the answers from your access logs.
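As a rough illustration of such a script, here is a minimal sketch in Python. It assumes the logs are in Apache Common or Combined Log Format and that the client address is a good enough stand-in for a user; both are assumptions, and a real site would more likely key on a login or session ID. The name `summarize` and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Assumption: Apache Common/Combined Log Format, with the client address
# (first field) standing in for "a user".
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
)

def summarize(lines):
    """Return (requests per client, bytes sent per client) for the given log lines."""
    requests = Counter()
    bytes_sent = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        client, _timestamp, _method, _path, _status, size = match.groups()
        requests[client] += 1
        if size != "-":  # "-" means no response body was sent
            bytes_sent[client] += int(size)
    return requests, bytes_sent

# Made-up sample lines; in practice, pass an open log file instead.
sample = [
    '10.0.0.1 - - [12/Mar/2008:10:00:00 -0800] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [12/Mar/2008:10:00:05 -0800] "GET /report HTTP/1.1" 200 20480',
    '10.0.0.2 - - [12/Mar/2008:10:01:00 -0800] "GET /index.html HTTP/1.1" 304 -',
]
requests, bytes_sent = summarize(sample)
avg_requests = sum(requests.values()) / len(requests)
print("average requests per client:", avg_requests)
```

Run over a week of logs, the per-client counts give you the "requests per user per week" figure the rest of the model needs.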
For new sites put together a detailed but not overly technical questionnaire for your product manager. The answers from the questionnaire can be used to model typical visitor usage patterns.
One final note on usage patterns. You’ll find that you’ll have some users that
look at a few pages every couple of months, and then some users who integrate
your site into their daily routine. You’ll need to find the /average/ across
all your active users.
Distilling the estimated traffic
Now you know how many users you have and how active each user is, so you can determine how many requests your service will have to handle. Multiply the number of users by the number of operations each user performs in your time period, then divide by the number of seconds in that period (e.g. a week) to find the average number of operations you’ll have to perform per second.
An important note when sizing your bandwidth: file sizes are measured in BYTES and bandwidth in BITS. Multiply all file sizes by 8 to find the number of bits they will be when crossing the network.
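The arithmetic is simple enough to sketch in a few lines of Python. The user count and per-user request count below are hypothetical numbers chosen only to show the calculation.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800

def average_requests_per_second(active_users, requests_per_user,
                                period_seconds=SECONDS_PER_WEEK):
    """Average request rate when each user makes requests_per_user requests per period."""
    return active_users * requests_per_user / period_seconds

# Hypothetical: 100,000 active users making 50 requests each per week.
rate = average_requests_per_second(100_000, 50)
print(round(rate, 2))  # 8.27
```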
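To make the bytes-versus-bits conversion concrete, here is a small Python sketch. The download count and file size are hypothetical, and the calculation ignores protocol overhead (TCP/IP and HTTP headers), so treat the result as a lower bound.

```python
def average_bits_per_second(transfers_per_period, bytes_per_transfer, period_seconds):
    """Average bandwidth in bits per second; payload only, no protocol overhead."""
    total_bits = transfers_per_period * bytes_per_transfer * 8  # bytes -> bits
    return total_bits / period_seconds

# Hypothetical: 1,000,000 downloads per week of a 2,000,000-byte file.
bps = average_bits_per_second(1_000_000, 2_000_000, 7 * 24 * 60 * 60)
print(f"{bps / 1_000_000:.2f} Mbit/s average")
```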
Once you’ve determined what your average user will do you’ll need to automate that behavior for load testing. Typically you’ll set up a load testing cluster, or just test against your pre-production or development environment during off hours. You’ll need to ensure the load-generating machines that run your load testing scripts do not become your bottleneck. In this phase it is very useful to be running server monitoring and graphing software like Nagios and Cacti. Make sure your server graphing captures CPU, disk, memory, network, and process utilization so that you can identify which machines bottleneck and which parts of those machines have to be scaled. Sometimes you’ll think an application should bottleneck on CPU and find it bottlenecks on memory. This helps you make informed purchasing decisions when you buy new machines for your production environment.
You can set up scripts and use tools such as ab (ApacheBench) to throw traffic at your servers and determine the number of operations per second they can handle. You’ll have to try to isolate each class of machine (e.g. DB, HTTP, etc.) and determine its maximum load. With unlimited resources you could load test a single web server to determine its limits, then throw 100 load testers against 100 web servers to find your DB’s load limits. But for most of us this is impractical. So you may have to be clever: profile the database traffic generated by the web server load testing, then create a script to drive simulated load at your DB server directly.
It is important in this step to discover any horizontal scaling issues. If you find that adding new servers does NOT increase your capacity as you expect, then you’ll need to work with your software engineering team to fix the scaling problems, or warn management that there is a likely hard limit of X users the system will support.
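When the scripted behavior is more involved than a single URL, a small home-grown driver can stand in for ab. The following is only a sketch in Python: `measure_throughput` and `fake_request` are hypothetical names, and the stand-in operation just sleeps so the example runs without a network. To test a real server you would swap in a function that issues an actual request (for example with `urllib.request.urlopen`) against your test environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(operation, total_requests, concurrency):
    """Call operation() total_requests times across `concurrency` threads.

    Returns (successful operations per second, number of failures).
    """
    def attempt(_):
        try:
            operation()
            return True
        except Exception:
            return False  # count any exception as a failed request

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, range(total_requests)))
    elapsed = time.perf_counter() - started
    successes = sum(results)
    return successes / elapsed, total_requests - successes

# Stand-in for a real request so the sketch runs anywhere.
def fake_request():
    time.sleep(0.01)

rps, failures = measure_throughput(fake_request, total_requests=50, concurrency=10)
print(f"{rps:.0f} operations/second, {failures} failures")
```

Raise the concurrency until the measured rate stops climbing or failures appear; that plateau is the figure to carry into the model, provided the load generator itself is not the machine that ran out of headroom.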
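One simple way to spot a horizontal scaling problem is to compare measured throughput against perfect linear scaling. The Python sketch below does that; the benchmark figures are invented for illustration, and `scaling_efficiency` is a hypothetical helper name.

```python
def scaling_efficiency(single_server_rps, measured):
    """Map server count -> measured throughput as a fraction of linear scaling.

    measured: dict of {server_count: total requests/second observed}.
    """
    return {
        count: rps / (count * single_server_rps)
        for count, rps in sorted(measured.items())
    }

# Invented results: one server handled 200 requests/second on its own.
efficiency = scaling_efficiency(200, {2: 390, 4: 720, 8: 1100})
for count, fraction in efficiency.items():
    print(f"{count} servers: {fraction:.0%} of linear")
```

An efficiency that keeps falling as servers are added, as in these made-up numbers, suggests a shared bottleneck such as the database rather than the web tier.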
Peak vs. Average usage
You will need to determine the peak usage hour(s) of your service and how these relate to your average usage.
I have found that your peak usage will typically be double your average usage. If you have no other data then go ahead and size for that.
If you are sizing an existing application you already know your ratio of peak
vs. average by looking at your log data.
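If you have hourly request counts from your logs, the ratio falls out directly. This Python sketch uses invented hourly totals for one day; `peak_to_average` is a hypothetical helper name.

```python
def peak_to_average(hourly_requests):
    """Ratio of the busiest hour to the average hour."""
    average = sum(hourly_requests) / len(hourly_requests)
    return max(hourly_requests) / average

# Invented hourly request totals for one day (24 values).
day = [120, 90, 70, 60, 60, 80, 150, 300, 520, 640, 700, 720,
       690, 710, 680, 650, 600, 540, 480, 420, 360, 300, 220, 160]
print(round(peak_to_average(day), 2))
```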
Building the Spreadsheet
$$
\text{total required servers} = \frac{(\text{users} \times \text{usage} \,/\, \text{time-period-in-seconds}) \times \text{peak/avg}}{\text{benchmarked-requests-per-second-per-server}}
$$
Do this for each class of server: web servers, app servers, DB servers, etc. Then make a column for each month of growth. Make your formula round up the number of servers. You can’t deploy 2.3333333 servers, can you?
Often I’ll break this down into the number of active users each server can support. I can then divide the number of projected users by that figure to get the number of required servers.
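The same calculation can be sketched outside a spreadsheet. In the Python below, every input (user projections, requests per user, peak ratio, and benchmarked rate) is a hypothetical placeholder; substitute your own marketing projections and load test results.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def servers_required(users, requests_per_user, period_seconds,
                     peak_to_avg, rps_per_server):
    """Servers needed to carry peak load, rounded up to a whole machine."""
    peak_rps = users * requests_per_user / period_seconds * peak_to_avg
    return math.ceil(peak_rps / rps_per_server)

# Hypothetical growth projections and benchmark results.
projections = {"month 1": 50_000, "month 2": 120_000, "month 3": 300_000}
plan = {
    month: servers_required(users, requests_per_user=200,
                            period_seconds=SECONDS_PER_WEEK,
                            peak_to_avg=2, rps_per_server=20)
    for month, users in projections.items()
}
print(plan)  # {'month 1': 2, 'month 2': 4, 'month 3': 10}
```

Repeat the call with each server class’s own per-user usage and benchmarked rate to fill in one row per class.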
$$
\text{users per server} = \frac{\text{benchmarked-requests-per-second-per-server}}{(\text{per-user-usage} \,/\, \text{time-period-in-seconds}) \times \text{peak/avg}}
$$

$$
\text{total required servers} = \frac{\text{users}}{\text{users-per-server}}
$$
Your total server numbers can drive other parts of the spreadsheet as well. Every so many servers you’ll need a new Ethernet switch, another rack at the colo, and perhaps increased headcount (try to reduce this by automating as much as possible!).
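Here is the users-per-server form as a short Python sketch. The benchmark rate, per-user usage, and peak ratio are hypothetical values picked for illustration.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def users_per_server(rps_per_server, requests_per_user, period_seconds, peak_to_avg):
    """Active users one server can carry at peak load."""
    peak_rps_per_user = requests_per_user / period_seconds * peak_to_avg
    return rps_per_server / peak_rps_per_user

# Hypothetical: 20 requests/second per server, 200 requests per user per week,
# and a peak of twice the average.
capacity = users_per_server(20, 200, SECONDS_PER_WEEK, 2)
print(round(capacity))                  # 30240 users per server
print(math.ceil(300_000 / capacity))    # 10 servers for 300,000 users
```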
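Those knock-on requirements are just more rounded-up divisions. In this Python sketch the figures of 40 servers per switch and 20 servers per rack are assumptions for illustration; use the port counts and power and space limits of your own gear and colo.

```python
import math

def supporting_gear(total_servers, servers_per_switch=40, servers_per_rack=20):
    """Switches and racks implied by a server count (both ratios are assumptions)."""
    return {
        "switches": math.ceil(total_servers / servers_per_switch),
        "racks": math.ceil(total_servers / servers_per_rack),
    }

print(supporting_gear(45))  # {'switches': 2, 'racks': 3}
```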
Make sure your spreadsheet also accounts for the amount of static data you have to maintain per user. For example, how many file servers will you need for the files your users upload? How many users will the disks on your DB server support?
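Storage sizing follows the same pattern as request sizing. The Python sketch below uses hypothetical figures (50 MB stored per user, 2 TB usable per file server) and ignores replication and filesystem overhead, which you would fold into the usable capacity number.

```python
import math

def file_servers_required(users, bytes_per_user, usable_bytes_per_server):
    """File servers needed to hold per-user data, rounded up."""
    return math.ceil(users * bytes_per_user / usable_bytes_per_server)

MB = 10**6
TB = 10**12

# Hypothetical: 300,000 users storing 50 MB each, 2 TB usable per server.
print(file_servers_required(300_000, 50 * MB, 2 * TB))  # 8
```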
Your model should also determine the maximum network traffic at peak times so that you’ll understand when you’ll need to order more bandwidth from your connectivity provider or will need bigger routers and load balancers.
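A sketch of that check in Python, with made-up numbers: each user is assumed to transfer 100 MB per week, peak is assumed to be twice the average, and the committed bandwidth is assumed to be 500 Mbit/s. Protocol overhead is ignored, so leave yourself headroom.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def peak_megabits_per_second(users, bytes_per_user, period_seconds, peak_to_avg):
    """Peak bandwidth in Mbit/s; payload only, no protocol overhead."""
    average_bps = users * bytes_per_user * 8 / period_seconds  # bytes -> bits
    return average_bps * peak_to_avg / 1_000_000

def months_over_capacity(projections, committed_mbps, **usage):
    """Names of the months whose projected peak exceeds the committed bandwidth."""
    return [
        month for month, users in projections.items()
        if peak_megabits_per_second(users, **usage) > committed_mbps
    ]

# Hypothetical projections and usage figures.
projections = {"month 1": 50_000, "month 2": 120_000, "month 3": 300_000}
over = months_over_capacity(
    projections, committed_mbps=500,
    bytes_per_user=100_000_000, period_seconds=SECONDS_PER_WEEK, peak_to_avg=2,
)
print(over)  # ['month 3']
```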
Using this process has allowed me to help size networks for many internet startups and kept my network operations groups from being caught with their pants down. Determining your scalability and using this data to anticipate required infrastructure growth will help you and the rest of your organization have confidence going forward with a growing userbase.