
How to build a successful website #3: Make it scale

laboratorymike · Sep 3, 2018

When building a successful website you have two major hurdles to clear in terms of popularity:

#1: Getting anyone to visit your site

#2: Getting the site to survive being popular enough for you to make a living

Nowadays, in spite of potential censorship, it is relatively easy to get discovered. You pick a popular or alternative social media platform, you run with it, and maybe buy some ads (or boosts) to force your content in front of those eyeballs. But if you do become popular and aren't prepared for it, your site will suddenly become very, very slow, and people will start bailing.

So, if you've been taking some coding classes and you've got a great idea for a site, let's jump into this post and prepare for your million views a month today!

Basic Types of Scalability

There are two types of scalability to consider:

People hitting my site right now

This is the most straightforward issue you will have to deal with: your website sits on a server (or multiple servers), and your setup can handle up to X users (more on that in a moment). Once you hit X, you start slowing down or crashing. At that point, you either need to make your site more efficient so that X is larger on the same hardware, or quickly increase the size or number of servers so that you can handle 2X, 5X, or 100X people.
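
To put a rough number on X, a back-of-envelope estimate like the one below can tell you approximately where your current setup tops out. This is plain Python with made-up figures; measure your own average response time and worker count before trusting the result.

    # Back-of-envelope capacity estimate. The numbers are assumptions, not benchmarks.
    avg_response_s = 0.2   # assume ~200 ms of server time per page load
    n_workers = 8          # assume 8 worker processes serving requests
    peak_rps = n_workers / avg_response_s
    monthly = peak_rps * 60 * 60 * 24 * 30
    print(f"~{peak_rps:.0f} requests/sec at full tilt, ~{monthly:,.0f} requests per month")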

Data building up over time

    If your site records anything from your users, you eventually reach a point where your database holds so much data that even simple operations take a long time, because the database has to read through millions or even billions of items on every page load. Even a low-traffic site gets slow, because every request has to wade through all of that accumulated data. The more you can minimize those reads, the better.
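
    As a sketch of what "minimizing reads" looks like in practice, here is a small Python/SQLite example (the activity table and its columns are hypothetical): an index on the columns you filter and sort by, plus pagination, keeps each page load proportional to what is actually shown rather than to years of accumulated records.

        import sqlite3

        conn = sqlite3.connect("site.db")
        cur = conn.cursor()

        # Hypothetical table of user-submitted records that grows for years.
        cur.execute("CREATE TABLE IF NOT EXISTS activity ("
                    "id INTEGER PRIMARY KEY, user_id INTEGER, "
                    "title TEXT, created_at TEXT)")

        # An index on the filtered/sorted columns avoids scanning every row ever stored.
        cur.execute("CREATE INDEX IF NOT EXISTS idx_activity_user_date "
                    "ON activity (user_id, created_at)")

        # Paginate: read 25 rows per page load instead of the whole history.
        cur.execute("SELECT id, title, created_at FROM activity "
                    "WHERE user_id = ? ORDER BY created_at DESC LIMIT 25", (42,))
        print(cur.fetchall())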

Strategies

Avoid the dancing unicorns

    "Dancing Unicorns" are those gimmicky features that look really cool, but take a lot of resources to implement, and do not add much value to your site. Sometimes it is unnecessary personalization, and sometimes it is just a really cool set of statistics that would look cool on a dashboard, but it doesn't really have to go there. Where possible, avoid adding these if you do not have a reason.

    One real-life case of this occurred on an internationalization project I did for a large corporate website. There were a few cases where people wanted to use some fancy IP location tools to perfectly identify every person coming in, but it involved reaching out to an external server, which on a slow day could add 2 seconds to a page load (slow!). We went instead with a less "expensive" method of checking the browser language, which worked 98% of the time; the remainder fell back to the slow method. Result: 98% of visitors got sub-second page loads from the start.
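
    A minimal sketch of that cheaper check in plain Python (the function name and the supported-language list are mine, not from the original project): parse the browser's Accept-Language header, and only fall back to the slow IP lookup when it gives you nothing usable.

        def pick_language(accept_language_header, supported=("en", "fr", "de", "es")):
            """Cheap language detection from the Accept-Language header.
            Returns None when nothing matches, so the caller can fall back
            to the slower IP-geolocation lookup for that small minority."""
            if not accept_language_header:
                return None
            for part in accept_language_header.split(","):
                lang = part.split(";")[0].strip().lower()  # drop ";q=0.8" weights
                primary = lang.split("-")[0]               # "en-US" -> "en"
                if primary in supported:
                    return primary
            return None

        print(pick_language("en-US,en;q=0.9,fr;q=0.8"))  # -> en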

Add some caching and bake some cookies

    Caching involves pre-computing the results of your code and serving the same result to similar visitors. If you are too aggressive with caching you can serve cached content to the wrong people, or updates may not appear, but once correctly configured, it minimizes the number of "expensive" operations on your site.

    For example, on one of my site projects, users can list science outreach activities and add collaborators to a list. Since the software is for a customer with a 10-year reporting cycle, somewhere around year 5-6 they will have thousands of users, making that list very slow to load. While there are plenty of database tricks to speed up building the list, our main trick was to cache the list and update it when a user is added or removed, or after 24 hours just in case. Done that way, it only takes a few milliseconds to generate the list.
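
    Here is a stripped-down sketch of that pattern in Python (an in-memory dict stands in for whatever cache backend you actually use, and the function names are illustrative): rebuild the list only on a miss, expire it after 24 hours, and clear it whenever a collaborator changes.

        import time

        _cache = {}          # key -> (expires_at, value); a real site would use Redis/Memcached
        TTL = 24 * 60 * 60   # refresh at least once a day, just in case

        def get_collaborators(activity_id, build_list):
            """Return the cached list, rebuilding only when missing or expired."""
            entry = _cache.get(activity_id)
            if entry and entry[0] > time.time():
                return entry[1]                              # hit: a few milliseconds
            value = build_list(activity_id)                  # miss: the expensive query
            _cache[activity_id] = (time.time() + TTL, value)
            return value

        def invalidate(activity_id):
            """Call this whenever a collaborator is added or removed."""
            _cache.pop(activity_id, None)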

    Another kind of caching is to have users store some of their own data in their own browser, via that terrible practice called cookies. Going back to my international site above, every time someone came to the site, no matter how we validated their location, they got a little cookie saying what their country and language are. This could also be changed using one of those flag menus at the top of the site, which on selection would update the cookie. Once the cookie was set, we could bypass all the other logic, so even the 2% of people who used the slower method enjoyed fast pages after the first page load.
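
    A sketch of the cookie half, using only Python's standard library (the cookie name and the "country|language" format are just for illustration): read the cookie if it exists and skip detection entirely; otherwise set it after the first detection or a flag-menu choice.

        from http.cookies import SimpleCookie

        def read_locale_cookie(cookie_header):
            """If the visitor already has a locale cookie, skip all detection logic."""
            cookie = SimpleCookie(cookie_header or "")
            morsel = cookie.get("site_locale")
            return morsel.value if morsel else None

        def build_locale_cookie(country, language, max_age_days=365):
            """Build the Set-Cookie value once the locale is known (or chosen)."""
            cookie = SimpleCookie()
            cookie["site_locale"] = f"{country}|{language}"
            cookie["site_locale"]["path"] = "/"
            cookie["site_locale"]["max-age"] = str(max_age_days * 24 * 60 * 60)
            return cookie["site_locale"].OutputString()

        print(build_locale_cookie("CA", "en"))            # the value for a Set-Cookie header
        print(read_locale_cookie('site_locale="CA|en"'))  # -> CA|en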

    You could sum up caching as DRY: don't repeat yourself. There is no need to do the same intense work over and over if you can save it and re-use it.

Break things up

    Now if you really get into web development, this is where the money is made. With a single server you can only get so far, and you can only make it so big. So at that point, you have no choice but to break it into multiple servers.

    It would be a mistake to make every server the same. Instead, each server needs to perform a particular function, and in the industry we call these layers. For example, the code goes in the application layer, databases go in the database layer, and so on. This way, you can monitor the amount of strain each layer is under and focus on the most strained or expensive layer. Usually it's the database, and this is where tools like Memcached, Redis, and others come in: they split the database load further into data being actively used (i.e. on every page load) versus content that is only needed 5-10% of the time.
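
    The usual pattern for that split is often called "cache-aside": check the caching layer first, and only fall through to the database on a miss. Here is a minimal sketch with the redis-py client (the key names and five-minute TTL are arbitrary choices, and it assumes a Redis server is reachable on localhost):

        import json
        import redis   # pip install redis; assumes a Redis server on localhost:6379

        r = redis.Redis(host="localhost", port=6379)

        def get_profile(user_id, load_from_db):
            """Serve hot data from the cache layer; fall back to the database layer."""
            key = f"profile:{user_id}"
            cached = r.get(key)
            if cached is not None:
                return json.loads(cached)            # hit: no database round trip
            profile = load_from_db(user_id)          # miss: the expensive query
            r.setex(key, 300, json.dumps(profile))   # keep it hot for 5 minutes
            return profile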

    To save you a few months of banging your head on the wall, two pieces of software to check out in this space are HAProxy for load balancing and Percona XtraDB for making a database cluster. I will be writing up some detailed posts for a few tokens on how to do this, because being able to do this will allow you to serve many thousands of people and do millions in business in a month (I've worked on one project where this was the case), and these bits of software truly are the "secret sauce," despite all of them being open source and free to obtain.

Test your site!

    If you want to make sure that your site is going to behave as expected under load, you are going to need to test it. Commercial services like BlazeMeter are available, and in the open source space, Apache JMeter is the established project, but I'm also developing Zeomine, an open source crawler that can test sites under load or do text analysis. Whichever you choose, make sure to find out which pages are most likely to give you trouble, and address that trouble early.
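
    If you want a zero-install starting point before reaching for those tools, a crude load test is only a few lines of Python (the URL, concurrency, and request count below are placeholders; point it at a staging copy, never your live site):

        import time
        from concurrent.futures import ThreadPoolExecutor
        from urllib.request import urlopen

        URL = "https://example.com/"   # placeholder: test a staging copy, not production
        CONCURRENCY = 20               # simultaneous "visitors"
        REQUESTS = 200                 # total page loads

        def timed_hit(_):
            start = time.time()
            with urlopen(URL, timeout=10) as resp:
                resp.read()
            return time.time() - start

        with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
            timings = sorted(pool.map(timed_hit, range(REQUESTS)))

        print(f"median: {timings[len(timings) // 2]:.2f}s  "
              f"p95: {timings[int(len(timings) * 0.95)]:.2f}s")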

Talk about this further

    If you like this topic, or are really looking to get going with a web project, talk with me further at my group: Open Source Search and Hosting Development. My goal is to get 100 people self-hosting their website within a year, and if you're thinking about it, come on over and let us help you get started!