
Whenever I talk to another developer and find out they’re not using version control (a.k.a. Source Code Management system, or SCM) as part of their workflow, I become a little shocked and horrified. There are just too many great reasons for using version control. Both Git and Subversion are free to use, relatively simple to set up, and give you snapshots to go back to anytime you break something in your code. An SCM is indispensable for any team of more than one developer, but it’s just as useful if you’re on your own.

Tons of developers love Git, and although Git does have some really great features when compared to Subversion, there’s one particular benefit to using Subversion that Git users rarely consider. When you use Git to commit hundreds or even thousands of revisions to your local machine… what happens if your hard drive crashes? Unless you have also set up a remote repository and gotten familiar with pushing, pull requests and merges, those commits are gone. Git actually requires a little more effort to get this big benefit, whereas it’s built into Subversion by design.

One of the primary benefits of using Subversion as your SCM solution is that it’s like having an insurance policy against your local machine breaking, your laptop being stolen, or your hard drive crashing. Every time you commit, you’re sending that code to another server. Granted, you can do the same thing with a master Git repository on a hosted service or on your own server in the cloud, but unlike Git, Subversion is meant to be run somewhere other than your laptop or workstation.

To extend the insurance metaphor, as great as it is having the code on your laptop backed up to a central Subversion server, it makes just as much sense to protect yourself from the possibility that your Subversion server’s hard drive will crash.

The best way to do this is to create a mirror repository on another server, and use the svnsync program to create a replica of your primary Subversion repository, or repositories, if you have multiple. If you’re familiar with master-slave database replication, Subversion repository replication is quite similar. Following are the steps and a few gotchas I learned this week as I finally had a chance to set up a proper Subversion mirror replication slave server.

Step One: Set up Subversion on a 2nd Server

This guide assumes you’re already running Subversion on one server. We’ll call that source-server from now on. The first step is to set up a second Subversion server to be used as a mirror. We’ll call that mirror-server from here on out.

What’s interesting to note here is that unlike most database replication setups, with Subversion, it doesn’t matter too much what version the server is, or what platform you want to run it on. Unfortunately, I didn’t realize that when I started. I purposely downloaded an older version of VisualSVN 2.06 bundled with Subversion 1.6.17, and later found out I would’ve been better off running the latest VisualSVN 2.5.3 bundled with Subversion 1.7.3. Why? As of Subversion 1.7, you can now use svnsync with the new --allow-non-empty option, which is designed for exactly the situation of starting to sync a mirror when it already has content in it. More details in Chapter 9: Subversion Repository Mirroring in the SVN Book.

In our case, we have our source-server repository running Subversion 1.6.6 on Ubuntu Server 10.04 LTS, and I set up the free version of VisualSVN 2.06 (bundled with Subversion 1.6x) on a Windows 7 PC.  Whether you install from source or from a binary, it’s important to at least know what version of Subversion each server is running.

Step Two: Dump the Source Repository

Unless you’re starting fresh with both an empty source-repo and an empty mirror-repo, chances are you have lots of commits on your current source-repo. You can actually “play back” these changes on the mirror-repo starting from revision 0 forward, but in most cases it’s faster to dump the source and load it onto the mirror. Note that svnadmin works on the repository’s path on disk rather than its URL, so run the dump on source-server itself:

#> svnadmin dump /svn/repositories/source-repo > source-repo.dump
#> tar czf source-repo.tgz source-repo.dump

Note for Subversion 1.6 and below: Unfortunately, the svnsync program has a limitation in that it assumes you’re starting your mirror-repo from revision 0. This is fine if your repository is small with only a few revisions, but can be quite slow if the reverse is true. Anyway, we’ll be performing some manual tweaks to our mirror-repo later on using this dump-and-load technique. If you’re already running Subversion 1.7 or greater, they’ve added a new feature to circumvent this limitation.

Step Three: Create the Mirror Repository

Most guides to using svnsync warn you to never commit to the mirror repository. The reason for this is that you only ever want your replication user (syncuser) to make changes. Otherwise you risk breaking replication on the slave. On the mirror-server open up a terminal or command prompt and type:

#> cd /svn/repos
#> svnadmin create mirror-repo

In my case, using VisualSVN makes this incredibly simple. Just open the administrative interface, right-click on the Repositories container and create a new repository. Don’t check the box to create branches, tags and trunk! Then add a new user, syncuser. This is the only user that will need access to the mirror-server repositories. More on this a little later.

Now at this point, we have an empty mirror-repo at revision 0. Here’s the first gotcha. The svnsync program needs to store some special properties about its own synchronization activities. It does this by setting some properties on the repository at -r 0. In order to do this there has to be a valid pre-revision property change hook (pre-revprop-change) on the repository that exits with 0. Some tutorials have you simply add the line exit 0 to your script, but I would recommend against this approach because it leaves the door open for some other user to modify properties and muck up the works. This hook is a perfect place to put your check that only syncuser is allowed to do things. Here’s the script I used for the pre-revision property change hook on Windows:

rem %3 is the name of the user requesting the property change
IF "%3" == "syncuser" (goto :label1) else (echo "Only syncuser may change revision properties" >&2 )
exit 1
goto :eof
:label1
exit 0

You might get errors such as svnsync: DAV request failed, svnsync: Error setting property 'sync-lock' could not remove a property  if you forget this step. It took me quite a while to come up with the above on Windows — the vast majority of samples online for pre-revision hooks are bash scripts.
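For mirrors hosted on Linux or another platform where the hook can be any executable, the same check can be sketched in Python. This is a minimal sketch rather than a drop-in hook: it assumes Subversion passes the repository path, revision, user, property name and action as arguments (so the user is the third argument, matching %3 in the batch script above), and the function name is my own.

```python
import sys

ALLOWED_USER = "syncuser"  # the replication account used throughout this guide


def check_revprop_change(argv):
    """Return the hook's exit code: 0 allows the change, 1 rejects it.

    argv mirrors what Subversion is assumed to pass to pre-revprop-change:
    [script, repos_path, revision, user, propname, action]
    """
    if len(argv) > 3 and argv[3] == ALLOWED_USER:
        return 0
    # Anything else, including a missing user argument, is rejected
    print("Only syncuser may change revision properties", file=sys.stderr)
    return 1
```

A real hook script would finish by passing the result to the shell, for example with sys.exit(check_revprop_change(sys.argv)).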

#> cd /svn/repos
#> tar xzf source-repo.tgz

Step Four: Load the source repository

This is where these directions differ from most of what you’ll find online. Typically, the next step you see in other tutorials is to start the synchronize process with

#> svnsync init file:///X:/Repositories/mirror-repo http://source-server/svn/source-repo

I tried that first, and it did work fine, but I could tell shortly that it would take a very, very long time to sync from -r 0 to HEAD over the network. I subsequently got some extra advice on the Subversion user’s mailing list to perform the sync on a local repository, but the technique I describe here works just as well.

We can import our dump file to our mirror repository with the svnadmin load command as follows:

#> svnadmin load mirror-repo < source-repo.dump

This command can take a while to run, proportional to the size and number of commits in your svn dump file.

Step Five: Manually Set Sync Properties

Now we have two repositories that are exact copies of each other, but they aren’t yet synchronized in a master-slave configuration, and they’re not automatically syncing just yet. If you try to run the svnsync initialize command now, you’ll get the following error:

svnsync: Cannot initialize a repository with content in it

This can be really frustrating if you’ve never used svnsync before. As I said earlier, the svnsync program expects to be initialized on an empty repository at revision 0, to play the revision history forward from there. In this case, our repository has thousands of commits in it already, and we want to start up sync from the current revision forward.

To do this, we have to understand what happens when calling svnsync initialize. What happens is the svnsync program creates three special properties at -r 0, for tracking its own syncing activities. These can be seen on an actively mirrored subversion repository with the svn proplist command.

#> svn proplist --revprop -r0 http://mirror-server/svn/mirror-repo
Unversioned properties on revision 0:
  svn:sync-from-uuid
  svn:sync-last-merged-rev
  svn:date
  svn:sync-from-url

You can ignore svn:date; only the svn:sync* properties are relevant to syncing. Okay, now that we know what the unversioned properties on -r 0 are, we’re going to hack our own values into those properties, using the svn propset command. We’ll take these one at a time.

To set the svn:sync-from-uuid property by hand, we need to find out the UUID of the source-server’s source-repo, with

#> svn info http://source-server/svn/source-repo
Authentication realm: <http://localhost:80> Subversion Repository
Password for 'yourusername':
Path: source-repo
URL: http://localhost/svn/source-repo
Repository Root: http://localhost/svn/
Repository UUID: 9d96f4c0-7d9a-42f6-b8c8-54e79b961fad
Revision: 3738
Node Kind: directory
Last Changed Author: jsmith
Last Changed Rev: 3738
Last Changed Date: 2012-03-01 16:38:38 -0700 (Thu, 01 Mar 2012)

Okay, there we can see it in the output, so copy and paste it — you don’t want to type that. Back on mirror-server we can now issue this command:

#> svn propset --revprop -r0 svn:sync-from-uuid 9d96f4c0-7d9a-42f6-b8c8-54e79b961fad http://mirror-server/svn/mirror-repo
property 'svn:sync-from-uuid' set on repository revision 0

That response means it worked. Okay, next, we can set svn:sync-last-merged-rev, the revision that was last merged into the mirror. To be safe, you should check the current revision number of both repositories and use the lower of the two. That will probably be your mirror-repo’s, which would indicate that someone has already committed new code on source-repo since the dump.

#> svn propset --revprop -r0 svn:sync-last-merged-rev 3738 http://mirror-server/svn/mirror-repo
property 'svn:sync-last-merged-rev' set on repository revision 0

Again, a successful response. Next, we need to set the source URL on the mirror repository using

#> svn propset --revprop -r0 svn:sync-from-url http://source-server/svn/source-repo http://mirror-server/svn/mirror-repo
property 'svn:sync-from-url' set on repository revision 0

Great, now we’re ready to tell our Subversion mirror to sync:

#> svnsync synchronize http://mirror-server/svn/mirror-repo
Transmitting file data .
Committed revision 3739.
Copied properties for revision 3739.

You may not see a confirmation message exactly like mine… in my case it just means that the mirror was able to fetch 1 new change from source-repo.
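If you would rather not copy and paste the UUID and revision numbers by hand, the values can be pulled out of svn info output with a few lines of Python. This is a rough sketch that only parses text you have already captured; the function names are made up for illustration, and the sample text is trimmed from the output shown earlier.

```python
def parse_svn_info(text):
    """Turn the 'Key: value' lines of `svn info` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that do not have a 'Key: value' shape
            info[key.strip()] = value.strip()
    return info


def last_merged_rev(source_info, mirror_info):
    """Pick the lower of the two revision numbers, as recommended above."""
    return min(int(source_info["Revision"]), int(mirror_info["Revision"]))


SAMPLE = """Path: source-repo
URL: http://localhost/svn/source-repo
Repository UUID: 9d96f4c0-7d9a-42f6-b8c8-54e79b961fad
Revision: 3738
"""

source = parse_svn_info(SAMPLE)
print(source["Repository UUID"])
```

The two results map directly onto the svn:sync-from-uuid and svn:sync-last-merged-rev values set by hand in this step.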

Last Step: Automate synchronization

Now that we have two Subversion repositories mirrored, we need to add a post-commit hook on our source-repo that pushes commits to the mirror. This is done by editing the repository’s post-commit hook on the source-server:

#> sudo vi /svn/repositories/source-repo/hooks/post-commit
svnsync --non-interactive --username syncuser --password XXXXXXX sync http://mirror-server/svn/mirror-repo/ &

That should be it. Commit some code as normal (to source-repo), then browse to your mirror-repo or do an svn info on it to make sure your commit made it over to mirror-server. If so, congratulations! You’ve just completed this tutorial and are twice as safe from Subversion hard drive failure as you were before.

One obvious security concern in the example above is the syncuser’s password sitting in clear text in the post-commit hook. It does not actually need to be placed there; I just wanted to make the point that your source-server has to be able to see your mirror-server and have the syncuser password hashed or stored. It’s not a big deal in our case, since our repos are on the LAN and nobody can fiddle without access to the box. In any case, there are a variety of methods out there to conceal your Subversion password. Storing encrypted passwords on Ubuntu Server without Gnome Keyring… now that’s a whole other story.
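One way to keep the password out of the hook is to let svnsync use credentials that are already cached for the account running the hook. The Python sketch below makes that assumption (check how your own server stores credentials before relying on it), reuses the flags from the shell line above, and launches the sync in the background so the commit is not held up. The function names are illustrative.

```python
import subprocess

MIRROR_URL = "http://mirror-server/svn/mirror-repo/"


def build_sync_command(mirror_url, username="syncuser"):
    """Same flags as the shell hook above, minus the clear-text password."""
    return ["svnsync", "--non-interactive", "--username", username,
            "sync", mirror_url]


def start_sync(mirror_url):
    """Start svnsync without waiting for it, like the trailing & in the shell hook."""
    return subprocess.Popen(
        build_sync_command(mirror_url),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

A post-commit hook written this way would simply call start_sync(MIRROR_URL) and exit.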


If you’re active in any field of computing you’ve heard the term Big Data thrown around in the past couple of years. If you’re in a business that has lots of data to analyze then you should have a big interest in Big Data, but you may not fully comprehend what we tech geeks are talking about. Big Data has become sort of a buzz word, and for a good reason. Big Data is a very important and growing facet of the modern technological world. My goal here is to give a view of Big Data from the techie standpoint and to introduce you in a general way to some technologies like Google’s BigQuery and Apache’s Hadoop that we techies immediately think of when we hear Big Data.

Big Data Defined

I get excited any time somebody mentions Big Data in connection with a project I’m working on, but I’m usually disappointed because a lot of people use Big Data as a term to emphasize the importance of a data set, rather than to describe the nature of the dataset. The other common misconception is just the sheer underestimation of how big Big Data really is. Do you have a database with 10 million customer records? To a techie that probably fits pretty squarely into the ‘regular data’ realm rather than Big Data.

I recently found a definition that I thought was good. Unfortunately it’s not concise, but I can summarize. Big Data doesn’t just refer to size in gigabytes of a dataset, but also the complexity of that dataset. A Big Data dataset is usually one that has a large volume of data, but also that data tends to be relatively unstructured (especially when it’s compared to the structured data usually found in a regular relational database) or has complex relationships. The full definition and explanation is on MIKE2.0.

Big Data Concepts

To fully grasp the role of Big Data technologies you should first know what I mean when I say MapReduce and NoSQL. These are topics that can get pretty tough, but I’ll define them generally.

MapReduce – MapReduce is a programming model developed by Google for the purpose of processing large amounts of data. If you want to perform calculations on a large set of data then MapReduce is for you.

NoSQL – NoSQL refers to a broad set of database technologies that break from the traditional model for storing data in a structured fashion. In NoSQL databases the emphasis is on quickly storing and reading massive amounts of data. As a trade-off they generally lose some consistency in terms of data access. This means it might take some time for data to propagate to all of the servers, so querying data can return out-of-date results. NoSQL implementers should evaluate whether or not it’s important to be able to query new data the instant it’s added to the database.
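To make the MapReduce idea concrete, here is a toy word count in plain Python. It runs on one machine and is only meant to show the shape of the model: a map step that emits key/value pairs, a grouping step, and a reduce step that combines the values for each key. A framework such as Hadoop distributes these same steps across a cluster. The function names are my own.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each call to map_phase only looks at one document and each call to reduce_phase only looks at one key, the work can be split across as many servers as you have.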

Big Data Technologies

So hopefully you’ve gathered by now that Big Data is a wide field with a number of things to consider when picking technologies to house and serve your data. Befitting a large technological problem, there are a number of solutions available, most of which aren’t a stand-alone answer to the Big Data problem. These software packages are best used in conjunction with other software and services to make up your whole data management solution. There are many to choose from, but I want to cover just a few of the most popular ones that you’re most likely to run into.

Apache’s Hadoop

Hadoop is a popular open source MapReduce framework managed and distributed by the Apache Software Foundation. Hadoop at its simplest is a framework for distributing MapReduce work across a cluster of many servers. Individual servers can be added or removed from a Hadoop cluster with little effort, so if you anticipate an incoming spike in data then you can add servers and then remove them after the spike subsides. This model of distributed computing across a cluster of inexpensive hardware is typical of most MapReduce frameworks. Apache also distributes a NoSQL database solution and a number of other Big Data software tools as a part of the Hadoop project. The popular data analysis software Tableau can actually integrate with a dataset stored in a Hadoop NoSQL cluster. If your data analysts already know how to use Tableau, the learning curve is pretty limited.

Google’s BigQuery

BigQuery is a very cool new service provided by Google for the storage and querying of big unstructured data. Google’s goal with BigQuery is to build a database that can store vast amounts of data and very quickly return results for ad-hoc queries (their goal was to be able to scan a 1 terabyte table in one second). You can access your data with SQL through a browser based interface or a REST based API. It’s important to note that BigQuery is primarily a tool for analysis. You can dump in billions of rows of records and perform fast ad-hoc queries to give you important actionable information about your dataset, but it’s not meant to be a database backend for an application.


MongoDB

MongoDB is a special kind of NoSQL database called a ‘document store’. Mongo is a database that allows you to easily ‘shard’ data across multiple servers. Much like a Hadoop cluster, you can create a Mongo cluster and add or remove servers very easily. Unlike Hadoop, Mongo is primarily a data storage system meant for the storage and quick retrieval of large quantities of data. In addition, Mongo is a fairly mature technology and has many features that make it a viable potential replacement for traditional relational databases as the backend database for applications.
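The idea of sharding can be illustrated without a database. The Python sketch below assigns records to servers by hashing a key. It is a simplified illustration of spreading data across a cluster, not MongoDB’s actual sharding algorithm, and the server names are hypothetical.

```python
import hashlib

SERVERS = ["shard-a", "shard-b", "shard-c"]  # hypothetical server names


def shard_for(key, servers=SERVERS):
    """Map a key to one server using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def distribute(keys, servers=SERVERS):
    """Group keys by the server that would store them."""
    placement = {server: [] for server in servers}
    for key in keys:
        placement[shard_for(key, servers)].append(key)
    return placement
```

With this naive modulo scheme, adding or removing a server moves most keys to a different server, which is one reason real systems use more careful partitioning.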


Redis

Redis is another NoSQL solution, but is very different from MongoDB. Redis stores arbitrary key-value pairs in memory, with persistence to disk as an optional extra. The goal of Redis is super-fast lookup and read times on data, and for this reason it competes directly with Memcached as a caching solution. Because the data lives in memory, you will normally pair Redis with some sort of on-disk database solution (another NoSQL solution, or even a relational database solution like MySQL). Redis is a great tool for dealing with Big Data in the context of an application that delivers data to many users.
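A common way to use an in-memory store alongside an on-disk database is the cache-aside pattern: check the cache first, fall back to the database on a miss, then store the result for next time. The sketch below uses plain Python dictionaries as stand-ins for both Redis and the database, so it shows the pattern only and makes no calls to a real Redis client.

```python
class CacheAside:
    """Read-through helper: serve from the cache, fall back to the database."""

    def __init__(self, database):
        self.database = database   # stand-in for MySQL or another on-disk store
        self.cache = {}            # stand-in for Redis
        self.database_reads = 0    # counts how often the slow path was taken

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and remember the answer
        self.database_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value
```

Reading the same key twice touches the database only once, which is the whole point when many users request the same data.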

When you work tirelessly to maximize ROI, you know all too well there are only two ways to achieve this: one is to increase revenue, and the other is to reduce costs. This post is about reducing costs. It’s rare to announce that suddenly you could reduce costs by over 90%, but today that is the case.

Recently, Amazon announced AWS Glacier, a new off-site data storage solution with 99.999999999% durability and it only costs $0.01 per Gigabyte per month. Ars Technica has a great write up on the announcement.

You may want to re-read that. Amazon Glacier is designed to provide average annual durability of 99.999999999% for your archived files. In short, you can rest assured that Amazon keeps enough copies around that you never need to worry about losing any of your irreplaceable files ever again.

Glacier is extremely low cost. Doing the math, a penny per gig is $10 per Terabyte per month. That’s more than 90% cheaper than using the next-cheapest solution out there, which is Amazon S3. If you’re using any other solution provider to store files in the cloud, you’re overpaying — by a lot.
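The arithmetic behind those numbers is easy to check. In the Python sketch below, the Glacier price comes from the announcement, while the S3 price of $0.125 per gigabyte per month is my assumption about S3’s standard rate at the time; substitute current prices before relying on the result. A terabyte is treated as 1,000 gigabytes, which is what the $10 figure implies.

```python
GLACIER_PER_GB = 0.01   # dollars per GB per month, from the announcement
S3_PER_GB = 0.125       # assumed S3 price per GB per month; check current pricing


def monthly_cost(gigabytes, price_per_gb):
    """Monthly storage bill in dollars."""
    return gigabytes * price_per_gb


def savings_percent(cheaper_price, pricier_price):
    """How much cheaper the first price is, as a percentage of the second."""
    return (1 - cheaper_price / pricier_price) * 100


print(round(monthly_cost(1000, GLACIER_PER_GB), 2))
print(round(savings_percent(GLACIER_PER_GB, S3_PER_GB), 1))
```

Under these assumptions, a terabyte costs $10 a month in Glacier and the saving over S3 works out to roughly 92%.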

It’s also secure—transfers are sent over a 256-bit HTTPS connection.

Finally, Glacier is simple to use. A glance at the Glacier developer documentation reveals the same type of easy-to-use REST API as for all the other AWS tools and services.

If you followed our 2-part tutorial on how to use Amazon EC2 and using AWS AutoScaling, you should be aware that Amazon has updated their API and platform SDKs to include support for Glacier. Head over to AWS to get the latest version of the AWS SDK in the language of your choice.

Working with Glacier

AWS Glacier uses a Vault and Archive metaphor. Before uploading archives, you need to create a vault in which to store them. If you’re familiar with S3, or Amazon’s Simple Storage Service, you know that you store files inside of a bucket. The concept with Glacier is the same, but the labels are changed to keep it clear and distinct.

You’ll first need to create an Amazon account and an AWS account if you don’t already have these, and sign up for Glacier service. Next, you’ll have to authenticate against the AWS account in which you want to create Glacier vaults and archives.

Vault operations include CREATE, DELETE and DESCRIBE, as well as some NOTIFICATION features. There are only two Archive operations, UPLOAD and DELETE.

As of the announcement, only the Java and .NET versions of the AWS SDKs have been updated to take advantage of Glacier, and the service is very bare-bones in this early release implementation. As this AWS Forum Post helps to explain, Glacier is not yet a fully turn-key solution for all your data storage needs. Some development is going to be required by you to take advantage of these incredible cost savings.

However, the good news is that a few entrepreneurial companies have already jumped on the Glacier train, such as CloudGates, a company that builds free tools on top of AWS services. With CloudGates you can use FTP, SFTP and WebDAV on top of S3 already, and they’ve also announced this week that they will be adding Glacier support.

Or, What auto maintenance can teach us about website maintenance, Part 2

Last week, we looked at website maintenance in a different light by comparing web site maintenance to auto care and maintenance. We made the case that more companies should treat website maintenance like they treat auto maintenance—it’s just common sense for your car, so why not for your website?

We also said that if business owners can arm themselves with a low-risk, agile application deployment process modeled after modern, scalable, cloud deployment strategies, they can reap benefits in cost savings, maximum performance and give their business the freedom and flexibility to not be at the mercy of any one particular web hosting company or online services vendor.

Get a good insurance policy

In Part 1, we said that regular website maintenance should be more common, as preventative care that heads off disasters, just like performing regular auto maintenance. We can draw an analogy to auto insurance, too.

In many ways, the idea of being prepared for anything is directly related to an earlier post on risk management as it applies to your web development. Just as having roadside assistance and an up-to-date auto insurance policy can rescue you in a crisis, having all your ducks in a row when it comes to your web application is just as prudent for business owners and their IT departments.

Stranded on the Internet Superhighway

Several years back, I got a frantic call from a client whose website wouldn’t come up. When I dug into the issue, it turned out the hosting company they had been using (OnSmart, remember them?) had gone out of business, leaving thousands of customers just like my client stranded on the internet superhighway. No support site, no portal, no home page, not even their phone numbers would ring. Would you be prepared for a similar unfortunate situation?

If you have all of the necessary accounts, usernames and passwords handy and the knowledge and the power to move hosts easily — armed with the information in this article — you can be ready for anything, even if it’s as extreme as natural disasters, a power outage or ISP bankruptcy. Though rare, these things can happen, and it is in your company’s best interest to be prepared for anything.

We’re about to tell you everything you need to know in order to be protected against the non-performance of any 3rd parties.

Look to the Cloud

It’s quite common nowadays to spin up application server instances across multiple availability zones, a technique that fans of Amazon Web Services recommend for ultimate high-availability applications. In essence, this type of deployment strategy can keep a hiccup in an entire Amazon data center from impacting your business. Traffic just shifts automatically from one data center to another. Amazon has only had issues a couple of times in the past decade, but again, being prepared for anything is becoming the norm. As they say, it’s not if it fails, it’s when.

On the other hand, can we borrow some of the best tools and techniques from scalable cloud deployment to make our website, online application and database platform as flexible as it can be? For instance, couldn’t we copy and tweak our cloud deployment scripts, such as those used with Jenkins, Chef, Puppet or Fabric, to include server backup, transfer, file sync, data replication and migration features? With our source code in a repository, we already have the means to publish our changes to a production server at the click of a button, so it can’t be that much different to back up and migrate our site to a different ISP, literally at the flip of a switch. Now that developer APIs are becoming more commonplace and expected from our SaaS providers, some DNS providers even make it feasible to automate name server updates as well, for a complete, end-to-end automated hosting migration solution.

Be prepared for anything

Following is a list of the most common accounts you will need to migrate your entire web application at the drop of a hat. It’s not an exhaustive list, but for most common types of web applications, you’ll need access to at least these accounts. If you use WordPress, we’ve included some extra goodies toward the end of the list that can also come in handy if you’re in a pickle.

The Top 35 Things You Absolutely Must Have to Move Your Website In 24 Hours or Less

1. The URL,
2. Your Username, and
3. Your password
…to your Domain Name Registry account (aka your Registrar)

I’ve run into a few clients over the years that don’t even remember where they bought their domain name. Having clicked on an ad for free hosting, they just signed up for a package deal consisting of domain name registration, DNS management, email and hosting all in one. It pays to use a company that specializes in domain names.

It is also highly recommended that you review your domain name contacts (there are 4 of them per domain: owner, billing, administrative and technical) to ensure each domain’s contact information, and especially each email address, is up-to-date. We also highly recommend that you consolidate all of your domains to a single account at a single registrar.

4. The URL,
5. Your Username, and
6. Your password,
…to your DNS management account

In most cases, companies simply use the DNS tools provided by their Domain Name Registrar. So these three items may be the same as 1, 2 and 3, but this isn’t always the case, so it’s crucial to know for sure. Both GoDaddy Total DNS, and Network Solutions DNS management tools are top notch.

7. Know whether you’re using your Domain Name Registrar’s default name servers, your DNS provider’s default name servers, or your hosting company’s name servers

Your domain name, where it’s hosted, and how traffic gets there are three separate features required to make a website work: registry, hosting, and DNS, respectively.

So this one can be a bit confusing if you’re not very familiar with DNS management. There are four distinct, but subtly different options. The first option is that you have your domain name registration, DNS management and hosting all with one company. This used to be more common than it is now, and only a few of the biggest companies are all-in-one anymore. GoDaddy and Network Solutions are common examples. I’ve never been a big fan of all-in-one solution providers, because all too often, each of the services they offer is average or poor. You can usually get much better tools and service for a lower price by using individual providers.

If your DNS is managed at your domain name registrar then you’re probably using your registrar’s default name servers, too. This is quite common. However, your DNS can be pointed from your registrar to a 3rd party DNS provider instead. This is often done to get the best service from a specialist that provides only DNS services. DNSMadeEasy and DynDNS are both terrific companies that focus on one thing—offering stellar DNS management tools.

Once you know exactly where your DNS records are updated, you’re ready to record the answer to #7 — you might be using your DNS provider’s default name servers, or you might be sending name resolution downstream to the hosting company where your website files are actually hosted. If you’re not sure, any support tech at your DNS management provider can tell you in a jiffy. In either case, it’s important to know exactly how you have yours set up.

We recommend using your DNS provider’s default name servers and managing DNS on your own for maximum flexibility and speed, because usually registrars aren’t great at hosting and server management, and likewise, hosting companies usually aren’t terrific when it comes to making quick or custom DNS changes.

8. Know what TTLs are for and when to change them

Few web developers know what DNS TTLs are for, and most rarely touch the defaults. The concept is simple, though: Time To Live (TTL) values tell all the other DNS servers around the world how many seconds they may cache your records before coming back to check for updates.

If you know you’re going to be making DNS changes in the near future, go ahead and drop your TTLs to one half, one quarter, or even one tenth of the default, which is typically the number of seconds in a day, or 86400. Realistically, you can set your TTLs as low as 3600 (one hour) or 1800 seconds (half an hour). This way when you do make changes, the world will find your new site much, much faster. No need to wait 24 to 48 hours for DNS changes to propagate around the world. TTLs are built into all decent DNS tools out there to prevent that lag time. Don’t forget to change your TTLs back to reasonable defaults a day or two after your updates are done.
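The numbers are simple enough to work out by hand, but here they are as a small Python sketch. It assumes the 86400-second default mentioned above and treats the TTL as the longest a DNS server might keep serving the old record; the function names are illustrative.

```python
DEFAULT_TTL = 86400  # seconds in a day


def lowered_ttls(default_ttl=DEFAULT_TTL):
    """One half, one quarter and one tenth of the default, in seconds."""
    return {
        "half": default_ttl // 2,
        "quarter": default_ttl // 4,
        "tenth": default_ttl // 10,
    }


def worst_case_wait_hours(ttl_seconds):
    """Longest time a server that just cached the old record keeps using it."""
    return ttl_seconds / 3600
```

For example, lowered_ttls() gives 43200, 21600 and 8640 seconds, and dropping from 86400 to 1800 cuts the worst-case wait from a full day to half an hour.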

9. The URL,
10. your username,
11. and your password,
… to your hosting account’s billing control panel.

Your hosting provider will almost certainly email you over and over when your credit card on file is about to expire. Many times, they’ll require you to put in more than one possible funding source, too. In any case, there are times when you might fail to receive such email notifications, and it can be a real bummer when your site goes down with a warning about your site being disabled.

Be sure your credit cards and contact info are up-to-date in your billing control panel. Murphy’s law dictates that a card failure is most likely to happen when the owner of the credit card is unreachable in the Brazilian rain forest for 2 weeks.

12. The credit limit,
13. expiration date,
… of the billing method on file at your hosting provider.

Clearly if more than one person can spend against this credit card, and they take a client to lavish dinner at a trade show, spending up to within $20 of the maximum credit limit for the card, and tomorrow your $100 hosting auto-bill fails, you could have some serious issues on your hands.

Under this category are additional, related sub-items such as knowing who else can use said credit card, what the spending limits are and what things can cause non-typical or unforeseen expenditures that might encroach upon your credit limit. Examples include videos hosted on your site that go viral and reach your bandwidth quota. See 14 & 15.

14. Your web hosting quotas, and
15. how close your common website usage patterns come to reaching these quotas.

Included in your hosting quotas are limits such as hard drive space and bandwidth. There is the normal usage from serving the site to visitors (outbound traffic), and there are also FTP quotas for uploading. One example of unusual usage is switching from manual FTP updating to an automated deployment system: it’s not unusual for your developers to get “deployment trigger happy” and send dozens or even hundreds of copies of the site to the server within a short period of time. Another is installing an automated backup tool that creates temp folders and leaves a copy of daily archives on the hard drive, or a botched update that fills up error logs with the same error message repeating thousands of times. MySQL database binary logs can fill up the disk too if you have a replication setup (see 19 – 22).
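
To get a feel for how close you are to a disk quota, a rough Python sketch like the following can total the size of everything under a folder and compare it with the quota. The function names and the idea of passing the quota in bytes are assumptions made for illustration. Your host’s control panel remains the authoritative source for quota figures.

```python
import os


def directory_size(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if os.path.isfile(full_path) and not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def quota_used(root, quota_bytes):
    """Fraction of the quota consumed by the files under root."""
    return directory_size(root) / quota_bytes
```

Running this against your web root, log folder and backup folder separately is one way to spot which of the scenarios above is eating your space.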

16. The URL,
17. your username,
18. and your password
… to your web hosting Control Panel

Typically hosting companies use CPanel or Plesk because these tools provide the most features for the lowest cost. Whatever back-office your hosting company provides, you need to know how to get in there to make changes and inspect configurations. This is usually the place where you create FTP accounts, browse the files and folders making up your website content, protect directories and so on. This is also where you typically manage MySQL databases, database usernames and passwords. (See 19 – 22).

19. The Hostname,
20. the username,
21. the password,
22. and the port number
… to connect to your database.

Hand in hand with these four items is the configuration file within your website application where they are stored. This config file provides the mechanism for your web application to connect to the database. If you use WordPress, these settings are stored in your wp-config.php file.
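
If you need to recover these values from a WordPress site in a hurry, a minimal Python sketch like this one can pull them out of the text of wp-config.php. It assumes the settings are written as plain define() calls with quoted strings, which is the usual layout. The function name is hypothetical, and values containing quote characters would need more careful parsing. Note that WordPress accepts a port as part of DB_HOST (for example localhost:3306).

```python
import re

# Matches lines such as: define( 'DB_NAME', 'example_db' );
DEFINE_PATTERN = re.compile(
    r"""define\(\s*['"](DB_NAME|DB_USER|DB_PASSWORD|DB_HOST)['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)


def read_db_settings(config_text):
    """Return the DB_* constants found in the text of a wp-config.php file."""
    return dict(DEFINE_PATTERN.findall(config_text))
```

Read the file with open(path).read() and pass the result in. Treat the output as sensitive, since it includes the database password.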

23. How to create databases, DB users and passwords, and grant privileges

CPanel and Plesk make this pretty quick and easy, using a point and click interface. Normally you create a database, you create a db username and password, and then assign that user to the database. It takes a minute with 3 steps. If you connect to your MySQL server on the command line, you’ll use the GRANT statement. Review how to do this from your CPanel, but also keep the link to the GRANT syntax handy if you need to manually manage database access.
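
For the command-line route, the three steps map onto three SQL statements. The Python sketch below only builds those statements as text so you can review them before pasting them into a MySQL session. The function name, the localhost default and the <password> placeholder are assumptions for illustration, and ALL PRIVILEGES may be broader than your application needs.

```python
import re

NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")
HOST_PATTERN = re.compile(r"[A-Za-z0-9_.%-]+")


def provisioning_sql(database, user, host="localhost"):
    """Build the create-database, create-user and grant statements as strings."""
    for value in (database, user):
        if not NAME_PATTERN.fullmatch(value):
            raise ValueError(f"unsafe name: {value!r}")
    if not HOST_PATTERN.fullmatch(host):
        raise ValueError(f"unsafe host: {host!r}")
    return [
        f"CREATE DATABASE `{database}`;",
        # Replace <password> by hand; it is deliberately not a parameter here.
        f"CREATE USER '{user}'@'{host}' IDENTIFIED BY '<password>';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{user}'@'{host}';",
    ]
```

Keeping the password out of the function means it never ends up in a script or shell history by accident.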


24. How to back up your MySQL Database

A couple of scenarios can come into play here. In most cases for typical websites, your database is small enough and your site usage patterns are such that you can use your CPanel tools to back up the whole database with a few clicks. One additional nice feature of using CPanel is that you can export, gzip and save a local copy all with one click.

If you use the command line, you’ll want to review the mysqldump command. Here are the basics:

mysqldump -u %USER% -p -h %IP_ADDRESS% --result-file=/tmp/outputfile.sql %DATABASENAME%

For maximum flexibility, work with your hosting company support team or web developer to create a simple backup script that runs daily. Ideally, get these backups sent off-site. S3 storage from Amazon AWS is a fantastic, durable, cost-effective method of storing off-site backups.
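
A daily backup script along those lines might look roughly like the Python sketch below. It wraps the mysqldump command shown above, names the output file by date, and gzips it. Because an unattended job cannot answer the -p password prompt, this version assumes the username and password live in a MySQL option file passed with --defaults-extra-file, which has to be the first option on the command line. The function names, the option file and the /tmp directory are placeholders, and uploading the result off-site is left out.

```python
import datetime
import gzip
import shutil
import subprocess


def dated_filename(database, day=None, directory="/tmp"):
    """Return a path like /tmp/shop-2024-01-31.sql for the given day."""
    day = day or datetime.date.today()
    return f"{directory}/{database}-{day.isoformat()}.sql"


def dump_command(host, database, outfile, option_file):
    """Build the mysqldump argument list; credentials come from option_file."""
    return [
        "mysqldump",
        f"--defaults-extra-file={option_file}",  # must be the first option
        "-h", host,
        f"--result-file={outfile}",
        database,
    ]


def run_backup(host, database, option_file):
    """Dump the database, gzip the dump, and return the compressed file path."""
    outfile = dated_filename(database)
    subprocess.run(dump_command(host, database, outfile, option_file), check=True)
    with open(outfile, "rb") as source, gzip.open(outfile + ".gz", "wb") as target:
        shutil.copyfileobj(source, target)
    return outfile + ".gz"
```

Scheduling run_backup once a day (for example from cron) and then copying the .gz file to off-site storage covers the basics described above.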

25. The hostname,
26. your Username,
27. your password
28. and port number
… for updating web files via FTP (or SFTP).

In most cases, the hostname is the same as your website domain name, but in some cases it may be the fully-qualified DNS name provided to you by the hosting company for the server your site is hosted on. This is often called the “temporary URL” and comes in the automated email that you get when you first sign up for hosting. In some cases, your host might move your site to a different server without even notifying you, which means you might not even know your current temp URL. It’s good to know if your DNS breaks and you need to back up your site in a hurry. The FTP port is commonly 21, and the SFTP port is normally 22, the same as SSH (FTPS, or FTP over TLS, typically uses port 21 in explicit mode or 990 in implicit mode). If you’re not 100% sure, check with your hosting company support staff. In some cases, you may also need an SSH key or to know how to edit the firewall settings to allow file upload access.
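
If you are unsure which port your host accepts connections on, a small Python sketch like this can test whether a TCP port answers at all. The function name and the timeout value are illustrative. A successful connection only shows that something is listening there, not that your login will work.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, checking ports 21 and 22 against your hostname tells you whether FTP or SFTP is even reachable from where you are, which helps separate firewall problems from password problems.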

29. Know your stack (LAMP? WAMP? WIMP? Other?)
30. Web Server Software,
31. and version
32. Server Side Programming Language
33. and version.
34. Using any custom extensions or libraries?

Your web stack refers to the platform and framework your site uses. Sometimes people also include all the files that make up your web application.

You might be running on a Windows server or a Linux server. You might be using Apache, LiteSpeed or another web server solution. You might need special libraries or modules installed. You might need specific versions of all of this software.

The files making up your website can come in lots of different formats, spread across dozens of nested folders. You might be using classic ASP or ASP.NET files (.asp or .aspx extensions), PHP files (.php) or Java (.jsp). Apache and PHP are the most common. If you’re using a framework or CMS such as WordPress, Drupal, Joomla or Magento, be sure to also take note of its version number.

Backing up your website content can be a little trickier, because it can consist of hundreds, thousands or even hundreds of thousands of files, as well as user-generated uploaded assets such as images, PDF files, and more.

For starters, back up or sync your web files via FTP at least once a month. A decent option here is to routinely use an FTP sync program, such as WS FTP Pro or Transmit. With FTP sync, the software figures out which files have been changed or added, so download size (bandwidth) is minimized and the backup is much faster.

For the command-line savvy reader, take a look at the rsync program. Again, your hosting company and web developer should be able to help you with a reliable off-site sync solution.
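
To show the idea behind sync tools, here is a simplified Python sketch that compares two local folders and lists the files that are new or changed, judging by size and modification time (rsync’s default quick check works on the same two attributes). It is not how any particular FTP client is implemented, and the function names are made up for this example.

```python
import os


def snapshot(root):
    """Map each relative file path under root to (size, whole-second mtime)."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            relative = os.path.relpath(full_path, root)
            files[relative] = (info.st_size, int(info.st_mtime))
    return files


def files_to_copy(source_root, backup_root):
    """Relative paths that are new or changed in source compared to backup."""
    source = snapshot(source_root)
    backup = snapshot(backup_root)
    return sorted(path for path, meta in source.items() if backup.get(path) != meta)
```

Only the paths returned by files_to_copy need to be transferred, which is why a sync is so much quicker than downloading the whole site each time.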

35. Where are your backups? You’ve got backups, right?

Knowing when your backups run, how to obtain them, how to extract them, how to inspect them for accuracy, and how to use them to replicate your site are all vital if you want an emergency procedure for moving your website. Almost every website hosting company can provide you with backups, and some assure you that this is a support function, that your backups are safe and made regularly, and that you need not worry about it. But what if your hosting company gets hit by a tornado or earthquake? What if they go out of business?
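
Inspecting a backup can start with two simple checks: does the archive open and list the files you expect, and does its checksum match the one recorded when it was made? The Python sketch below assumes your backups are tar archives (plain or gzipped), which may not match what your host produces. The function names are illustrative.

```python
import hashlib
import tarfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_listing(path):
    """Return the sorted names of regular files inside a tar archive."""
    with tarfile.open(path, "r:*") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())
```

A listing that is empty or missing your uploads folder is a warning sign. A real restore test onto a spare server is still the only full proof.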

Companies like Carbonite and Mozy advertise all the time about the prudence of personally keeping an offsite backup, right? And that’s just for your personal photos and music collection. Do you have offsite backups being made daily for the website files so vital to your business? Without automated offsite backups going to a 3rd party provider, tested and checked regularly, you could be in for an extremely rude awakening if anything goes awry at your host.

If you use WordPress, BackupBuddy is a great option covering all-important #35. It has a built-in admin panel that lets you enter your Amazon S3 bucket, Dropbox, Box or other 3rd party storage provider account info, and it backs up your site content and database automatically every day at the time you specify. Nothing could be simpler, and few WordPress plugins are as valuable for disaster recovery. Other alternatives include BlogVault, VaultPress, and WordPress Backup to Dropbox. Using one of these automated WP backup options is strongly encouraged.


  • Registrar Account Info
    • Company, URL, account number, username & password
  • DNS Account Info
    • Company, URL, account number, username & password
  • Hosting Account Info (Billing CPanel, Hosting CPanel)
    • Company, URL, account number, username & password
  • Database Credentials
    • Hostname, Username, Password, Port Number
  • Stack Info
    • LAMP (Linux Apache MySQL PHP) or Windows-based
    • Web Server being used
    • Scripting Language (PHP, ASP, JSP, etc)
  • Backup Strategy
    • Frequency, offsite location, checked

Get the template

To make it even easier to keep track of these crucial items, make a copy of our Google Spreadsheet and just fill in the blanks. Note: You need a copy of this spreadsheet for each domain name that has something running on it!


Well, that’s it. If you’re ever in a dire situation, we hope these 35 crucial items really come in handy and save you from the pain, frustration and agony that can come from a prolonged website outage. Take control today — and even if you have all of these items at your fingertips, double check and update them regularly.

If you want to really be sure that your backup and migration strategy is ready for prime time, work with your IT team to actually simulate an outage of one or more of these services. That’s the best way to know with certainty that your team and your process can recover quickly and you’re ready for anything.
