
[Editor’s note: Part of a continuing series about best practices for log file management]

Security takes many forms when we talk about computers, the Internet, and smartphones. With respect to log files, the security discussed here has less to do with who can read the files and more to do with who can modify their configuration. We’ll look at a few challenges that should be taken seriously and managed accordingly in an organization of any size.

While it was stated earlier that log files are ‘intellectual property’, they tend not to contain critically confidential information because there is no personally identifiable information in them, even though web data should be considered a critical business asset. Should an organization’s log files fall into the hands of someone without the right to them, there isn’t much harm that can be done.

For the most part, even if your log files ended up in the hands of your closest competitor, they would gain no advantage; the most useful competitive information is already available through services like Compete. A competitor likely wouldn’t even know what to do with the log files, let alone harvest them for information that gives them any edge in the market.

Best practices and policies

Really, this post is about policies rather than security. Every organization should have policies in place defining who is able to modify the structure of the log files for its web assets. Whether it is the corporate web site, the intranet, the extranet, or another online property, only a select set of personnel should be able to modify the log file structure and the data points it contains.

On more than one occasion within our client base, changes that impacted the data contained in the log files were made without company-wide notification. When that happens and the log file analysis continues as before, the resulting data can become extremely inaccurate.

To complicate matters, many large web sites are spread across multiple servers. When a change is made to the log file configuration on one server, it is critical that the same change be made immediately on the other servers supporting the site. Should those changes not be made, or be made differently than on the first server, the data can be inaccurate when processed by most web analytics engines.

A few tips

A common best practice when adding a new field to a log file is to add it to the end of the configuration string, so that the new data points are the last in each log entry. If the addition is made in the middle of the string, everything after it ‘shifts’ and the data may be attributed to the wrong field by the web analytics engine’s log file configuration. If the new data points are added at the end, they do not disturb the existing field layout. They may not get picked up immediately by the log analysis engine, but at least they will not affect the data negatively.
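
As an illustration, suppose the site logs in the W3C extended format (the same format that produces the ‘exYYMMDD.log’ files seen later in this series). A hypothetical #Fields directive with the new field appended at the end, rather than inserted into the middle, would look like this:

Before the change:
#Fields: date time c-ip cs-method cs-uri-stem sc-status

After the change (new field appended at the end):
#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(User-Agent)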

In short, log files need both security practices and policies to protect them and the data they represent. A great starting point is minimizing the number of people who have access to the logs, to help prevent unwanted modification. A record (or a ‘log’) should also be kept of changes made to the log file configuration, including the change made, the date, and the person who made it.
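
For example, a change record as simple as the following (entries illustrative) is enough to reconstruct what was changed, when, and by whom:

Date: 2010-06-01
Changed by: [name]
Change: Added cs(User-Agent) as the last field in the log configuration
Servers affected: all web servers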

These simple tips can help minimize damage in the event of an error and help to track the responsible parties when changes are made. Keep these in mind to save yourself hours of work, so you can get back to helping visitors and customers.

[Editor’s note: This post is part 6 of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]

The previous blog entry in this series [Best Practices for Log File Management], entitled ‘Remote Hosted Sites and ISP Policies‘, discussed the challenges ISP policies present and how they can impact your ability to get good data. The simple answer is to ensure that the ISP provides you access to your log files and that you warehouse them as part of your IT processes. Since log files can provide you with an abundance of information, and we discussed earlier how they can be a source of ‘intellectual property’, a simple script can save you a lot of data headaches.

In order to get your logs on a regular basis, you’ll need 3 things:

  • An FTP address (e.g. ftp.yourwebsite.com)
  • A username for the FTP account
  • A password for the FTP account

Once you’ve got this information, with just a few lines in a DOS-based batch file and a scheduler on the server, you can download the log from ‘yesterday’ on a nightly basis. In short form, the following roughly represents what a simple batch file to fetch the log for today’s date minus one day would contain.

File named: Log-download.bat

rem Write the FTP commands to a temporary script file, then feed it to ftp.exe
echo open ftp.yourwebsite.com > ftpcmd.txt
echo [username] >> ftpcmd.txt
echo [password] >> ftpcmd.txt
echo binary >> ftpcmd.txt
echo lcd c:\logfiles >> ftpcmd.txt
echo get access-dd-mm-yyyy.gz >> ftpcmd.txt
echo bye >> ftpcmd.txt
ftp -s:ftpcmd.txt

This simple batch file can be run automatically by a task scheduler or cron job on a nightly basis.
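
For instance, assuming the batch file is saved as c:\Scripts\Log-download.bat (path illustrative), a nightly 1:00 a.m. task could be registered on a Windows server with schtasks:

schtasks /create /tn "Nightly log download" /tr "c:\Scripts\Log-download.bat" /sc daily /st 01:00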

Now, depending on how your ISP stores these log files, matters can become complicated because the date is often part of the file name. In order to dynamically fetch the log for “today’s date minus 1 day”, additional scripting is required. While this requires slightly more advanced knowledge of scripting and DOS, there is a program called ‘doff’ that I found online which can calculate this date offset in DOS; a good IT person can manage this part of the process for you.

In order to accomplish this, I find the simplest solution is to create a batch file that outputs a secondary script. The primary batch file can be run automatically at 1:00 a.m., for example, and the secondary file (which is actually the output of the primary batch file) can be run at 1:05 a.m. The secondary script contains a dynamically generated file name in its ‘get’ command for the log file. (In the sample below, the primary batch file simply runs ftp.exe against the generated script itself.)

Sample code for the primary batch file might look something like the following in order to generate the script described above.

File named: Log-download-script.bat


@echo off
rem Assumes this batch file is run from c:\Scripts, where logs.txt is written

rem Build the FTP command script
echo open [ftp.site.ca] > logs.txt
echo [username] >> logs.txt
echo [password] >> logs.txt
echo binary >> logs.txt
echo cd [root folder] >> logs.txt
echo lcd "[destination folder]" >> logs.txt
echo prompt >> logs.txt

rem Today's date from doff, used to name today's log file
for /f "tokens=1-3 delims=/ " %%a in ('doff mm/dd/yy') do (
set mm=%%a
set dd=%%b
set yy=%%c)

echo get ex%yy%%mm%%dd%.log >> logs.txt

rem Yesterday's date (today minus one day)
for /f "tokens=1-3 delims=/ " %%d in ('doff mm/dd/yy -1') do (
set aa=%%d
set bb=%%e
set cc=%%f)

echo get ex%cc%%aa%%bb%.log >> logs.txt

echo bye >> logs.txt
echo exit >> logs.txt

rem Run ftp against the generated command script
ftp.exe -s:c:\Scripts\logs.txt
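
For reference, assuming the bracketed placeholders are filled in and doff reports June 15, 2010 as today's date, the generated logs.txt would contain something like the following (dates illustrative):

open [ftp.site.ca]
[username]
[password]
binary
cd [root folder]
lcd "[destination folder]"
prompt
get ex100615.log
get ex100614.log
bye
exit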

Finally, as there are many components and factors involved in this simple task, I recommend that the nightly process download the last 3, 5, or even 10 days of logs, depending on their size. This is a fail-safe way to avoid losing data to an Internet outage, a server failure, or any number of other factors.
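
As a rough sketch, assuming doff accepts larger negative offsets, the same pattern can be repeated in Log-download-script.bat (before the lines that write 'bye' and 'exit') to also request the log from two days ago, and so on for a deeper fail-safe window:

rem Two days ago; repeat with -3, -4, etc. as needed
for /f "tokens=1-3 delims=/ " %%g in ('doff mm/dd/yy -2') do (
set m2=%%g
set d2=%%h
set y2=%%i)

echo get ex%y2%%m2%%d2%.log >> logs.txt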

In short, it is critical to download and warehouse your logs on a regular basis, and the process can easily be automated so it does not become one more burden on your list of daily or weekly tasks.

PublicInsite Web Analytics Inc.

[Editor’s note: For more information on log file management, be sure to read Tyler’s ongoing series of blog posts on the topic starting with Best Practices for Log File Management.]

[Editor’s note: This post is part 5 of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]

For small, medium, and even large organizations, it is still relatively common to outsource the hosting of one or more Web sites. While this provides many advantages and can be cost-effective, it does pose certain challenges that are often overlooked. Within the scope of this blog, the core challenge is access to log files.

Depending on the policies of the ISP, log files may always be accessible, may require special permissions, or may not be available at all. This must be considered when choosing the ISP for your sites. For those ISPs that do provide access to logs, it is important to understand that this does not imply they warehouse months or years of log files. In fact, we’ve helped many clients and potential clients discover that their ISPs only warehouse logs for 30 to 45 days as an internal policy and do not provide any means to access files older than that.

As a case in point, PublicInsite worked with a client with a substantial Web site who outsourced a portion of that site to an ISP for load balancing reasons. Anticipating an extremely large amount of traffic in a very short time from visitors downloading a particular document, they chose to outsource rather than invest in hardware and bandwidth internally (not a bad decision at the time). A few months after launching the new document, wanting to understand the traffic (i.e. the number of downloads of the PDF document), we were told that this portion of the site was hosted by a local ISP that does not warehouse logs for more than 45 days. What did this mean? It meant that by the time we were contracted to analyze the data, the first few weeks of logs were long gone. We were unable to identify the true initial demand in days 1 through 15 after the release of the document, which would clearly have been the period with the largest volume of traffic (we’ve seen this pattern historically each year, so we’re confident of this!).

Regardless of how many days your ISP will warehouse your logs, it is an extremely simple process to create a script that runs daily to download them. Once you’ve established this process, all you need to do is check about once per month that the script is still working reliably.

It’s simple: don’t let your ISP’s policies interfere with your ability to do proper historical analysis of your log data. By asking a few simple questions of your ISP and downloading your logs regularly, you will never find yourself in a position like the example above.

PublicInsite Web Analytics Inc.

[Editor’s note: For more information on log file management, be sure to read Tyler’s ongoing series of blog posts on the topic starting with Best Practices for Log File Management.]

[Editor’s note: This post is part 4 of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]

It’s not uncommon for a large Web site to generate log files in the range of a few hundred megabytes per day. Compressed with a good archiving program (we use gzip), the files can be reduced to somewhere between 30 and 50 megabytes. Added up, a site getting around one million visits per month can store a full year of log files in less than 20 gigabytes of hard drive space.
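
For example, compressing a single day’s log (file name illustrative) is a one-line gzip command, and the archive can be expanded again later with the -d switch:

gzip access_log-2010-06-01
gzip -d access_log-2010-06-01.gz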

Considering the amount of knowledge and evidence-based data that can be extracted from logs, keeping a few years of historical logs on a few gigabytes of disk space is well worth it. In today’s fast-paced computer market, terabytes of disk space can be purchased for this purpose for hundreds of dollars, not thousands.

Just as the Canada Revenue Agency expects individuals and businesses to retain documents for tax purposes for at least seven years, I would recommend a similar rule of thumb for logs. Web sites evolve drastically every few years, so keeping logs for 10 years may not be worth it. However, benchmarking performance, growth, and improvements over the last five years is, and without the logs it would not be possible. Sure, you may have reports for each month over the last several years, but the moment you decide to build one larger, consistent report with the same filter set, it’s virtually impossible without all the logs.

Now, let’s consider this from the perspective of an individual’s role within the organization. If you are in a marketing role and your boss or the company president asks you for data comparing last year’s traffic for a sub-section of your site to the current year’s, could you provide it? Maybe. If you are in an information technology role and someone from the business side asks you for some data, could you provide it? Maybe. Regardless of position, by establishing a process with your IT team to warehouse logs for five years, these hypothetical situations, which are not all that hypothetical in today’s tough economic times, would play out differently. The answer would simply be ‘yes’. You may not have the knowledge or expertise to answer the questions being asked, but with the logs you’d have the option to outsource the work to someone who does. Without the logs, you’re once again left in a tough position and will likely be unable to answer the requests.

It’s simple: keep five running years of logs and you’ll always be able to answer questions about traffic to your Web site!

Tyler Gibbs
Director, Products and Operations
PublicInsite Web Analytics Inc.

[Editor’s note: This post is part of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]

[Editor’s note: This post is part 3 of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]

The difference between this blog entry and the previous one (“Log File Formatting, Naming, and Compression – Single-server Environment”) is simply the number of servers in the Web hosting environment. When an organization uses more than one server to host a single Web site, log file management becomes a bit more complex, and so does Web analytics.

All the principles behind log formatting and compression remain the same. The difference lies in the naming of the files. More specifically, a folder structure adds a layer of complexity. In a single-server setup, the log files can be stored under an appropriate name in a single folder. When more than one server is involved, the logs must share the same name and be stored in different folders.

For example, if we have a Web site which is load balanced across 2 servers, we could store the logs on a networked drive as follows:

Server 1 Log:
X:\Site Logs\Server1\access_log-2010-06-01-w01.gz

Server 2 Log:
X:\Site Logs\Server2\access_log-2010-06-01-w01.gz

When managing log file analyzers for multi-server sites, it is common to require the same filename for each day’s logs. The name of the log, together with the data inside it, is how the log analyzer ‘stitches’ the data together. When the Server 1 and Server 2 log files share the same naming convention, the log analyzer can parse the logs alphabetically, which also happens to be chronologically by date (when a proper naming convention is used). It is important to understand that if the names are not exactly the same for each server’s log file, the import process in an analytics tool may simply import all the logs chronologically from one server, then from the next. If this happens, the analytics engine cannot ‘stitch’ together a particular visit (which may include pages served on both servers) as a single visit, which results in inflated and inaccurate ‘visit’ counts (page views remain the same, however; a page view is a page view regardless).

Clearly, it is impossible to store the two log files in the same folder under the same name; hence the need for the folder structure!

In conclusion, if your organization is growing and develops a need for load balancing, keep this in mind; analytics is not always treated as a priority when expanding the technology behind a Web site. It’s not complicated, and when done from day one it can save a lot of lost data and headaches in the future.

PublicInsite Web Analytics Inc.

[Editor’s note: This post is part 3 of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]
