September 25, 2012
abruce

Big Data Technology Explained

If you’re active in any field of computing, you’ve heard the term Big Data thrown around in the past couple of years. If you’re in a business that has lots of data to analyze, you should have a big interest in Big Data, but you may not fully understand what we tech geeks are talking about. Big Data has become something of a buzzword, and for good reason: it is an important and growing facet of the modern technological world. My goal here is to give a view of Big Data from the techie standpoint and to introduce, in a general way, some technologies like Google’s BigQuery and Apache’s Hadoop that we techies immediately think of when we hear Big Data.

Big Data Defined

I get excited any time somebody mentions Big Data in connection with a project I’m working on, but I’m usually disappointed, because a lot of people use Big Data as a term to emphasize the importance of a dataset rather than to describe its nature. The other common mistake is simply underestimating how big Big Data really is. Do you have a database with 10 million customer records? To a techie that probably fits pretty squarely into the ‘regular data’ realm rather than Big Data.

I recently found a definition that I thought was good. Unfortunately it’s not concise, but I can summarize. Big Data doesn’t just refer to the size of a dataset in gigabytes, but also to its complexity. A Big Data dataset usually has a large volume of data, and that data tends to be relatively unstructured (especially compared to the structured data usually found in a regular relational database) or to have complex relationships. The full definition and explanation is on MIKE2.0.

Big Data Concepts

To fully grasp the role of Big Data technologies you should first know what I mean when I say MapReduce and NoSQL. These are topics that can get pretty tough, but I’ll define them generally.

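MapReduce – MapReduce is a programming model developed by Google for processing large amounts of data. The work is split into a ‘map’ step, which turns each input record into key-value pairs, and a ‘reduce’ step, which combines all the values that share a key into a result. Because each step can run on many machines at once, MapReduce is a good fit when you want to perform calculations on a dataset too large for one server.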

NoSQL – NoSQL refers to a broad set of database technologies that break from the traditional model of storing data in a structured fashion. In NoSQL databases the emphasis is on quickly storing and reading massive amounts of data. As a trade-off, they generally give up some consistency in data access. This means it might take some time for data to propagate to all of the servers, so a query can return out-of-date results. NoSQL implementers should evaluate whether or not it’s important to be able to query new data the instant it’s added to the database.
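To make the consistency trade-off concrete, here is a minimal sketch in Python. It is a toy model, not the behavior of any particular NoSQL product: the `Replica` and `EventuallyConsistentStore` classes and the explicit `propagate` step are hypothetical names invented for illustration, and a real database would copy data between servers in the background on its own schedule.

```python
class Replica:
    """One server holding a copy of the data (hypothetical toy class)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica, including one that is behind.
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Copy every pending write to the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("customer:42", "Ada")
stale = store.read("customer:42", replica_index=2)  # replica 2 has not seen the write
store.propagate()
fresh = store.read("customer:42", replica_index=2)  # now the write is visible
print(stale, fresh)
```

The first read comes back empty even though the write already succeeded, which is exactly the window an implementer has to decide whether the application can tolerate.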

Big Data Technologies

Hopefully you’ve gathered by now that Big Data is a wide field, with a number of things to consider when picking technologies to house and serve your data. Befitting a large technological problem, there are a number of solutions available, but most of them aren’t a stand-alone answer to the Big Data problem. These software packages are best used in conjunction with other software and services to make up your whole data management solution. There are many to choose from, so I want to cover just a few of the most popular ones that you’re most likely to run into.

Apache’s Hadoop

Hadoop is a popular open source MapReduce framework managed and distributed by the Apache Software Foundation. At its simplest, Hadoop is a framework for distributing MapReduce work across a cluster of many servers. Individual servers can be added to or removed from a Hadoop cluster with little effort, so if you anticipate an incoming spike in data you can add servers and then remove them after the spike subsides. This model of distributed computing across a cluster of inexpensive hardware is typical of most MapReduce frameworks. Apache also distributes a NoSQL database solution and a number of other Big Data software tools as a part of the Hadoop project. The popular data analysis software Tableau can integrate with a dataset stored in a Hadoop NoSQL cluster, so if your data analysts already know how to use Tableau there’s a pretty limited learning curve.
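The kind of work a MapReduce framework distributes can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not Hadoop’s API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example, and on a real cluster the framework would run the map and reduce steps on many servers and handle the grouping between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, the step a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different server; here we loop instead.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data is big", "data is everywhere"]
counts = word_count(lines)
print(counts)
```

Because `map_phase` looks at one line at a time and `reduce_phase` looks at one key at a time, neither needs to see the whole dataset, which is what lets the work be spread over as many servers as the cluster has.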

Google’s BigQuery

BigQuery is a very cool new service provided by Google for the storage and querying of very large datasets. Google’s goal with BigQuery is to build a database that can store vast amounts of data and very quickly return results for ad-hoc queries (their goal was to be able to scan a 1 terabyte table in one second). You can access your data with SQL through a browser-based interface or a REST-based API. It’s important to note that BigQuery is primarily a tool for analysis. You can dump in billions of rows and run fast ad-hoc queries that give you actionable information about your dataset, but it’s not meant to be a database backend for an application.
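For readers who haven’t written one, an ad-hoc analytical query is typically a SQL aggregate over a large table. The sketch below shows the shape of such a query in Python, with the standard library’s `sqlite3` module standing in for BigQuery so it can run anywhere. It does not call BigQuery at all; the `page_views` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used purely as a local stand-in for an analytics store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("CA", "/home", 120),
        ("US", "/home", 300),
        ("CA", "/pricing", 45),
        ("US", "/pricing", 80),
    ],
)

# An ad-hoc question: which countries generate the most views overall?
query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
"""
results = conn.execute(query).fetchall()
print(results)
conn.close()
```

The query is the interesting part: nobody planned for this question when the table was created, and an analysis service is built to answer it quickly even when the table holds billions of rows instead of four.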

MongoDB

MongoDB is a special kind of NoSQL database called a ‘document store’. MongoDB allows you to easily ‘shard’ data across multiple servers. Much like a Hadoop cluster, you can create a MongoDB cluster and add or remove servers very easily. Unlike Hadoop, MongoDB is primarily a data storage system meant for the storage and quick retrieval of large quantities of data. In addition, MongoDB is a fairly mature technology and has many features that make it a viable potential replacement for traditional relational databases as the backend database for applications.
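The idea behind sharding can be sketched as follows. This is a toy, in-memory illustration of routing documents to shards by hashing a key, and it is not MongoDB’s implementation or its driver API: `shard_for` and `ShardedCollection` are hypothetical names, and hashing with a modulo is just one simple routing strategy chosen for the example.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key, so the same key always routes the same way."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class ShardedCollection:
    """Toy document collection split across several in-memory 'servers'."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def insert(self, document):
        key = document["_id"]
        self.shards[shard_for(key, len(self.shards))][key] = document

    def find_one(self, key):
        # Only the one shard that owns the key has to be consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)


collection = ShardedCollection(shard_count=3)
for i in range(9):
    collection.insert({"_id": f"user-{i}", "visits": i})

found = collection.find_one("user-4")
print(found)
print([len(shard) for shard in collection.shards])  # documents held by each shard
```

A lookup touches a single shard no matter how many there are, which is why spreading data this way keeps retrieval quick as the dataset grows. Note that with this naive modulo scheme, changing the number of shards would move most keys; a real sharded database has to manage that rebalancing for you.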

Redis

Redis is another NoSQL solution, but it is very different from MongoDB. Redis stores arbitrary key-value pairs in memory, which is volatile. The goal of Redis is super-fast lookup and read times, and for this reason it competes directly with Memcached as a caching solution. Because the data lives in memory, Redis is typically paired with some sort of on-disk database (another NoSQL solution, or even a relational database like MySQL) that holds the durable copy. Redis is a great tool for dealing with Big Data in the context of an application that delivers data to many users.
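The usual way an in-memory store sits in front of an on-disk database is a read-through cache: check memory first, and only fall back to the slower database on a miss. The sketch below shows that pattern in Python under loud assumptions: a plain `dict` stands in for Redis, `SlowDatabase` stands in for the on-disk database, and both class names are hypothetical. No Redis client library is used.

```python
class SlowDatabase:
    """Stand-in for an on-disk database; counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def get(self, key):
        self.query_count += 1
        return self.rows.get(key)


class CachedReader:
    """Serve reads from memory when possible, falling back to the database."""

    def __init__(self, database):
        self.database = database
        self.cache = {}  # a plain dict standing in for an in-memory store like Redis

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no database work at all
        value = self.database.get(key)  # cache miss: ask the slower database
        if value is not None:
            self.cache[key] = value  # remember it for the next reader
        return value


db = SlowDatabase({"user:1": "Ada", "user:2": "Grace"})
reader = CachedReader(db)
first = reader.get("user:1")   # goes to the database
second = reader.get("user:1")  # served from memory
print(first, second, db.query_count)
```

The database is queried once even though the value is read twice. With many users requesting the same data, that difference is what keeps the on-disk database from becoming the bottleneck.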
