Guest Column

Just How Big Is Big Data?

By Nick Burch, CTO, Quanticate

Today, Big Data is one of the hottest topics in almost every industry. We have seen entire conferences devoted to the subject, and many clinical trials seminars now focus on it as well.

Despite all of this interest, a great deal of confusion remains around Big Data. Not only are there never-ending debates about what Big Data is, there's a huge range of possible Big Data solutions out there to choose from, only a few of which will be appropriate for any given situation or problem. If you speak to some technologists from Silicon Valley firms, they'll swear you don't have Big Data until you have entire data centers on three continents. Many of the recent VC-backed Big Data splashes have been targeting the "few racks"-sized problem space. Hang out at the right tech events, and you'll see various groups demoing their Big Data solutions on collections of machines that'll fit comfortably in a shoebox.

On the other hand, there's a small backlash against the Big Data movement, with some companies and commentators explicitly saying they have a Small Data problem, and being proud of it. Many of these stress the importance of being able to process everything on one machine and of ensuring that processing is available to all — not just those with large budgets. That desire for universal accessibility can be countered, though, by on-demand cloud systems from the likes of Amazon, which allow anyone with a few dollars to spare on their credit card to spin up a temporary Big Data system for an hour and do their processing.

Where does this leave us mere mortals though, starting out on our use of Big Data? When we look at potential solutions, potential systems, potential frameworks, how do we know if they are right for us? When the suave salesperson from a Big Data startup phones, or worse, one from a large and expensive legacy provider, how do we know that what they're pitching is of the right scale for our needs? After all, something that scales to a handful of machines won't work when holding medical information for a whole country, while another that works best with data centers on three continents will be expensive overkill for those with low tens of terabytes of data to process. Both problems are Big Data, but what's right for one won't suit the other.

Considering such ranges of situations, we have to ask ourselves: Despite the hype, the VC funding, the marketing, and the buzz, is the use of a single label to cover the whole space becoming a problem? Is the term "Big Data" as a catchall still useful?

At this point, let us allow ourselves a brief diversion. Where else have we come across multiple different words for “Big”? For those living in places well served by a Starbucks, the answer is every morning! What can we learn from the use of “Tall,” “Grande,” or “Venti” as different measures of big?

On the academic side, Google has released a number of seminal papers on Big Data. Whether we're considering their paper on MapReduce, which led Doug Cutting and friends to re-architect Nutch along those lines (which eventually grew into Hadoop), or their more recent Spanner work, which relies on known error bars to allow provably correct distributed handling of “what happened first,” we see great leaps forward. The computer scientist in me is excited by the prospect of what can be done, and the elegance of what is possible. (If you're at all technically interested in distributed systems, the papers from the likes of Google or Amazon will give you hours of intellectual joy.) The pragmatist in me wants to know how we can solve last Tuesday's customer issue without committing to another rack in the data center. While some solve globally distributed problems, many of us face short-term multi-machine problems. Many of us foresee larger challenges, but not that many orders of magnitude more. Talk to a VC over a beer, or certain researchers, and you will hear of the huge Big Data challenges that exist out there, and the innovative giant projects that help solve them. Compared to what many of us face, it seems a different world. However, all of us are within the “Big Data” space. Faced with these divergent needs, can we really all say we are “Big Data”?

Given this range, how come one term has tended to stick? How much can be explained by the desire not to have a “small” problem, and how much can be pinned onto the desire to follow the buzzword and marketing effects of “Big Data”?

Another challenge we face is the fluidity we see as new systems and products are developed. Look at talks from Big Data events of two or three years ago, and it's striking how much of what then required bespoke functionality and hard coding is now available as standard in the latest tools. In some areas, what's hard or big today won't be next year, while other challenges remain. A new release might make enforcing security permissions easier, or allow new statistics to be run as standard, but the speed of light remains constant!

Despite this, a problem remains: how can someone new to this field work out which kinds of Big Data problems they have, and identify the right kinds of solutions? Plenty of companies (large and small) claim they have what you need, but how can you check before handing over large sums or spending lots of time? The boring, unsexy answer is, in part, requirements: a clear identification of what your problem is and what is needed to solve it.

If anything, the growth of Big Data has made the up-front gathering of requirements more important, not less. You need to think about where your source data is and what form it is in. Consider how spread out the data is, and how easy it is to run the processing or analysis close to where the data lives. Think about whether you can work directly on the source data, or whether it needs pre-processing. Think about how fast the data is growing and how quickly you need to include new data in the results. Work out the complexity of your calculations, where the outputs will go, and to what use they'll be put. Decide whether 10 machines for 10 minutes plus added complexity is better or worse than two machines and simplicity for an hour. As best you can, identify the problem first and then pick the solution, not the other way around!
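To make that last trade-off concrete, here is a minimal, back-of-the-envelope sizing sketch in Python. The hourly price and the per-machine-hour billing model are assumptions made purely for illustration, not figures from any real provider.

```python
def machine_hours(machines: int, minutes: float) -> float:
    """Total machine-hours consumed by a job."""
    return machines * minutes / 60.0

def cost(machines: int, minutes: float, price_per_machine_hour: float) -> float:
    """Estimated spend, assuming billing by the machine-hour (an assumption)."""
    return machine_hours(machines, minutes) * price_per_machine_hour

# The two options from the text: 10 machines for 10 minutes with extra
# complexity, versus 2 machines kept simple for an hour.
PRICE = 0.50  # assumed price per machine-hour, purely illustrative

complex_option = cost(machines=10, minutes=10, price_per_machine_hour=PRICE)
simple_option = cost(machines=2, minutes=60, price_per_machine_hour=PRICE)

print(f"10 machines x 10 min: {complex_option:.2f} (plus the cost of added complexity)")
print(f" 2 machines x 60 min: {simple_option:.2f} (simpler to build and operate)")
```

The raw numbers are only part of the answer, of course; the point is to write the comparison down before a salesperson writes it for you.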

At Quanticate, we've spent a lot of time analyzing our requirements and comparing these to the available systems. Despite several vendors suggesting the "silver bullet" of a single solution, we've opted to use two different Big Data systems that work together to solve the challenges we face. We're using one highly scalable and redundant Big Data load/sort/filter/retrieve/store system which also works with our existing technology stack, and a separate Big Data analytics and statistics system which can process data from both our existing and our new Big Data storage systems. By using two different systems, with different approaches, we actually end up with a simpler setup which both scales better and more closely meets our requirements.
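By way of illustration only, here is a minimal Python sketch of why decoupling storage from analytics can simplify things. The interface and names are invented for this column and are not a description of the actual systems we use.

```python
from typing import Iterable, Protocol

class RecordStore(Protocol):
    """Anything that can hand back filtered records, whether an existing
    stack or a new Big Data store (hypothetical interface)."""
    def retrieve(self, filter_expr: str) -> Iterable[dict]: ...

def summary_statistics(records: Iterable[dict], field: str) -> dict:
    """The analytics side: simple summary stats over one numeric field."""
    values = [r[field] for r in records if field in r]
    n = len(values)
    return {"count": n, "mean": sum(values) / n if n else 0.0}

def run_analysis(store: RecordStore, filter_expr: str, field: str) -> dict:
    # The analytics code only depends on the retrieval contract, so it can
    # sit on top of either storage system without caring which one it is.
    return summary_statistics(store.retrieve(filter_expr), field)
```

The design choice being illustrated is simply that the analytics layer talks to a small, stable contract rather than to one particular product.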

How can you begin to determine your needs based on your own problems and challenges? Once you’ve gathered your requirements, can you then group yourself into a “kind” of Big Data, to help you in your search for the right solution? Could you, in effect, say, “I've got 1TB of new data each day to summarize, so I've got a Grande Data problem”? Can we, as an industry, agree on classifications for the different kinds of Big Data? I'm not sure if we can, but I do know that the best way to solve the challenges we face in clinical trials management is by being honest with ourselves and picking the right system, regardless of what it’s called.