The State of Big Data & Hadoop
Is Big Data still a thing? I think so… it’s definitely not as sexy as it used to be. Although we certainly live in the Data Age, I think Big Data has finished its hype cycle and has settled into its rightful place in specific environments.
So what is Big Data? Well, it still really just means a collection of data that can’t fit on a single hard drive.
Examples of this include:
- The New York Stock Exchange, which generates about one terabyte of new trade data per day. [0]
- Facebook, which as of 2008 was already storing more than a petabyte of data for its 10 billion photos. [1]
- The Large Hadron Collider in Switzerland, which produces about 15 petabytes of data per year. [2]
Now that’s BIG DATA. And yet, at the beginning of the Big Data / Hadoop hype cycle, you had companies rushing to install Hadoop in their own environments when they had less than a terabyte of data. I saw firsthand someone use it for 10GB of data. These shops quickly realized how much effort it takes to operate a Hadoop cluster, and that operational overhead outweighed whatever gains the technology brought them. They would have been better off with a regular RDBMS such as MS SQL. So at the end of the day, I think only a niche set of companies really need Hadoop, and that’s where we are now.
The reason to use Hadoop over a regular RDBMS comes down to the structure of your data. An RDBMS is Schema-On-Write; Hadoop is Schema-On-Read. If your data is unstructured, it’s pretty difficult to put a schema on it before you write it to the RDBMS. It’s better to throw it into HDFS as-is, then use a MapReduce / Tez / Spark job to process it after you’ve figured out how to chop it up. Not only that, but leaving the raw data in HDFS gives others the opportunity to look at it differently than you did. Where I’ve seen problems is when people take structured data that would perform wonderfully in an RDBMS, put it on HDFS, and run it through Hive. We’re talking small data sets here, even. People assume Hadoop will run it faster than MS SQL, for instance, because Hadoop is built of multiple servers. But the truth is, it’s going to be much slower than a simple MS SQL setup, especially for a basic select statement. Hive takes roughly ten seconds just to schedule a YARN container for the job, and even longer if the cluster is busy (or your YARN queues aren’t set up correctly 🙈). These days, RDBMS systems have caught up and can out-perform something like Hadoop in some cases. Again, it depends on the data and the processing requirements.
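To make the Schema-On-Read idea concrete, here’s a minimal PySpark sketch. The HDFS paths and field names are hypothetical; the point is the pattern: the raw data was written with no schema at all, and structure only gets imposed when a job reads it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw, semi-structured events were dumped into HDFS as-is; no schema was
# required at write time. (Path is a made-up example.)
raw = spark.read.json("hdfs:///data/raw/clickstream/*.json")

# The schema is imposed now, at read time: pick the fields we care about,
# cast types, and drop malformed rows.
events = (
    raw.select(
        F.col("user_id").cast("long"),
        F.to_timestamp("event_time").alias("event_time"),
        F.col("page"),
    )
    .where(F.col("user_id").isNotNull())
)

# Land a cleaned, columnar copy next to the raw zone; the raw files stay
# untouched so someone else can slice them a different way later.
events.write.mode("overwrite").parquet("hdfs:///data/clean/clickstream/")
```

Someone else with a different question can go back to the same raw files and carve out a completely different schema, which is exactly what you give up when the data is forced into a table shape at write time.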
When I was the subject matter expert for Hadoop & Big Data at a large insurance corp in the US, I saw interesting things. Insurance companies have vast amounts of data. We were in the petabytes realm when I was there, and growing. How do you think they come up with quotes for your assets? What determines how much they should charge you? What determines how much profit can be made on your policy while still being able to cover you? There are a lot of data models that help decide this, and they are built on huge amounts of information, such as current prices of goods & services across the economy, their supply and demand, etc. Also, your historical habits are trended and used to score you as a customer: how much you spend beyond your means, how financially savvy you are, and so on. If you are constantly irresponsible, running your credit cards to the max and not paying them off, you are a risky person to cover, which means you will pay drastically more for that insurance policy than the average person. What’s even more surprising is that this existed long before “Big Data” was coined. Long before Hadoop. Hats off to the data professionals who had to build this stuff back in the day using DB2, mind-numbing SQL, and SAS.
These days, all of the data used to generate these results is stored in various data stores, from Excel spreadsheets on SharePoint held together with bubble gum, to IBM DB2, various RDBMS’s, HDFS on Hadoop, and even Microsoft Parallel Data Warehouse. To my eyes, the data world has been a complete mess, and there’s a million ways to store data. This brings a challenge to modern day data processing. Good news for Data Engineers, more painful for Data Scientists. Why? Because you need to clean all of it. Data is dirty, and sometimes older than you, from back when standards didn’t really exist. The last thing a Data Scientist wants to do is clean up data. They want to get to the point where they can run their model on it and find the answers. The Data Engineers, however, should be loving the cleaning and ETL work, because this is one of their main tasks. The goal is to bring all of the data from these various sources into a central location, then process and curate it further. In other words, ETL, or ELT, depending on your requirements. Recent tools such as Airflow & Luigi make this really fun to do. One of the main goals of the environment I describe here is to build a model off the data, feature engineer it, and get it to production so it can be used to make decisions. Your Data Engineers get the data to where it needs to be, maybe do some light machine learning, etc., but the goal is to get the data to a clean and central location for the Data Scientists to model on. At that point, the Data Engineers are responsible for productionizing that model and getting it out the door. At least that’s how it should work. Unfortunately, when I was a Data Engineer, this process was slowed down by office politics, turf wars, and gatekeeping. So we were in the awkward situation of doing Data Engineering and attempting to do Data Science without the full picture. Maybe I’ll write more about this in another post, because it’s pretty funny looking back on it.
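As a sketch of what that central ETL flow looks like in the newer tooling, here’s a minimal Airflow DAG (assuming Airflow 2.x; the DAG id and the task callables are hypothetical placeholders, in reality they would pull from the source systems, normalize the records, and land them in the central store):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical extract/clean/load steps; each would talk to the real
# source systems (DB2, SharePoint exports, HDFS, ...) in practice.
def extract():
    ...

def clean():
    ...

def load():
    ...

with DAG(
    dag_id="central_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_clean = PythonOperator(task_id="clean", python_callable=clean)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies are plain Python, reviewable in a PR, and versioned in
    # git, which is a big part of the appeal over the old GUI tools.
    t_extract >> t_clean >> t_load
```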
What else did I see at the insurance company, though? Hadoop is a pain in the ass, and only a very few people know how to scale it and make it reliable. I happen to be one of those people, but I don’t know if that’s good or bad. Because of this, people are abandoning Hadoop. If you aren’t a Fortune 500 company, there’s really no point. Hadoop experts are expensive, hiring one is even more so, and becoming one costs a lot of time and resilience to pain. So a common thing you hear now is “Hadoop is dead!”, and when an outside consultant said that at the insurance company, my whole organization laughed it off. It didn’t make any sense to us. But I eventually realized what this person was saying, though I think they were too dismissive and didn’t understand the full picture.
Is Hadoop dead?
For smaller & newer shops it may seem so, but they don’t have enough data to even need it. Since they are mostly Cloud oriented, once they do need it they have vast amounts of compute available at the click of a button. There are Python libraries now that can do what you need on a single EC2 instance. Dask comes to mind here. I remember working on a pretty annoying data clean-up problem with a co-worker. Our on-prem Hadoop cluster was pretty flooded with jobs (we were running around 8,000 a day) and our clean-up job was taking too long to get through. So instead, we found a beefy Linux server with 64GB of RAM and 32 CPU threads, and let it rip using Dask, and we were shocked at how much faster it was. Obviously this is subjective and depends on the problem, but we saw it at least when compared to our heavily used Hadoop cluster. The point of Hadoop, after all, is / was the ability to reliably store and process unstructured & semi-structured data. These libraries, and future ones, will be able to do that easily as long as you have the proper instance size, and you can save cost by just using a spot instance until you’re done. But, as with anything, there’s always a counter-argument, and here it is that Hadoop-like products do exist and are used in the Cloud, for example Amazon EMR. Newer startups even use it. So I think what it really comes down to is what the Data Engineering team is used to doing, and what they are most comfortable with. With Hadoop already having a big footprint in large corps, I think you will find Engineers who actually do find it more comforting to work in it, and as they move on to other companies, their influence will spread.
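As a rough sketch of that single-machine approach (file paths and column names are hypothetical, the clean-up logic is just an example), the whole job can look like this with Dask:

```python
import dask.dataframe as dd

# Dask reads the raw files in parallel across all the CPU threads and
# spills to disk if the data exceeds RAM, so one beefy box goes a long way.
df = dd.read_csv("/data/raw/claims-*.csv", dtype={"policy_id": "object"})

# Typical clean-up: drop rows missing a key, de-duplicate, coerce a
# dirty numeric column.
cleaned = (
    df.dropna(subset=["policy_id"])
      .drop_duplicates(subset=["policy_id", "claim_date"])
      .assign(amount=lambda d: dd.to_numeric(d["amount"], errors="coerce"))
)

# Nothing has executed yet; Dask builds a task graph lazily and only runs
# it when the output is written.
cleaned.to_parquet("/data/clean/claims/")
```

The same pandas-style code runs unchanged whether the box has 8GB or 64GB of RAM, which is exactly why it beat waiting in line behind 8,000 YARN jobs.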
It’s worth mentioning that these days, Hadoop seems to not be the destination anymore. Amazon EMR, in a lot of cases, is only spun up temporarily to process data and then turned off, probably because nobody wants to run and maintain a permanent Spark cluster. I’ve seen shops that were pretty invested in Hadoop throw all of their raw / messy data into Hadoop, clean & curate it, and store the cleaner versions in something like Snowflake. But you could just as well do the same thing by creating a clean data location in HDFS for that curated data, as sketched below. I’m not so sure maintaining two separate platforms like that is sensible, even if one is hosted for you. At the time of writing, Snowflake is at the peak of its hype cycle. A lot of FOMO is happening.
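For illustration, here’s roughly what that last hop looks like from Spark, assuming the spark-snowflake connector is on the classpath; the connection options, paths, and table names are all hypothetical. The final line shows the alternative of just keeping a curated zone in HDFS instead of a second platform:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curated-to-snowflake").getOrCreate()

# The curated output of the cleaning jobs; here just re-read from a
# hypothetical clean zone in HDFS.
curated = spark.read.parquet("hdfs:///data/clean/claims/")

# Hypothetical connection options for the spark-snowflake connector.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "CURATED",
    "sfWarehouse": "ETL_WH",
}

# Option 1: push the curated data out to Snowflake...
(curated.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "CLAIMS_CURATED")
    .mode("overwrite")
    .save())

# Option 2: ...or skip the second platform and keep a curated zone in HDFS.
curated.write.mode("overwrite").parquet("hdfs:///data/curated/claims/")
```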
But I know for a fact that in large environments (government, insurance, finance, ISPs), Hadoop is the destination and sits at the center of all data management. My conclusion: it’s not dead, it’s just not hipster to use anymore.
Challenging the status quo in Data Engineering
The old school Big Data environments rely on heavy GUI applications to push data around and to use it, tools like Syncsort, IBM SPSS, and Talend. When I’ve encountered these tools in my career, I’ll admit, I was very biased against them. I dislike GUIs because they limit you: you can only innovate and think outside the box as much as they allow, and it’s very difficult to add much-needed features to them. But there are trade-offs, I get it. If you build your own tool, you can spend all your time maintaining it instead of working. I believe you can achieve a balance, though. There are some really awesome tools replacing these and making an impact, such as Airflow & Luigi, H2O, and Apache NiFi, which, by the way, is terrifyingly productive for being a GUI; probably because the NSA created it 😎. The major difference between these and the older tools is that they are open source. Thus, there are features for just about any problem you can face, and if you find something missing that’s desperately needed, you open a PR with the proposed changes. No politics, no bullshit, just pure engineering.
There’s definitely friction between the new wave of Data Engineering and the old. For ETL tools at least, there’s a general feeling in the community that these GUI tools need to be replaced by tools such as Airflow and Luigi; this new wave even has a new title: Data Engineers. Funny enough, you now see the older generations rebranding as Data Engineers, and kudos to them, I’d do it too, and I totally see them as the O.G.s. But when they rebrand without embracing the new age of tooling, that’s when things become sad and office politics become rampant.
I don’t think older tech like this should be completely replaced; it obviously has a purpose. There’s a place for it, especially in highly regulated environments such as insurance. But I think there can be an easier blend between the new and the old. The overall problem is that the data world is outdated; it needs a reboot. It needs a “DevOps” type of methodology to bring it into the future. Fortunately, this is starting to become a thing with the “DataOps” push.
Summary
Regardless of what you think, Hadoop is alive and well in these big companies, and they continue to invest in it. Yes, Hortonworks and Cloudera have struggled lately, and their recent merger shows this, but I think all of it means the market footprint isn’t as big as everyone thought. Not everyone has the wallet for it, and it sometimes struggled to deliver on false promises (cough IBM Watson cough).
No matter what Hadoop’s fate is, great things came out of it, such as Spark and Kafka. As the deemed Kafka expert at my current gig, I’m excited to see where it goes. Now that Confluent is pumping a lot of investment into it, I’m sure it has a bright future. It has much of the same operational complexity as Hadoop, since it’s a distributed system. I just hope they make it easier to manage, so people adopt it much more. Recent PRs on their repository show that they are definitely removing the external ZooKeeper dependency, which will be nice.
So what’s the state of Big Data? It’s one piece of the Data Management and Analytics ecosystem. We’re still figuring it out, and it’s ever-changing. This rambling is an effort to answer this question for myself, and maybe to give ideas to others. Thanks for reading.