The word Hadoop is everywhere. But knowing how to spell it and knowing what it means for your company are two very different things. So, should supply chain professionals be concerned about Hadoop? Yes. Although Hadoop is probably used most widely for marketing applications at the moment, the marketing and demand management functions are becoming more tightly integrated in companies striving to become demand driven.
Hadoop is used widely in marketing because it is best suited to unstructured and semi-structured data: images, social media data, call center transcripts, and clickstream data, for example. So marketers use Hadoop to improve their understanding of customers and prospects, and their ability to sell them the right product, at the right time, using the right channel. This additional insight can be a gold mine for companies striving to use social media and internet traffic to forecast new product introductions, or for omni-channel retailers seeking to understand a promotion’s potential lift by channel.
Unstructured data is potentially useful for emerging supply chain risk management applications, to drive a better understanding of the supply chain programs at leading competitors, and for recruiting and retaining supply chain talent.
Discussing Hadoop can be tricky because it’s a bit like the blind men and the elephant. Hadoop is lots of things. Depending on which way you want to look at it, Hadoop is:
- A distributed data management platform, really a cut-down distributed operating system. It is designed to manage and work with immense volumes of data, and to scale linearly from just a few to thousands of commodity computers. In its earliest incarnation, it consisted of three parts: the Hadoop Distributed File System (HDFS) for data management, Map/Reduce for programming, and Hadoop Common to make it all hang together.
- Open source. Hadoop originated at Yahoo in 2005 as the infrastructure to support a web search project. Since then, Hadoop has migrated over to the Apache Software Foundation (“Apache”). As such, it is available for anyone to download and use, free of charge.
- An ecosystem. Like many open source projects, Hadoop has spawned a diverse and evolving ecosystem of enhancements, add-ons, and alternatives. Just to name a few, these include Pig, Hive, YARN, ZooKeeper, and Avro. The ecosystem also includes commercial vendors that provide value-added services based on Hadoop.
- A software project, not a software product. As noted, you can download Hadoop free of charge. But unless you have fairly rare technical skills, or plenty of time on your hands, implementing, scaling and supporting that distribution can be a bit of a challenge. Consequently, a number of companies now provide a more polished software distribution and supporting services. Hadoop is available as a managed service too.
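To make the Map/Reduce part of the first definition a little more concrete, here is a minimal sketch of the programming model in plain Python. It does not use Hadoop or any Hadoop API; the clickstream records and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel across many machines, with the data stored in HDFS.

```python
from collections import defaultdict

# Hypothetical clickstream records: (channel, product) pairs.
CLICKS = [
    ("web", "widget"),
    ("mobile", "widget"),
    ("web", "gadget"),
    ("store", "widget"),
    ("web", "widget"),
]

def map_phase(record):
    """Emit a (key, 1) pair for each record; here the key is the channel."""
    channel, _product = record
    yield channel, 1

def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence on a single machine."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

if __name__ == "__main__":
    print(run_job(CLICKS))
```

The point of the model is that each map call looks at one record and each reduce call looks at one key, so the work can be spread across as many computers as the data requires.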
Putting those definitions and technobabble to one side, it’s always important in the technology game to follow the money:
- Commercial Hadoop startups such as Cloudera, Hortonworks and MapR have recently scored massive venture capital investment. Cloudera closed a $900m round of funding in June. Not to be outdone, Hortonworks announced a $100m funding round in March, with an additional $50m investment in June. Likewise, MapR raised $110m in June, with Google Capital leading that round of investment.
- Large mature enterprise IT vendors such as HP, Intel and IBM are backing Hadoop too. HP invested $50m in Hortonworks in June (see above) to drive closer integration between Hadoop and HP’s other big data technologies. For its part, Intel was part of Cloudera’s recent $900m financing round, owns 18% of Cloudera, and has a seat on the board too. IBM has its very own Hadoop distribution and also offers Hadoop in the cloud.
So your IT department really shouldn’t be on the fence about Hadoop, because it’s a given. It’s a done deal. It’s going to happen. Hadoop has so much momentum at the moment that it’s hard to see an alternative data management infrastructure emerging in the foreseeable future. Almost anyone who wants to manage massive amounts of unstructured (or semi-structured) data will have Hadoop. So, instead of wondering what Hadoop is and whether it’ll be part of your future, get ahead of the game and ponder these more important questions:
- What’s the best Hadoop approach for my company? There are three main approaches that each trade off different cost profiles and the technical skills required: Downloading the free distribution from Apache requires intensive and ongoing technical skills; using a commercial distribution reduces the skills burden; pursuing the Hadoop-as-a-Service approach minimizes the technical skills needed.
- What analytic infrastructure are we going to use on top of Hadoop? Hadoop is just a data management platform, a cut-down operating system. By itself, it adds little value to an enterprise. In earlier IT generations, relational databases breathed life into the Unix operating system, and productivity applications made Microsoft Windows pre-eminent. In the same vein, choosing the right analytic database and toolset for Hadoop is more important than Hadoop itself.
Planning further ahead, it behooves supply chain managers to start asking pointed questions of their favorite supply chain application vendors: What hooks are they providing to integrate Hadoop databases, and what plans do they have to incorporate Hadoop as part of the supporting technology behind their own applications?
David White is ARC’s expert on Big Data, analytics, and Business Intelligence.