May 2016

In March, I had the privilege of attending the premier big data conference, O’Reilly and Cloudera’s Strata + Hadoop World, in San Jose, CA. I’ll describe some of the more interesting sessions in detail below. The key technology areas and trends in focus were:

  • Machine learning
  • Streaming and real-time data processing
  • IoT
  • Real-time analytics

Hadoop Clusters – A few sessions covered the challenges of managing Hadoop clusters in enterprise environments. I attended two talks on the topic, one by GE and one by British Telecom (BT). GE’s talk focused on using a big data platform to shift the enterprise toward a data-driven culture by opening up the data and creating data lakes where it is accessible to the business. The BT talk covered successful design patterns for data hubs. Both talks highlighted the enterprise approach to data lakes and data hubs, as well as the challenge of managing jobs within a Hadoop cluster.

IoT – Intel hosted a session showcasing a data-streaming platform that helped Levi Strauss locate its items in a store. The solution used RFID tags (IoT) on each item in the store and a machine learning algorithm that learns over time where each item should be located. While the application of the technology was a bit simplistic, the platform itself was very impressive.

Machine Learning – Microsoft hosted a talk showcasing research that combines machine learning and neuroscience. Remarkably, they have developed an algorithm that can identify basic thoughts just by analyzing electrical signals from the brain; in this case, the algorithm was able to identify whether an individual was seeing a face or a building. They showed each image to a patient for a few milliseconds, and the computer could guess which picture the patient saw with over 90% accuracy.

Real-time analytics – Something that came up in a few sessions was the challenge of applying real-time analytics to massive data. MapR discussed one use case on credit card fraud. By combining streaming technology and machine learning, MapR impressively decided in under 1 second whether a transaction was fraudulent.


Future of Data Analytics

Another key takeaway from the conference was an overview of the future of data analytics. The following diagram, created by AMPLab at UC Berkeley, is a great representation. In summary, it shows, from the bottom level to the top:

  • Virtualization/distributed file system at the lowest level
  • Compression and encryption at the storage level
  • Spark as the processing engine (notice there is no alternative to Spark!)
  • Access – still too many options (I think this is the issue that needs to be resolved: there is no clear way to access the data)
  • Applications

Data Analytics Diagram