One of the primary challenges facing business analytics is tackling large sets of unstructured data. To get a feel for just how tricky this is, first imagine a strictly structured data set. Data entered into a well-designed template, with rules specifying the content of each field, is rigorously structured. Simple programming can search, sort, and sift data of this kind and extract its value.
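To make "simple programming" concrete, here is a minimal sketch of searching, sorting, and sifting a structured data set. The `Order` record, its fields, and the sample values are invented for illustration; the point is only that fixed fields with known types make each operation a one-liner.

```python
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical template: every record has the same three typed fields.
    customer: str
    region: str
    amount: float


orders = [
    Order("Acme", "EMEA", 1200.0),
    Order("Globex", "APAC", 450.0),
    Order("Initech", "EMEA", 80.0),
]


def search(records, region):
    """Return the records whose region field matches exactly."""
    return [r for r in records if r.region == region]


def sort_by_amount(records):
    """Return the records ordered from largest to smallest amount."""
    return sorted(records, key=lambda r: r.amount, reverse=True)


def sift(records, minimum):
    """Keep only records whose amount is at least the given minimum."""
    return [r for r in records if r.amount >= minimum]


emea = search(orders, "EMEA")
largest_first = sort_by_amount(orders)
big_orders = sift(orders, 400.0)
```

None of these functions has an obvious equivalent for an x-ray or a video clip, which is where the trouble starts.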
"Unstructured data" implies hybrid collections of text, image, video, and other kinds of files. Medical databases, for example, can contain written medical records, charts and graphs, x-rays, and other images. Web pages frequently host video and flash video, images and animation, as well as text. Data from such unstructured sources, especially on a large scale (think Twitter or Facebook, of course), is very hard to handle.
The problems involved in analyzing unstructured data can be set out concisely:
- Not only is the data heterogeneous; techniques for integrating it into one analytics environment may be heterogeneous, too. APIs are only part of the solution; enterprises also need to consider how to define unconventional data, and in particular how to identify and define what is of value in it.
- Traditional business intelligence environments are designed precisely to handle structured data (the simple database architectures described above). It's unlikely that an existing environment can straightforwardly be adapted to assimilate unstructured data. New tools and thinking are required.
- Analytical goals must be identified in advance. Otherwise, large quantities of "messy" and irrelevant data will be imported into the intelligence environment, reducing the visibility of valuable information.
In short, know what you're looking for and why you're looking for it -- and be prepared to innovate.
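As a rough illustration of that advice, the sketch below writes an analytical goal down as data and uses it to decide what gets imported at all. Everything here is an assumption made for the example: the `ANALYTICAL_GOAL` structure, the item fields, and the keyword matching are hypothetical, and a real pipeline would need far richer ways of judging relevance.

```python
# Hypothetical goal: we care about refund and delay complaints,
# and only in content types we know how to analyze.
ANALYTICAL_GOAL = {
    "accepted_types": {"text", "image"},
    "keywords": {"refund", "delay"},
}


def is_relevant(item, goal):
    """Return True if the item matches the stated goal."""
    if item["type"] not in goal["accepted_types"]:
        return False
    description = item["description"].lower()
    return any(keyword in description for keyword in goal["keywords"])


# Invented sample of mixed incoming content.
incoming = [
    {"type": "text", "description": "Customer asks about a refund"},
    {"type": "video", "description": "Product demo, refund policy mentioned"},
    {"type": "image", "description": "Screenshot of a delivery delay notice"},
    {"type": "text", "description": "Holiday greetings"},
]

# Filter before import, so irrelevant items never reach the environment.
to_import = [item for item in incoming if is_relevant(item, ANALYTICAL_GOAL)]
```

Here the video and the holiday greeting are dropped before they can clutter the intelligence environment.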
Vendors offer a range of tools designed to address unstructured data. NAS (network-attached storage) systems, for example, provide solutions for searching and sharing hybrid file content. Their analytical capacity has historically been limited, but there may be prospects for scaling it out. Commercially developed solutions specifically geared to extract value from unstructured data are increasingly available.
Enter the elephant.
Hadoop the elephant, that is: a framework for storing data on a distributed file system, potentially spread across hundreds or thousands of servers, and for running operations across those servers. It was named after a toy elephant belonging to the son of its creator, Doug Cutting.
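The idea of "running operations across the servers" can be sketched in a few lines. Hadoop's processing model is MapReduce; the example below imitates its map, shuffle, and reduce steps in plain Python on one machine, with nested lists standing in for servers. It does not use Hadoop's actual API, and the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each inner list stands in for the documents held on one server.
shards = [
    ["big data big insights"],
    ["data on many servers", "big servers"],
]

# In a real cluster each shard would be mapped where it is stored;
# here the shards are simply processed one after another.
mapped = [pair for shard in shards for doc in shard for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
```

Because each document is mapped independently, the work divides naturally across however many machines hold the data.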
In a development announced today, IBM is leveraging Hadoop to support InfoSphere BigInsights, an unstructured data analytics tool sitting on the SmartCloud platform. Both free and pay versions are preconfigured and can be operated by clients almost immediately to analyze mixed collections of text, video, images, and social media content.
The launch dovetails with IBM's recent acquisition of Hadoop specialists Platform Computing. It also underlines the importance of IBM's decision, announced in May, to support the open-source Apache Hadoop project rather than create its own version of Hadoop.
InfoSphere BigInsights will allow clients to interact with Hadoop-generated analytics through a user-friendly, browser-based interface known as "BigSheets." In launching a ready-to-use, Hadoop-based solution for unstructured data, IBM is ahead of Microsoft, which is planning to launch a beta service at the end of the year.
These are early days, of course, for business analytics in general and for the analysis of large, unstructured data sets in particular. Even so, IBM's deployment of the elephant is a big step toward reliable, real-time analytics with the versatility to master today's hybrid and rapidly evolving information landscape.
-- Kim Davis, Community Editor, Internet Evolution