There is no “unstructured data” in analytics

When evaluating analytics and Business Intelligence solutions, people often ask whether the software supports unstructured data.

I have a standard reply to this:

“In analytics there is no such thing as unstructured data, just data that structure has not yet been applied to”.

Now, I’m not just being a smart-ass (although that’s probably a part of it). You really can’t do any analysis — in the traditional sense — directly on unstructured data. The analysis you’re looking to do will be on meta-data that’s associated with or derived from the unstructured data in question.

Structured data

The simplest way to describe structured data is that it is any data that would make sense to organize and display in a table or a set of connected tables such as a relational data model. Structured data can be multi-dimensional, hierarchical, or highly complex, making it hard to conceptualize as tabular per se, but nevertheless having clearly defined attributes that describe and define entities in the model.

Anything you would logically put or view in a two-dimensional data table, e.g. in a spreadsheet or a database program, is by definition structured data.

Some will refer to data formats such as XML and JSON or tagged collections of documents as semi-structured data, but for the purposes of analytics, these are typically well-structured sources that lend themselves nicely to analysis.
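As a rough illustration of why such formats are already analysis-ready, here is a minimal Python sketch that flattens a small JSON document into table rows. The records and field names are invented for the example, and collapsing lists to a count is just one possible modeling choice.

```python
import json

# Hypothetical JSON payload; the field names are invented for illustration.
raw = """
[
  {"id": 1, "user": {"name": "ann", "followers": 120}, "tags": ["bi", "data"]},
  {"id": 2, "user": {"name": "bob", "followers": 45}, "tags": []}
]
"""

def flatten(record, prefix=""):
    """Turn nested dicts into a single flat dict with dotted column names."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects become extra columns, e.g. "user.name".
            row.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # Lists are collapsed to a count here; a real model might use a child table.
            row[f"{column}.count"] = len(value)
        else:
            row[column] = value
    return row

rows = [flatten(r) for r in json.loads(raw)]
for row in rows:
    print(row)
```

Every row ends up with the same set of columns, which is all a spreadsheet or relational table needs.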

Unstructured data

Unstructured data is essentially everything that does not fit the description of structured data above.

There are several types of unstructured data that people want to analyze, such as:

Text: Most commonly, “unstructured data” refers to some sort of text-heavy data source such as a collection of documents or websites, emails or social media posts. Much of the analysis that people are looking to do on this type of data is on meta-data that’s already associated with the source. Tweets are often mentioned as an example of unstructured data to analyze, but a lot of the analysis people want to do is on things like dates and times, Twitter ids, tweeters’ locations, follower counts, etc., all of which are explicitly available as meta-data with every single tweet through Twitter’s API. Examples of derived meta-data would be keyword indexes (every tweet in a collection of tweets where a certain word, hashtag or Twitter id is mentioned) or sentiment analysis (whether the tweet is positive, negative or neutral; angry, funny or excited; etc.). This kind of derived data requires more work, and often specialized software, to obtain.
Images: Similarly, the analysis people are looking to do on images is usually on the associated meta-data: time-stamps, geolocation, height and width of the photos, etc. All of this, and a LOT more, is usually available in the media file headers. Increasingly, however, people are looking to do some sort of analysis on the content of the images themselves, e.g. telling photos of cats from dogs (or hot-dogs from not-hot-dogs), detecting smiling faces, or even reading text in the photos, anywhere from license plate numbers to character recognition (OCR) of full-page scans. Such sophisticated meta-data can only be obtained using specialized software, and is typically applied in the data workflow well before the data reaches the analytics software.
Videos: Everything said about images above applies to videos as well, with the addition that the software needed to derive sophisticated meta-data from videos is harder to build, and the data volumes and computing costs involved are orders of magnitude larger.
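To make the distinction between associated and derived meta-data concrete for the text case, here is a small Python sketch. The posts and their fields ("created_at", "followers") are hypothetical stand-ins and do not reflect the actual shape of any API response. Counting posts per day uses associated meta-data only, while the hashtag index is a simple form of derived meta-data.

```python
import re
from collections import Counter, defaultdict

# Hypothetical posts; "created_at" and "followers" stand in for associated meta-data.
posts = [
    {"id": 101, "created_at": "2024-01-05", "followers": 300,
     "text": "Loving the new #analytics dashboard"},
    {"id": 102, "created_at": "2024-01-06", "followers": 12,
     "text": "Is #BI dead? #analytics says no"},
    {"id": 103, "created_at": "2024-01-06", "followers": 87,
     "text": "Just cat photos today"},
]

# Associated meta-data: already structured, so it can be aggregated directly.
posts_per_day = Counter(post["created_at"] for post in posts)

def build_hashtag_index(items):
    """Derived meta-data: map each lowercase hashtag to the ids of posts that mention it."""
    index = defaultdict(list)
    for item in items:
        # set() avoids listing a post twice if it repeats a hashtag.
        for tag in set(re.findall(r"#(\w+)", item["text"].lower())):
            index[tag].append(item["id"])
    return dict(index)

hashtag_index = build_hashtag_index(posts)
print(posts_per_day)
print(hashtag_index)
```

Both results are plain tables in disguise (day and count, hashtag and post id), which is what the analysis then runs on. Sentiment or image content would need a dedicated model in place of the regular expression.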

So, do you support unstructured data?!

As explained above, what people call “analysis of unstructured data” is in fact analysis of structured meta-data about unstructured data assets. Associated meta-data is already structured and lends itself well to analysis. So is derived meta-data, but it may be harder and more expensive to obtain.

Any analytics software can analyze this data. In most analytics software, there are relatively straightforward ways to read associated meta-data, and to obtain simpler kinds of derived meta-data such as text length or file size.
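The simpler kinds of derived meta-data take very little work to compute. The following Python sketch, using only the standard library, turns one text file into a single structured row. The function name, the column names and the sample file are made up for illustration.

```python
import os
import tempfile

def simple_metadata(path):
    """Return a structured row of simple derived meta-data for one text file."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    return {
        "file_name": os.path.basename(path),
        "file_size_bytes": os.path.getsize(path),
        "text_length": len(text),
        "word_count": len(text.split()),
    }

# Write a throwaway document so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "note.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Unstructured text, structured meta-data.")
    row = simple_metadata(sample_path)

print(row)
```

Running the same function over a folder of documents would yield one row per file, ready to load into a table.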

But if you’re looking for more sophisticated derived meta-data such as sentiment analysis, language detection, or image recognition, you’ll want to look into dedicated software that can produce this meta-data. In some cases it is useful if this software is nicely integrated with your analytics software, but in most cases the structured meta-data will land in a database or a data lake before it is ever imported into analytics software.
