Search engine vs Database in BI part 2: structure is value

In the first part, we discussed the differences between structured and unstructured approaches. Now we will try to see how unstructured data tools can help us to structure data.

Adding structure is adding value

The eiffel tower

Without structure, the most visited attraction would have been only a bunch of metal.

As we saw earlier, the more the data are structured the more potential your data have. It is only a matter of hiding the complexity to each group of users, which is not always simple. So let’s stop thinking about the data and take a look at the application and the users.

If we need to evaluate the value of a BI app, one approach I often use is R.E.C.: Reference, Empowerment, Comfort. To empower users, we must have a data model as close as possible as people see their job. The more meaningful, relevant structures we can provide, the more value the app will have. So this is definitely a functional question.

So now that we know we need to provide the most meaningful data structures to users to maximize the value out of our application, these questions remain:

  • when data are unstructured at the origin, what can we do to structure them, in a meaningful way?
  • Should I use a db or a search engine for this?

 

The example: a list of recipes

A search engine is way better to process natural language than a database. It is also really good to fasten full text search. In general, search engines are great to extract info from unstructured structures. But should the data warehouse be also unstructured for BI purposes?

We will discuss where the value of data is located, and see if the functionalities of current search engines are sufficient. A concrete illustration of partially unstructured data is a list of recipes in full text. Let’s say you have this info about a recipe :

recipes

Recipes can be really unstructured ! courtesy of Pirate Johnny (c)

title : Burger

course category : entreecomplexity to prepare: 1 out of 5time to prepare: 10 mindescription: a delicious burger home made with a juicy steak, toasted bread, cryspy lettuce. Can be served with ketchup and love.Instructions : Toast the bread first. Then cook in a frying pan the meat. Salt at the end of the cooking.source: allrecipes.com

 

Now you want to be able to run statistics over a large set of recipes. The factual fundamental part of this recipe is the description, which is unstructured.

 

Extracting business related terms: the thesaurus

From a data perspective, one can see a recipe as a collection, an assemblage of ingredients. A “burger” is composed of “bread”, “steak”, “cheese”, “lettuce” and “ketchup” for instance.
The simplest way to model this situation is to use a tag mechanism. So we use a Full Text Indexer to process the description and extract the tags. Of course, I’ll need to get a base of what is a ingredient and what is not (in our burger “love” is not an ingredient, even if some french people would say so). The list of recognizable words is called a thesaurus.
With the help of our Full Text Indexer and the reference thesaurus, each recipe now have a list of ingredient tags associated with it. Now I’m able to count the ten most used ingredients, count the most used ingredient with steak…
But is using a Full Text Indexer with a simple thesaurus sufficient?

Adding structure to the maelström of tags

Common situation is a synonym or a connex term : a “steak” is indeed a “beef steak”. When one wants to build indicators for all the beef recipes, we need to count a recipe with steak as a recipe containing beef. The tags are connected between each others and we must model this in our thesaurus.

What will really sublime the data is indeed the organization of the tags. This organization requires hierarchies (steak is a specialization of beef, which is a specialization of meat), segmentation (alcohol-free, alcool), collections (japanelo, tacos, burritos are in mexican family).
These structures can’t be handled right now by any search engine I know of, but on the other hand only a few database engines are able to manage a thesaurus imho. So here, a search engine combined with a database really make sense. The search engine for the Full Text Indexer capability, and the database for the structure aspect.