Open versus deep: How we learn from data

When we yearn for data, do we want any loosely-related data or deep data?

One day, when I was about 8 years old, our science teacher taught us about density. She had us sit in groups of four, placed three bottles in front of each group, and gave each of us a plastic cup. We must have begun reaching for the wrong bottles, because she immediately yelled that we had to listen to her carefully, otherwise the experiment would not be fun.

First, she said, we needed to pour a little bit of the brown-looking liquid, called syrup, into our cup. Then, we had to pour the clear-looking liquid (water). Finally, we had to pour the yellow stuff (oil).

About 15 minutes, several arguments over bottles and a few mishaps later, my group and I were staring at our cups in silence, unable to hear what was going on around us, the rest of the day's activities gone from our minds.

Image courtesy of www.csiro.au

Unfortunately, at the time, I didn't get the density part of the whole experiment. What I remember finding most interesting was simply the sight of three different liquids, chilling very separately in one cup. Before this, I had thought all liquids mixed together into one, with maybe a change in flavor or color. But that day I learned that just because something looks watery does not mean it will behave like water.

I cannot help but think of this story when I try to understand the desire for data today. It is a lot like mixing different things together in the hope of getting a very clear picture, but sometimes losing the whole truth in the process.

Digging for data

Globally, there is a strong drive towards opening up data. But data, on its own, is not a narrative. It does not tell any stories about where it came from, why or how it was collected, who was involved and why it is important. But put it together in a certain way and provide an explanation and it becomes a very powerful tool for learning.

This is common knowledge. Any student of mathematics will tell you that numbers (or any data, for that matter) are useless without a story.

But how deep does the story go? What exactly can we learn from data? Different data – qualitative or quantitative – will tell different things. Some data will be used to compose "light" stories, usually involving straight counts and frequencies. Other data will be used to compose more complex stories, indicating reasons for certain counts and tendencies over time. Everyone working with data today needs to be clear about what kinds of stories they would like to tell, and then go and dig for the data. Otherwise, they may find a lot of data out there that seems like it will help, but takes them in misinformed directions.

Take, for instance, Tanzania's recent grievance with CNN over its reporting on Lake Nyasa/Malawi. Apparently CNN needs to get its facts straight when it claims that the lake belongs to Malawi. But the claim could easily be justified if CNN provided its sources of data. Nowhere in this news dispute, or in the border dispute in general, is it clearly stated which data source the world should rely on as official.

The salient surface

Telling data stories is a lot like telling real stories, except that with data you tend to deal with numbers instead of facts. Facts can be played with to present a very compelling but shallow account of a sequence of events. Similarly, numbers can be pulled together to sound as if something has been well researched. But without context, both facts and numbers can be used in virtually any way.

Consider a statistic (coincidentally, a fact that involves numbers) that I mentioned in my last post: "The Tanzanian community online grew from less than 1% of the population in 2004 to 11% in 2010". That blogpost went on to argue that the majority of Tanzanians are still offline and that we should focus tech development around their needs. The statistic alone raises many questions. How does that 11% of the population actually connect to the Internet? Were those one-time hits or daily users? Of those accounted for in the 11%, are they registered voters, or does the figure include non-registered citizens? And how does a ten-percentage-point increase in Internet users compare with the growth in radio listeners since radio's inception in Tanzania?

Those are some questions that might tell a deep story about Internet usage in Tanzania. The statistic I dropped to support my position was what I like to call the salient surface – it is shallow data that has already been processed by people unknown to me (in this case, folks sitting somewhere between the National Bureau of Statistics, the World Bank and others, via Google). But it is also the attractive, easy-to-read, repeatable data. There is a ton of deeper data sitting behind it that should be considered if one is serious about understanding Internet usage in Tanzania.

Meta treatment

I'm not here to self-criticize just to take up your reading patience and bandwidth. But what this example points to is the increasing importance of metadata*, which, in short, is information about data. An example of metadata is on your nearest library-borrowed book: its title, author, genre, number of pages and possibly other "labels" are archived in the library's borrowing system, while the data itself is the text you read inside. Since it is not possible for your librarian to copy the whole book into your account, he or she just stores the book's metadata, because that specific combination of labels is very likely unique to your book.
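The library example can be sketched as a simple record. This is only an illustrative sketch: the field names, the ISBN placeholder and the `describe` helper are my own assumptions, not any real library system's schema.

```python
# A hypothetical library catalogue record: the metadata describes the
# book without containing any of its actual text (the data itself).
book_metadata = {
    "title": "The Wealth of Nations",
    "author": "Adam Smith",
    "genre": "Economics",
    "pages": 950,                     # illustrative value
    "isbn": "978-0-00-000000-0",      # placeholder, not a real ISBN
}

def describe(record):
    """Identify a book from its metadata alone, as a borrowing system might."""
    return f'"{record["title"]}" by {record["author"]} ({record["pages"]} pp.)'

print(describe(book_metadata))
```

The point of the sketch is that this small combination of labels is usually enough to pick out one specific book, so the system never needs the book's contents at all.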

An example of metadata for images. Courtesy of Dina Mohamed’s blog.

What I find important about metadata can be said in three ways:

For data diggers – people writing articles in the media, on blogs, for research, etc. – it is becoming more and more valuable to understand where data comes from and how it was collected. Doing so provides perspective on the evidence used within narratives. It may also help in formulating hypotheses, that is, figuring out where proposed research questions will lead, since a hypothesis provides a natural benchmark on which to anchor data searches.

For data collectors – people who archive the raw data itself for others to use – tagging data comprehensively goes a long way in getting the data understood. I would argue that half the work lies in giving as precise a picture as possible, with the use of metadata, about the process, reason and situation for which the data was collected.

For people who do a bit of both – metadata is a friend. It will help categorize and index records and, more importantly, it will help the future understand the depth of work done. There is a lot of learning value there. We can either learn salient surface level stuff, or dive into the deep-end of reason. And personally, I think progress rests on the latter.

A quick example of metadata can be seen on this very blogpost itself. It has been categorized (at the top) as an English and Print post, and tagged (at the bottom) as a post relating to Data, Education and Science. These are the tangible, in-your-face metadata that go hand-in-hand with the other themes explored on the blog, but there is a lot more behind the scenes that is used to track this post down, such as its unique post ID number in a database, the date and time it was published, the number of edits that took place before publication, etc.
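The same idea can be sketched in code. The categories and tags below are the ones named above; the post ID, timestamp and revision count are hypothetical stand-ins for the behind-the-scenes values a blogging platform would store, not the actual ones for this post.

```python
from datetime import datetime

# Visible metadata (categories, tags) taken from the post itself;
# hidden fields (ID, timestamp, revisions) are hypothetical examples
# of what a blogging platform might track behind the scenes.
post_metadata = {
    "post_id": 1842,                        # hypothetical database ID
    "published_at": datetime(2012, 8, 20),  # hypothetical publication date
    "revisions": 7,                         # hypothetical edit count
    "categories": ["English", "Print"],
    "tags": ["Data", "Education", "Science"],
}

# Split the record into the "in-your-face" metadata readers see and
# the bookkeeping metadata only the system uses.
visible = post_metadata["categories"] + post_metadata["tags"]
hidden = {k: v for k, v in post_metadata.items()
          if k not in ("categories", "tags")}

print(visible)
print(sorted(hidden))
```

Both halves describe the same post; the difference is only in who consumes them, readers or the database.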

In the end, I would only wish that we tell and hear stories that make sense because they somehow relate to the world. No point in hearing a story without the story, is there?

* "Meta" was originally a Greek prefix signifying "after", "beside" or "among". Aristotle used the word "Metaphysics" to refer to the book he wrote "after" his book on physics. "Meta" was later misinterpreted to mean "beyond" or "above" physical entities, and it arrived through the ages into English, where it is now commonly understood to refer to anything beyond the physical. In "metadata", the data is considered to be the physical object in question, while the "meta" can be considered the non-physical labels that describe it.


Al-Amin founded Vijana FM in 2009. With over a decade of experience in communications, design and operations, he now runs a digital media consulting agency - Lateral Labs - in Dar-es-Salaam.

