When you do your laundry, you fold the clean clothes and put them into a specific drawer: folded socks go on the left, folded T-shirts on the right. When it comes to data, however, marketers have gotten the hang of “folding” but struggle with organizing the data “in the drawer.”
The laundry challenge marketers face is parsing and cleaning marketing data, which is often messy in format. The data comes from various sources and is usually unstructured, and missing values are often hard to spot at first glance. As customer interactions with brands generate more diverse data, marketers feel pressed to establish a data structure that can reveal contextual insights and accurately defined customer segments.
Enter a data technique called “tidy data.” Tidy data (a term coined by Hadley Wickham) involves mapping a dataset’s structure to its meaning. Analysts align rows, columns, and tables (dataset structure) with observations, variables, and types (dataset meaning). The payoff of tidy data is a clearer view of the variables, because the dataset’s structure is made explicit. That in turn supports a range of improvements to the systems that rely on the data, from spotlighting bad data to feeding an analytic model such as a machine learning model.
To achieve a tidy dataset, analysts arrange data into a dataset based on three key principles:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
The arrangement may require melting data (turning column headers that are really values into rows) and splitting columns that hold more than one variable, but the end result should be a simplified table in which every observation and every variable is easy to locate.
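To make the melting step concrete, here is a minimal sketch in plain Python. The `melt` helper, the campaign names, and the click counts are all hypothetical and invented for illustration; they do not come from any real dataset or library. The sketch assumes a "wide" table in which each marketing channel has its own column, and reshapes it so that channel becomes a variable with one observation per row.

```python
def melt(rows, id_keys, var_name, value_name):
    """Reshape wide rows into long rows.

    Every column that is not listed in id_keys becomes its own row,
    with the column name stored under var_name and the cell value
    stored under value_name.
    """
    tidy = []
    for row in rows:
        ids = {key: row[key] for key in id_keys}
        for key, value in row.items():
            if key in id_keys:
                continue
            tidy.append({**ids, var_name: key, value_name: value})
    return tidy


# Hypothetical wide table: the channel names are stored as column headers.
wide = [
    {"campaign": "spring_sale", "email": 120, "social": 340},
    {"campaign": "fall_promo", "email": 95, "social": 410},
]

# After melting, each row is one observation: a campaign, a channel, a count.
long_rows = melt(wide, id_keys=["campaign"], var_name="channel", value_name="clicks")

for row in long_rows:
    print(row)
```

In the long form, "channel" is a proper column, so filtering or grouping by channel no longer requires knowing the column names in advance.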
The tidy data structure benefits analysts and computer models alike by making the data easier to view, which supports repeatable and reliable processes. Wickham noted that tidy data is especially suited to programming languages that rely on vectors, such as R (explained in my earlier post here).
Here’s a quick example of how tidy data should work. Suppose we had a dataset that contained vehicle make, model, engine specification, and the gender of the vehicle owner. Everything is labeled as we want, but the data needs a few tweaks.
The listing in Figure 1 occasionally shows both a male and a female owner for a vehicle model, or two owners of the same gender for the same model. The dataset also combines two aspects of the engine spec, its size and its type, in a single column. Such overlaps may confuse a program query that calculates, say, the percentage of owners of a given gender, or other nuanced calculations.
Figure 2 shows a tidy data arrangement. Each row displays one observation, and the variables in the dataset are labeled. Note how the vehicle make and model can repeat. We have also split the engine size and type into separate columns. The end result is a dataset that a program can readily access for automated processes.
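The same cleanup can be sketched in code. The example below is only an illustration under stated assumptions: the makes, models, engine values, and owner entries are invented placeholders rather than the contents of the figures, and it assumes the messy listing stores multiple owners as a comma-separated cell and the engine spec as a string like "2.0L I4". The `tidy_vehicles` function is a hypothetical helper, not part of any library.

```python
# Hypothetical messy rows: two owners share one cell, and engine size
# and engine type share one column.
messy = [
    {"make": "Acme", "model": "Roadster", "engine": "2.0L I4", "owner_gender": "M, F"},
    {"make": "Bolt", "model": "Hauler", "engine": "3.5L V6", "owner_gender": "M, M"},
]


def tidy_vehicles(rows):
    """Return one row per owner, with engine size and type in separate columns."""
    tidy = []
    for row in rows:
        # Split "2.0L I4" into the size ("2.0L") and the type ("I4").
        size, engine_type = row["engine"].split(" ", 1)
        # Emit one observation per owner listed in the cell.
        for gender in row["owner_gender"].split(","):
            tidy.append({
                "make": row["make"],
                "model": row["model"],
                "engine_size_l": float(size.rstrip("L")),
                "engine_type": engine_type,
                "owner_gender": gender.strip(),
            })
    return tidy


tidy = tidy_vehicles(messy)

# With one owner per row, a population share is a simple count.
share_f = sum(r["owner_gender"] == "F" for r in tidy) / len(tidy)
print(share_f)
```

Make and model repeat across rows, as in the tidy arrangement described above, and that repetition is what lets the share of owners by gender be computed without special-casing rows that hold two owners.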
In fact, the problems encountered with messy datasets usually arise from how columns and variables are identified and managed. Helping data scientists and analysts agree on how a data structure is to be mapped can alleviate many of these challenges.
So how can a marketer best use tidy data, even if the marketer is not data savvy?