Time and time again, I encounter instances where data was not explored properly. That’s all we do with data, though, so how is this possible!? By not exploring your data, you run the risk of making ill-informed decisions, which could be expensive. That said, you have things to do, and chasing every rabbit hole in your data can be expensive too. So how do you balance the exploration you invest in against the fast-paced need for information?
The level of data exploration needed scales with the impact the data will have on your organization. High-impact decisions require more extensive analysis, whereas a low-impact decision may only require a quick level set against other data sources.
Grab a cup of coffee, and let’s chat through some details and guidelines to help you define a process that works for you.
Why Does Exploring Your Data Matter?
I particularly get this question when I’m chatting with someone not doing any heavy statistical analysis on the dataset. Check out this video for some examples of why this matters, and then read on for more detail around exploration.
Data analytics where you’re breaking out regression analysis or a customer segmentation starts with a process called Exploratory Data Analysis (EDA). However, even quick pulls of data can be prone to mistakes. By not exploring your data, you run the risk of having data quality issues impact your decision. Have you ever run across an instance where data didn’t load overnight? Or a partial data load? Even data you use daily can cause you hiccups if you don’t gut-check it. Exploring your data can be as simple as interacting with it often, but that is NOT fool-proof.
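To make the overnight-load example concrete, here is a minimal sketch of a gut-check on row counts. The function name, the 50% tolerance, and the row counts are all made up for illustration; you would tune them to your own data.

```python
from statistics import median

def looks_like_partial_load(todays_rows, recent_row_counts, tolerance=0.5):
    """Return True when today's row count falls well below the recent norm.

    tolerance is an assumed cutoff: 0.5 flags anything under half
    of the typical (median) daily row count.
    """
    typical = median(recent_row_counts)
    return todays_rows < typical * tolerance

# Hypothetical row counts from the last five daily loads
recent = [10120, 9980, 10250, 10040, 9890]

print(looks_like_partial_load(0, recent))      # nothing loaded overnight
print(looks_like_partial_load(4300, recent))   # load stopped partway
print(looks_like_partial_load(10100, recent))  # an ordinary day
```

The median is used instead of the average so that one earlier bad load does not drag the "typical" value down with it.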
What is Exploratory Data Analysis?
Let’s start here. It is also referred to as EDA, and essentially, it is exploring the data (usually with visuals) before you apply any analysis to it. It’s the step that comes before applying models in data analytics or data science work.
Exploratory Data Analysis (EDA) is a formal step in the process of analyzing data. It is done to understand the characteristics of variables and how they interact with each other, and to prepare the data for use in a statistical model.
This means looking at the type of variables (numbers, text, etc), missing values, crazy outliers that are skewing things, and more. This is the step that takes raw data into a format ready for statistical analysis. This doesn’t mean you’re expected to do all of this necessarily. That’s the point of this article. When do you break out the heavy review versus something lighter?
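As a rough sketch of that first pass, the snippet below profiles a single column for the types it contains, how many values are missing, and which numbers look like outliers. The `profile_column` helper and the sample values are hypothetical, and the 1.5 × IQR fence is just one common rule of thumb for outliers, not the only option.

```python
from statistics import quantiles

def profile_column(values):
    """Summarize one column: value types seen, missing count, IQR outliers."""
    present = [v for v in values if v is not None]
    types = sorted({type(v).__name__ for v in present})
    # Keep real numbers only (bool is a subclass of int, so exclude it)
    numbers = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    outliers = []
    if len(numbers) >= 4:
        q1, _, q3 = quantiles(numbers, n=4)   # quartile cut points
        fence = 1.5 * (q3 - q1)               # rule-of-thumb outlier fence
        outliers = [v for v in numbers if v < q1 - fence or v > q3 + fence]
    return {
        "types": types,
        "missing": len(values) - len(present),
        "outliers": outliers,
    }

# Hypothetical daily widget counts with one gap and one suspicious spike
daily_widgets = [48, 52, 50, None, 49, 51, 150, 50]
print(profile_column(daily_widgets))
```

A column that reports more than one type, or an unexpected number of missing values, is usually worth a closer look before any analysis.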
When Should You Use Exploratory Data Analysis?
If you plan to break out your data science prowess or have your staff do this, you need to pull out all the stops on your data. IBM has a great resource on this here if you want the high-level but very data science view. That said…
Exploratory Data Analysis should be performed if statistical modeling is the goal or if the impact to your business is significant.
Yes, I would pull out all the stops if you are using a metric or set of metrics that could significantly impact your bottom line. By this, I mean a change to how your users interact with your company or how your staff is treated.
Hello, data exploration. So for a statistical analysis, you’re looking for characteristics in the data to know which tools you can use. Null values, text versus true/false versus numbers, and a lot more. Each tool in the data scientist’s tool belt comes with certain assumptions, and you run through EDA to check whether your data meets them.
If you’re reading this, though, you’re likely not a data scientist. 😉 What kind of analysis am I talking about if you’re not looking at running models on the data?
I hate the ‘it depends’ answer, but it is highly dependent on what you are looking at. Keep this in the back of your mind: the more visual you can make your analysis, the more likely you are to catch hiccups.
- At a minimum, I’d want to know the average, median, minimum, maximum and spread of the values even remotely touching what I’m measuring.
- How do the values change over time?
- How do the variables that matter relate to each other? (Are you going to swing another metric in an undesirable direction?)
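The three checks in the list above can be sketched with nothing more than Python’s standard library. Everything here is illustrative: the `summarize`, `day_over_day`, and `pearson` helpers are names I made up, and the widget and defect numbers are invented.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Average, median, minimum, maximum, and spread (sample std dev)."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "stdev": stdev(values),
    }

def day_over_day(values):
    """Change from each period to the next, in the order given."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: widgets built per day and defects found per day
widgets = [48, 52, 50, 49, 51]
defects = [2, 4, 3, 2, 4]

print(summarize(widgets))
print(day_over_day(widgets))
print(round(pearson(widgets, defects), 3))
```

In this made-up data the correlation is strongly positive, which is the kind of hint that pushing one metric up (more widgets) might swing another (defects) the wrong way. Correlation alone does not prove that, but it tells you where to dig.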
In addition, whatever you are hoping to accomplish, what is your risk? If you totally muck up what you’re trying to do, what is the fall-out? This will help you decide how to test the changes you’re thinking through.
What Data Exploration is Needed for Lower Impact Activities?
I’ve hinted in this post that if you work with data day-to-day, you might not need to know much more than the general swings you typically experience. If you normally make 50 widgets a day, and suddenly 150 pops up, you should have an idea that you need to look into that. Crazy data quality or perhaps a spike because of some inventory that had gotten misplaced? These instances are more of a gut-check.
Now let’s say your team has produced their 50 widgets, and it’s noon. Could you let them off early for a morale boost or divert the resources to completing a training they need? There is a cost here in the resource, so there is a bit of an impact. Did they actually already complete the 50 widgets? Might want to double-check that nothing looks out of whack in the data before you make that call.
Do you get a sense that you’re stepping up what you look at between scenario 1 and 2? You know the data well. No decision was actually being made in scenario 1, so it serves as a way to understand what is going on in the data before you get to a decision point. In the second scenario, you’re talking about taking advantage of downtime or losing the productivity for morale. This is a bottom line impact, so a quick check for accuracy is warranted.
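If you wanted to automate that gut-check, a minimal sketch might look like the following. The `needs_a_look` name, the history values, and the threshold of three standard deviations are assumptions for illustration, not a rule.

```python
from statistics import mean, stdev

def needs_a_look(todays_value, history, threshold=3.0):
    """Flag a value that sits far outside the usual day-to-day swings.

    threshold is an assumed cutoff, measured in standard deviations.
    """
    avg = mean(history)
    spread = stdev(history)
    if spread == 0:
        # History never varies, so any difference at all is unusual
        return todays_value != avg
    return abs(todays_value - avg) / spread > threshold

# Hypothetical history for a team that normally makes about 50 widgets a day
history = [50, 48, 52, 51, 49, 50, 50]

print(needs_a_look(150, history))  # the sudden 150
print(needs_a_look(52, history))   # a normal swing
```

A flag here does not tell you whether the 150 is bad data or found inventory. It only tells you it is time to go ask.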
As you increase in impact to your business, you’d want to increase the checks and cross-referencing at a minimum. You know your business, so use your experience to guide the amount of time you or your staff devotes to this type of activity!
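The simplest form of cross-referencing is checking that two independent sources tell the same story. Here is a small sketch, assuming you have a widget count from a production report and another from inventory scans. Both sources, the `sources_agree` helper, and the numbers are hypothetical.

```python
def sources_agree(primary_count, secondary_count, tolerance=0):
    """Return True when two counts match within an allowed difference."""
    return abs(primary_count - secondary_count) <= tolerance

# Hypothetical counts for the same day from two different systems
line_report = 50      # what the production report says was built
inventory_scan = 46   # what was actually scanned into inventory

print(sources_agree(line_report, inventory_scan))               # exact match required
print(sources_agree(line_report, inventory_scan, tolerance=5))  # small gap allowed
```

How much of a gap you tolerate is a judgment call that should tighten as the impact of the decision grows.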