Data Analytics
Last Updated: 14 Feb 2022
In contrast to data science, which designs and explores the models and algorithms we have for analyzing data, data analytics is about using data to solve problems. This includes additional steps like considering your stakeholders, communicating ideas, and working under a deadline. If you want to read about machine learning, check out the introduction to machine learning post.
This post is a collection of notes and lessons from the Google Data Analytics Certificate.
The Lifecycle of Data
Why do we discuss the lifecycle of data? Because data analytics is not just about obtaining results; it’s about collecting, managing, and then closing out a process.
- Plan
- Capture
- Manage
- Analyze
- Archive
- Destroy
Each step comes with its own unique set of challenges. For example, how do you collect and store data securely? How do you use it ethically? How can you account for bias?
Setting up the Problem
The best results come from a well-defined question. This means it should be:
- Specific
- Measurable
- Action-oriented
- Relevant
- Time constrained
And finally, it’s also worth considering the type of problem. Which of these are you trying to do?
- Predict
- Categorize
- Detect outliers
- Identify themes
- Discover connections
Different objectives may lead to different problem setups and approaches. The better you understand what you are trying to do, the more tailored you can make your response.
Types of Data
- Nominal vs Ordinal
  - Nominal - Choices/responses that don’t have a particular order (e.g. yes/no/maybe).
  - Ordinal - Data that has an associated order (e.g. a scale or ranking).
- Internal vs External (i.e. who owns the data? Does it come from outside your organization?)
- Continuous vs Discrete
- Quantitative vs Qualitative
- Structured vs Unstructured (e.g. survey responses vs pictures)
- Primary vs Secondary
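The nominal vs ordinal distinction matters in practice because it decides which operations make sense. Here is a minimal sketch using only the Python standard library. The responses and the `RANK` mapping are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration).
nominal = ["yes", "no", "maybe", "yes", "yes"]
ordinal = ["low", "high", "medium", "low", "high"]

# Nominal data: counting categories is meaningful, ordering them is not.
nominal_counts = Counter(nominal)
most_common_answer = nominal_counts.most_common(1)[0][0]

# Ordinal data: an explicit rank makes sorting and medians meaningful.
RANK = {"low": 0, "medium": 1, "high": 2}
ordinal_sorted = sorted(ordinal, key=RANK.__getitem__)
median_rating = ordinal_sorted[len(ordinal_sorted) // 2]
```

For the nominal column the most you can reasonably ask is "which answer is most common?", while the ordinal column also supports questions like "what is the middle rating?".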
Types of Data Modeling
These types of data modeling are actually pretty universal. The DoD Architecture Framework (DoDAF) also includes the following three models, known as DIV-1, DIV-2, and DIV-3 respectively.
- Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
- Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
- Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
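To make the difference between the logical and physical levels concrete, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables are hypothetical. The logical model would only say "each customer is uniquely identified and an order belongs to one customer", while the physical model below commits to actual table names, column names, and data types.

```python
import sqlite3

# Physical model: concrete table names, column names, and data types.
# This schema is hypothetical, chosen only for illustration.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- logical rule: each record is uniquely identified
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total_cents INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Inspect the physical model: column names and declared types of `orders`.
# PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(orders)")]
```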
Data Integrity
Data integrity is something to keep in mind through the entire data analysis process. Data with integrity means that it is accurate, complete, consistent, and trustworthy.
One common pitfall is to have biased data. This can occur if you’re using a survey to collect data and ask leading or vague questions. Some examples (the first is leading, the second is vague):
- Isn’t it true that A had a negative effect on B?
- What’s going on with A?
Other common pitfalls of the data cleaning process:
- Overlooking missing values
- Only looking at a subset
- Losing track of objectives
- Not fixing the root of the issue
- Not analyzing the system
- Not backing up your data prior to cleaning
- Not accounting for cleaning time in budgeting
- Not checking for spelling errors
- Forgetting to document errors
- Not checking for misfielded values
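A few of these pitfalls are cheap to guard against in code. The sketch below uses pandas on a tiny, made-up table. The column names and the `VALID_CITIES` list are assumptions for illustration, and a real project would need checks tailored to its own data.

```python
import pandas as pd

# Hypothetical raw data with a missing value, a duplicate row, and a misspelling.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "city": ["Boston", "Chicago", "Chicago", None, "Bostn"],
})

backup = raw.copy()                      # back up before cleaning
missing_per_column = raw.isna().sum()    # don't overlook missing values
duplicate_rows = int(raw.duplicated().sum())

# Spelling check: compare against a known list of valid values (assumed here).
VALID_CITIES = {"Boston", "Chicago"}
suspect = raw[raw["city"].notna() & ~raw["city"].isin(VALID_CITIES)]

# Document what was found instead of silently fixing it.
cleaning_log = {
    "missing_city": int(missing_per_column["city"]),
    "duplicate_rows": duplicate_rows,
    "suspect_spellings": suspect["city"].tolist(),
}
```

Keeping something like `cleaning_log` around covers the "forgetting to document errors" item, and it makes it easier to trace a problem back to its root cause later.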
Speaking from personal experience, another easy-to-miss aspect of data integrity is redundancy. If you’re managing a large and diverse set of tables, you’ll want to do your best to map out the relationships and clean up any overlap. For example, there should be a single source of truth which maps customer IDs to their emails. It’ll help in the long run to have this single table which you query from, rather than a set of unlinked tables which may all duplicate the same information.
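As a rough illustration of why this pays off, here is a sketch with Python's built-in `sqlite3` module. The `customers` and `tickets` tables and their contents are hypothetical. Because the email is stored in exactly one table, a single update is reflected everywhere it is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE tickets   (ticket_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
INSERT INTO customers VALUES (1, 'old@example.com');
INSERT INTO tickets VALUES (10, 1), (11, 1);
""")

# The email lives in exactly one place, so one UPDATE fixes it everywhere.
conn.execute("UPDATE customers SET email = 'new@example.com' WHERE customer_id = 1")

# Other tables store only the customer ID and join to look up the email.
rows = conn.execute("""
    SELECT t.ticket_id, c.email
    FROM tickets AS t
    JOIN customers AS c ON c.customer_id = t.customer_id
    ORDER BY t.ticket_id
""").fetchall()
```

If `tickets` had carried its own copy of the email, the update would have needed to touch every copy, and any copy that was missed would quietly go stale.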
Data Analysis
This is a bit tough to describe, because the analysis you perform is going to be different depending on your application. But in general, you’ll have a few toolsets you can use. For smaller datasets, Excel is perfectly sufficient. For anything larger, you’ll want to look at SQL, R, or another programming language such as Python. Personally, I like using Python with the pandas library.
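To give a flavor of what the Python and pandas route looks like, here is a minimal sketch of a group-and-summarize step. The sales figures are made up for illustration, and a real analysis would depend entirely on your question.

```python
import pandas as pd

# Hypothetical sales records, made up for illustration.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "revenue": [100.0, 250.0, 150.0, 50.0, 200.0],
})

# Total and average revenue per region.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])

# Region with the highest total revenue.
top_region = summary["sum"].idxmax()
```

The same summary could be written as a pivot table in Excel or a `GROUP BY` query in SQL, so the choice of tool mostly comes down to data size and what your team already uses.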
Data Visualization
Once you’ve analyzed your data and drawn conclusions, the next step is to turn your insights into actions, which will require you to communicate your ideas.
Principles of good data visualization:
- Information
- Story
- Goal
- Visual Form
You can also review your presentation with these questions:
- What is the practical question?
- What does the data say?
- What does the visual say?
Finally, make sure to consider the use case. For instance, if we need to monitor new data as it comes in, an interactive dashboard with the proper backend support would be best (e.g. Tableau). However, if we wanted historical data, a dashboard could be too complex and not worth the effort.