What follows is a case study that includes “Documentation of any cleaning or manipulation of data” as part of the deliverables so I’ve included all my step-by-step details.
I’ll start off with R then I’ll duplicate the process using SQL, and finally Tableau. Other analysis done in Python is not included and covered in other posts.

WELLABEAT CASE STUDY

Founded in 2014, Wellabeat is the company that developed one of the first wearables specifically designed for women and has since gone on to create a portfolio of digital products for tracking and improving the health of women.
Focusing on creating innovative health and wellness products for women, their mission is to empower women to take control of their health by providing them with technology-driven solutions that blend design and function. Giving women the tools to reach their fullest potential through personalized wellness solutions aligned with their cycles. Here are the products that WellaBeat offers:

  • Wellabeat app: The Wellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Wellabeat app connects to their line of smart wellness products.
  • Leaf: Wellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Wellabeat app to track activity, sleep, and stress.
  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Wellabeat app to provide you with insights into your daily wellness.
  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Wellabeat app to track your hydration levels.
  • Wellabeat membership: Wellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Here are some of the metrics that the app tracks based on information found on their site:

  • meditation rate
  • starts tracking at “first step”
  • starts tracking at “first shut eye”
  • ovulation cycle, temperature, pregnancy term….
  • cardiac coherence!!??
  • stress tolerance, once again is based on heart rate and temperature levels?
  • activity level

Purpose

  1. The goal of this project is to analyze Wellabeat’s provided data to reveal more opportunities for growth.
  2. Focus on one Wellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart non-Wellabeat devices.
  3. Then, using this information, make high-level recommendations for how these trends can inform Wellabeat marketing strategy.
  4. My report will be directed to the executive team:
    • Urška Sršen: Wellabeat’s cofounder and Chief Creative Officer
    • Sando Mur: Mathematician and Wellabeat’s cofounder, key member of the Wellabeat executive team

Deliverables

I will produce a report with the following deliverables:

  1. A clear summary of the business task.
  2. A description of all data sources used
  3. Documentation of any cleaning or manipulation of data
  4. A summary of my analysis
  5. Supporting visualizations and key findings
  6. Top high-level content recommendations based on the analysis

Business task

The executive team believes that analyzing smart device fitness data from other products on the market could help unlock new growth opportunities for the company. I have been asked to focus on one of Wellabeat’s products and analyze external smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company.

Scope/major activity

data collection

  • FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
  • It is believed that the provided dataset has limitations.

identify one product

After reviewing WellaBeats site I realized that the  information provided by the company differs from what is actually displayed on the current site. I’ll use of the following products to focus the analysis. All of the listed products provide the same data to the user, the varying features are the types of devices they  are:

    • Ivy+ – wrist wearable tracker
    • Leaf Urban – could be worn as necklace, in a wrist band
    • Leaf Chakra -could be pinned to a collar, blouse, placed in wrist band

create recommendations

  • What are some trends in smart device usage?
  • How could these trends apply to Wellabeat customers?
  • How could these trends help influence Wellabeat marketing strategy?

Ask

There are many questions I have that went unanswered;

  1. Why are we using data from 2016?
  2. The data provided appears to target the fitness/workout community: steps per day, workout type, workout duration… while WellaBeat appears to be more of mental wellness for pregnant women!?
  3. Is the company thinking about branching into that highly competitive domain?
  4. Why doesn’t WellBeat provide its own data for analysis?
  5. Does WellBeat collect data?
  6. Does WellBeat know how to track the metric most used or viewed by its users?
  7. Does WellBeat know how to track pages in the app that are most viewed by its users?
  8. Why isn’t the science behind the claims described on the site more transparent? Being open with the science behind the metrics might create a loyal customer base!
  9. WellaBeat claims it targets women, does WellBeat know exactly how many of its users are women?
  10. Does WellaBeat know the age range or concentration age of its user?
  11. The site makes it obvious that the target market are pregnant women, does WellaBeat have any idea how many of their users are/were/will be pregnant?

Prepare

data

  • Data is old: 3/2016 to 5/2016
  • Covers 30 users of FitBit Fitness tracker
  • Period of 62 days
  •  After reviewing Wellabeat’s site it appears that the data provided is geared more towards fitness rather than monitoring what Wellabeat monitors
  • Workouts metric used by WellaBeat is in common with the data provided
  • Activity level and calories burned are additional common metrics
  • One very important metric in the  data is the manual weigh-ins that’s not been tracked by WellaBeat
  • The data provided shows activity levels, distance walked, calories burned…. are those manually triggered by the user or does the product continuously collect data? What I need to know is if the user actually turns on the data tracking manually or does the app track continually. I’d like to know how invested are the users. In other words if the app tracks all that information automatically, then how do we know if the user pays attention, reviews the data or makes life decisions based on what’s been tracked? In other words, is the data provided for ananlysis helping any of the users, or just metrics that the user doesn’t review. If the user does review it then how do we know how many of the users actually review the data, how helpful are these metrics to the user? Are we solving a problem? I have a nike app on my phone I used 2-3 times to track my running path, length and distance. I haven’t used it in years and yet it still tracks my every day movements! That’s data it collects that I never look at or even use the app.
  • Data doesn’t mention the gender, race, or age of the 30 users
  • Activity level measured in the app? how is that actually calculated? is the app linked to an exercise device?
  • Does the app track the distance traveled by the phone and if the movement is greater than a specific speed it assumes it’s a hight intensity workout?! ….
  • With all these questions, it’s obvious that there is a clear bias in the data

PROCESS WITH R

Note: I will be using R, SQL, and Tableau for processing and analysis, most of the steps done in R will be repeated in SQL

Cleaning with R

read data

#______________________________INSTALL PACKAGES
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor") install.packages("ggplot2")
#______________________________LOAD PACKAGES
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(stringr)
#______________________READ DAILY ACTIVITY DATA FROM MONTHS 3 TO 4 OF 2016
dailyactivity_df_3_4 <- read_csv("dailyActivity_merged_3_4.csv")
head(dailyactivity_df_3_4)
#_______________________FILTER TO RELEVANT COLUMNS
activity_3_4 <- dailyactivity_df_3_4 %>%
   select(Id,ActivityDate, TotalSteps, TotalDistance, Calories)
#________________________CONVERT FROM <char> TO <date> AND GROUP BY Id
edited_activity_3_4 <- activity_3_4 %>%
   mutate(mDate = mdy (ActivityDate)) %>%
   group_by(Id) %>%
   arrange(mDate, .by_group = TRUE)
View(edited_activity_3_4)
OUTPUT
# TOTAL 457 ROWS
observations

The obvious path would be to visualize the correlation between all the metrics collected in the data such as calories burned vs workout type, workout length, workout load type… but in my opinion that would be a waste of time and is irrelevant to our business task. It’s been known for centuries what the relationship is between working out, workout load, time under stress, and calories burned.
Each variable contributes directly to calories burned, we are not doctors nor would 30 users monitored for 62 days change the science, so I will not spend time on the majority of the data provided.
My focus will be on UX

  • Does the product provide a solution to the users?
  • Is there a benefit to the user for using the product?
  • There is one metric that’s very important and displays active interaction with the users and that’s manual weigh-in
  • For the user to take the time of getting weighed, and then manually enter the information in the app is extremely important. It’s possible that the app used to collect the data could be linked to a bluetooth or wifi enabled scale and automatically logs in the weight once a value is read!?
  • That would be a great service/device that could be a focus for WellaBeat.

read data

#_____________________READ WEIGHT LOGIN FROM 3 TO 4 OF 2016
weightloginfor_df_3_4 <- read_csv("weightLogInfo_merged_3_4.csv")
#Convert the Date column from <chr> to <date> and groupby ID just like we did earlier
#it appears that most of the Dates are displayed as 11:59:59 PM so we have to assume that maybe the system
#purges at midnight so let's edit the date to drop the hms.
#It appears that when we drop the hms we get NA so we'll keep it as
#mdy_hms and convert from date-time to date with as_date() after we convert the <chr> to datetime
edited_weighin_3_4 <- weightloginfor_df_3_4 %>%
     select(Id, Date, WeightPounds, BMI) %>% 
     mutate(mDate = as_date(mdy_hms (Date))) %>% 
     group_by(Id) %>% 
     arrange(mDate, .by_group = TRUE)
View(edited_weighin_3_4)
OUTPUT
# TOTAL 33 ROWS
observations
  • What’s interesting is that 33 out of 30 users logged in to the manual weight-in page for March. Do we assume some are duplicates?
  • Maybe they created multiple accounts?
  • Is it a requirement for all users to login to the weight-in page before the app starts tracking and making calculations?
  • Are the calories burned calculations based on weight of the user?
  • Is it possible that once the user installs the app and create an account it automatically shows that they logged in to the weigh-in page?
  • I need to dive into this data more and see how many of the users may have entered their weight and how many just logged in and never entered their weight.
  • Is this data from newly enrolled users? or
  • Was the data extracted from a random timeframe?

merge data

#___________________________MERGE DAILY ACTIVITY AND WEIGHT-IN FOR MARCH _______________________
merged_3_4 <- left_join
              (edited_activity_3_4,
              edited_weighin_3_4,
              by =c("Id","mDate"))
View(merged_3_4)
OUTPUT
# TOTAL 457 ROWS
#_____________________REPEAT ABOVE PROCESSES FOR DAILY ACTIVITY DATA FROM 4 TO 5 OF 2016_________
dailyactivity_df_4_5 <- read_csv("dailyActivity_merged_4_5.csv")
#_____________________FILTER DATASET
activity_4_5 <- dailyactivity_df_4_5 %>%
   select(Id,ActivityDate, TotalSteps, TotalDistance, Calories)
#_____________________CONVERT DATE COLUMN
edited_activity_4_5 <- activity_4_5 %>% 
   mutate(mDate = mdy (ActivityDate)) %>% 
   group_by(Id) %>% 
   arrange(mDate, .by_group = TRUE)
View(edited_activity_4_5)
OUTPUT
# TOTAL OF 940 ROWS
#_____________________READ WEIGHT LOGIN FROM 4 TO 5 OF 2016_________
weightloginfor_df_4_5 <- read_csv("weightLogInfo_merged_4_5.csv")
#_____________________FILTER DATASET AND CONVERT TOGETHER
edited_weighin_4_5 <- weightloginfor_df_4_5 %>%
   select(Id, Date, WeightPounds, BMI) %>% 
   mutate(mDate = as_date(mdy_hms (Date))) %>% 
   group_by(Id) %>% 
   arrange(mDate, .by_group = TRUE)
View(edited_weighin_4_5)
OUTPUT
# TOTAL OF 67 ROWS
#______________________MERGE BOTH APRIL DATASETS
merged_4_5 <- left_join
              (edited_activity_4_5,
               edited_weighin_4_5,
               by =c("Id","mDate"))
View(merged_4_5)
OUTPUT
# TOTAL OF 940 ROWS
#______________________BIND THE TWO MERGED TO YIELD FILTERED ENTIRE DATASETS 
# SHOULD HAVE 457 + 940 = 1397 ROWS
merged_data_3_5 <- rbind(merged_3_4, merged_4_5) 
#____________________REVIEW STRUCTURE OF ALL COLUMNS
str(merged_data_3_5)
#______________________FILTER OUT NOT RELEVANT COLUMNS
data_3_5 <- merged_data_3_5 %>% 
   select(Id, mDate, TotalSteps, TotalDistance, Calories, WeightPounds, BMI)
#_____________________SAVE AS NEW DATASET _____________________
file.create("bellabeat_cleaned_filtered_3_5.csv")
write_csv(data_3_5,"bellabeat_cleaned_filtered_3_5.csv")

ANALYZE WITH R

logins

Let’s see how many users actually manually logged in to to the weight app:

#__________________LET'S SEE HOW MANY USERS LOGGED IN TO THE WEIGHT APP
logins <- data_3_5 %>% 
            group_by(Id) %>% 
            summarize(number_of_logins = n()) %>% 
            arrange(desc(Id))
We get 35 users that logged in to the weight-in app. The table is below with the weighins table

weighins

#_______________LET'S SEE HOW MANY OF THE USERS ACTUALLY ENTERED A WEIGHT (WEGHINS)
weighins <- data_3_5 %>%
          ungroup() %>%
         filter(WeightPounds != "NA") %>%
         group_by(Id) %>%
         summarize(number_of_weighins = n()) %>%
        arrange(desc(Id))
As you can see only 13 out of the 35 users actually entered their weight in the app.

logged & weighed

#________________LET'S JOIN THE TWO TOGETHER TO SEE HOW MANY LOGGED-IN AND WEIGHED-IN
login_weighin <- left_join (logins, weighins, by = ("Id"))
#___________I COULD USE SLICE() TO FILTER OUT THE RELEVANT ROWS BUT THE TABLE IS SMALL ENOUGH TO READ

NOTE: it is extremely important to not join logins and weighins after you execute gt() above. So perform the join and get table printed out on the join then go back and gt() the previous two tables if you wish to output the prior two into gt(). If you do you’ll get this error:

Error in UseMethod("left_join") : 
  no applicable method for 'left_join' applied to an object of class "c('gt_tbl', 'list')"

observations

The data is promising

  • out of 30 users that were tracked (the data show 35 users which might be from duplicate accounts?!) so out of the 30 users 13 actually logged in to the weight app on a regular basis as is displayed in the number of logins column
  • Over a 60 day period you can see in column 2 that the users logged in on an extremely consistent basis, 30, 40 and over 50 times
  • 13 users entered their weight in the app. Once again, I really have no way of knowing if the weight-in side of the app is any different from the main app, but I have to assume that it is because the login information is included in the weight-in data not in the daily activity data.
  • 6 out of 13 entered their weight in 1 time over a 60 day period. I need to follow up on these users to inspect their activity levels over the time tracked. Maybe their lone entry could be explained by their acitivity level.
  • 5 of of the 13 entered their weight multiple times. Again over a 60 day period you really don’t expect much change in weight unless you are trying to actually gain or lose it. So once again we need to investigate their activity level to see if there is a correlation between multiple entry and activity, maybe they started or finished on a program?
  • 1 user had 33 and another user had 44 weigh-ins. That seems to be extreme as to someone is either on a hardcore weight loss program or maybe pregnant or possibly post pregnancy. Once again activity level needs to be investigated.
  • Overall 13 out of 30 users is a great percentage and that appears to be a function that provides a great UX and needs to be added to BellaBeat.

ANALYZE WITH SQL

unique Ids

Let’s look at how many unique Ids we have, we were told that the data was provided on 30 users. The data generated is saved in file: unique_Id_total_35.csv

NOTE: I’m using a condensed code spacing and indentation to save on screen size

SELECT  COUNT(DISTINCT(Id))
FROM    `prime-depot-415622.bellabeat.bellat_merged`
The code gives us 35. 

weightLogIn entries

Let’s see how many entries in the weightLogInfo dataset  were not NA for both columns BMI & Weightkg/pounds. In other words how many times did the user enter a value for either BMI or Weight out of the 1397 total entries. Data is saved in file: weight_or_bmi_entries_100.csv.

SELECT   Id, mDate
FROM     `prime-depot-415622.bellabeat.bellat_merged`
WHERE     BMI <> "NA" AND WeightPounds <> "NA"
The answer is 100 rows. 

weightLogIn users

Let’s see how many of those weightLogIn entries belonged to unique users. In other words how many users actually logged in their weight. Data is saved in file: users_that_weighedin_13.csv 

SELECT  Id, COUNT(*) AS number_of_weighins
FROM  `prime-depot-415622.bellabeat.bellat_merged`
WHERE   BMI <> "NA" AND  WeightPounds <> "NA"
GROUP BY  Id
ORDER BY  number_of_weighins
The answer is 13. So out of the 35 unique users 13 users actually manually?! entered their weight. That's 37%.

multiple entries

Let’s see how many users had multiple entries of their weight:
A brief explanation of the code below:

  • Start from the subquery within (colored in blue) : Select the Id column, and then COUNT how many times each ID which is GROUPED BY has loggedin their weight.
  • We can find that out by filtering the dataset with the WHERE clause that only looks at when values for BMI AND WeightPounds were entered and NOT NULL or =”NA”
  • Then we  group each entry by ID and count them and
  • Name the subquery AS subquery
  • Now let’s step back out to the outer query and SELECT what we just did in the  subquery and pull the ID and
  • Assign the count of the number_of_weighins to multiple_logins because we will use a
  • WHERE clause to just display the times a user logged in multiple times to enter their weight
  • That value comes up to 7
  • Breakdown of the multiple logins are as folllows:
SELECT
     subquery.Id,
     subquery.number_of_weighins AS multiple_logins
FROM
    (
     SELECT
         Id,
         COUNT(*) AS number_of_weighins
     FROM
         `prime-depot-415622.bellabeat.bellat_merged`
     WHERE
         BMI <> "NA" AND WeightPounds <> "NA"
     GROUP BY
         Id
     ) AS subquery
WHERE
    subquery.number_of_weighins > 1
ORDER BY
    number_of_weighins

Share

For a period of 62 days (from 3/12/2016 to 5/12/2016), out of the 13 unique users that logged in their weight, how many actually had multiple entries?

[table id=1 /]

observations
  • It appears that it is not necessary for the user to enter their weight for the app to make its calculations
  • Out of the 30 users provided we find 35 unique IDs
  • One user entered their weight 44 times in 62 days.
  • 13 out of 35 unique users used the manual weigh-in feature that’s 37% which is significant over a 62 day perioData time frame and sample size are small so the validity of my findings are highly unreliable

RECOMMENDATIONS

  • UX should be the number one priority for WellaBeat. The app needs to be redesigned to include the ability to track users use of the app.
  • WellaBeat’s mission is to help women achieve a healthy lifestyle, is that actually happening? Have we asked the users? Have we collected data to backup our claims?
  • Aside from looking within and improving our existing offerings, my main recommendation is to introduce a scale that syncs with the existing app if product design, supply, and maintenance is not a deterrent.
  • The other option would be to introduce an additional service via the existing app that can link with smart scales via bluetooth and/or wifi.
  • I’d imagine tracking the weight for pre/post and during pregnancy is extremely important. It’s obvious from the limited data provided that weigh-ins were extremely popular and effective in the marketplace.
  • Why did WellaBeat stop using Spring the fluid intake tracking product? Hydration is one of the most important aspects of health and fitness – even more so during or post pregnancy? Was that a product/supply issue? That should be a focus.
  • App should track which page is viewed, how often each page is viewed, time of day viewed. After that the data will give us more insightful information.
  • Age, gender, pregnancy status needs to be tracked either via a questionnaire/survey, account/profile creation.
  • My suggestions are easy to implement and should provide more insight as to the direction marketing should proceed in the near future.