What follows are some case studies that include “Documentation of any cleaning or manipulation of data” as part of the deliverables, so I’ve included all of my step-by-step details.
I’ll start off with R, then I’ll duplicate the process using SQL, and finally Tableau. Other analyses done in Python are not included here; they are covered in other posts.
WELLABEAT – CS1
Founded in 2014, Wellabeat developed one of the first wearables specifically designed for women and has since gone on to create a portfolio of digital products for tracking and improving women’s health.
Focusing on innovative health and wellness products for women, its mission is to empower women to take control of their health by providing technology-driven solutions that blend design and function, giving women the tools to reach their fullest potential through personalized wellness solutions aligned with their cycles. Here are the products that Wellabeat offers:
- Wellabeat app: The Wellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Wellabeat app connects to their line of smart wellness products.
- Leaf: Wellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Wellabeat app to track activity, sleep, and stress.
- Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Wellabeat app to provide you with insights into your daily wellness.
- Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Wellabeat app to track your hydration levels.
- Wellabeat membership: Wellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Here are some of the metrics that the app tracks based on information found on their site:
- meditation rate
- starts tracking at “first step”
- starts tracking at “first shut eye”
- ovulation cycle, temperature, pregnancy term….
- cardiac coherence!!??
- stress tolerance, which once again is based on heart rate and temperature levels?
- activity level
Purpose
- The goal of this project is to analyze Wellabeat’s provided data to reveal more opportunities for growth.
- Focus on one Wellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart non-Wellabeat devices.
- Then, using this information, make high-level recommendations for how these trends can inform Wellabeat marketing strategy.
- My report will be directed to the executive team:
- Urška Sršen: Wellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Wellabeat’s cofounder, key member of the Wellabeat executive team
Deliverables
I will produce a report with the following deliverables:
- A clear summary of the business task.
- A description of all data sources used
- Documentation of any cleaning or manipulation of data
- A summary of my analysis
- Supporting visualizations and key findings
- Top high-level content recommendations based on the analysis
Business task
The executive team believes that analyzing smart device fitness data from other products on the market could help unlock new growth opportunities for the company. I have been asked to focus on one of Wellabeat’s products and analyze external smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company.
Scope/major activity
data collection
- FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
- It is believed that the provided dataset has limitations.
identify one product
After reviewing Wellabeat’s site I realized that the information provided by the company differs from what is actually displayed on the current site. I’ll use one of the following products to focus the analysis. All of the listed products provide the same data to the user; what varies is the type of device:
- Ivy+ – wrist wearable tracker
- Leaf Urban – can be worn as a necklace or in a wrist band
- Leaf Chakra – can be pinned to a collar or blouse, or placed in a wrist band
create recommendations
- What are some trends in smart device usage?
- How could these trends apply to Wellabeat customers?
- How could these trends help influence Wellabeat marketing strategy?
Ask
There are many questions I have that went unanswered:
- Why are we using data from 2016?
- The data provided appears to target the fitness/workout community: steps per day, workout type, workout duration… while Wellabeat appears to be more about mental wellness for pregnant women!?
- Is the company thinking about branching into that highly competitive domain?
- Why doesn’t Wellabeat provide its own data for analysis?
- Does Wellabeat collect data?
- Does Wellabeat know how to track the metric most used or viewed by its users?
- Does Wellabeat know how to track the pages in the app that are most viewed by its users?
- Why isn’t the science behind the claims described on the site more transparent? Being open with the science behind the metrics might create a loyal customer base!
- Wellabeat claims it targets women; does Wellabeat know exactly how many of its users are women?
- Does Wellabeat know the age range or age concentration of its users?
- The site makes it obvious that the target market is pregnant women; does Wellabeat have any idea how many of their users are/were/will be pregnant?
Prepare
data
- Data is old: 3/2016 to 5/2016
- Covers 30 users of FitBit Fitness tracker
- Period of 62 days
- After reviewing Wellabeat’s site it appears that the data provided is geared more towards fitness rather than what Wellabeat monitors
- The workouts metric used by Wellabeat is in common with the data provided
- Activity level and calories burned are additional common metrics
- One very important metric in the data is the manual weigh-in, which is not tracked by Wellabeat
- The data provided shows activity levels, distance walked, calories burned… are those manually triggered by the user, or does the product continuously collect data? What I need to know is whether the user turns on the data tracking manually or the app tracks continually; I’d like to know how invested the users are. If the app tracks all that information automatically, then how do we know whether the user pays attention, reviews the data, or makes life decisions based on what’s been tracked? In other words, is the data provided for analysis helping any of the users, or is it just metrics that the user never reviews? And if users do review it, how many of them actually do, and how helpful are these metrics to them? Are we solving a problem? I have a Nike app on my phone that I used 2-3 times to track my running path, length, and distance. I haven’t used it in years and yet it still tracks my everyday movements! That’s data it collects that I never look at, and I don't even use the app.
- The data doesn’t mention the gender, race, or age of the 30 users
- Activity level is measured in the app? How is that actually calculated? Is the app linked to an exercise device?
- Does the app track the distance traveled by the phone and, if the movement is greater than a specific speed, assume it’s a high-intensity workout?! ….
- With all these questions, it’s obvious that there is a clear bias in the data
PROCESS – R
Note: I will be using R, SQL, and Tableau for processing and analysis, most of the same steps done in R will be repeated in SQL
Cleaning with R
read data
#______________________________INSTALL PACKAGES
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
install.packages("ggplot2")
#______________________________LOAD PACKAGES
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(stringr)
#______________________READ DAILY ACTIVITY DATA FROM MONTHS 3 TO 4 OF 2016
dailyactivity_df_3_4 <- read_csv("dailyActivity_merged_3_4.csv")
head(dailyactivity_df_3_4)
#_______________________FILTER TO RELEVANT COLUMNS
activity_3_4 <- dailyactivity_df_3_4 %>%
  select(Id, ActivityDate, TotalSteps, TotalDistance, Calories)
#________________________CONVERT FROM <char> TO <date> AND GROUP BY Id
edited_activity_3_4 <- activity_3_4 %>%
  mutate(mDate = mdy(ActivityDate)) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_activity_3_4)
# OUTPUT: TOTAL 457 ROWS
observations
The obvious path would be to visualize the correlation between all the metrics collected in the data, such as calories burned vs workout type, workout length, workout load type… but in my opinion that would be a waste of time and is irrelevant to our business task. The relationship between working out, workout load, time under stress, and calories burned has long been established.
Each variable contributes directly to calories burned; we are not doctors, nor would 30 users monitored for 62 days change the science, so I will not spend time on the majority of the data provided.
My focus will be on UX
- Does the product provide a solution to the users?
- Is there a benefit to the user for using the product?
- There is one metric that’s very important and displays active interaction with the users, and that’s the manual weigh-in
- For the user to take the time to get weighed and then manually enter the information in the app is extremely significant. It’s possible that the app used to collect the data could be linked to a Bluetooth- or wifi-enabled scale that automatically logs the weight once a value is read!?
- That would be a great service/device that could be a focus for Wellabeat.
read data
#_____________________READ WEIGHT LOGIN FROM 3 TO 4 OF 2016
weightloginfor_df_3_4 <- read_csv("weightLogInfo_merged_3_4.csv")
# Convert the Date column from <chr> to <date> and group by Id just like we did earlier.
# It appears that most of the Dates are displayed as 11:59:59 PM, so we have to assume that maybe the system
# purges at midnight; let's edit the date to drop the hms.
# It appears that when we drop the hms we get NA, so we'll keep it as
# mdy_hms and convert from date-time to date with as_date() after we convert the <chr> to datetime.
edited_weighin_3_4 <- weightloginfor_df_3_4 %>%
  select(Id, Date, WeightPounds, BMI) %>%
  mutate(mDate = as_date(mdy_hms(Date))) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_weighin_3_4)
# OUTPUT: TOTAL 33 ROWS
observations
- What’s interesting is that we get 33 weigh-in rows for only 30 users in March. Do we assume some are duplicates?
- Maybe they created multiple accounts?
- Is it a requirement for all users to log in to the weigh-in page before the app starts tracking and making calculations?
- Are the calories-burned calculations based on the weight of the user?
- Is it possible that once the user installs the app and creates an account, it automatically shows that they logged in to the weigh-in page?
- I need to dive into this data more and see how many of the users actually entered their weight and how many just logged in and never entered their weight (see the sketch after this list).
- Is this data from newly enrolled users? or
- Was the data extracted from a random timeframe?
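As a quick check on the duplicate question, here is a minimal sketch (using the column names from the weigh-in file read above) that counts distinct Ids and rows per Id in the March weigh-in data:
# Hypothetical check: how many distinct Ids produced the 33 weigh-in rows?
n_distinct(edited_weighin_3_4$Id)
# Rows per Id — several rows for one Id would hint at repeat entries rather than duplicate accounts
edited_weighin_3_4 %>%
  group_by(Id) %>%
  summarize(rows = n()) %>%
  arrange(desc(rows))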
merge data
#___________________________MERGE DAILY ACTIVITY AND WEIGHT-IN FOR MARCH _______________________
merged_3_4 <- left_join(edited_activity_3_4, edited_weighin_3_4, by = c("Id", "mDate"))
View(merged_3_4)
# OUTPUT: TOTAL 457 ROWS
#_____________________REPEAT ABOVE PROCESSES FOR DAILY ACTIVITY DATA FROM 4 TO 5 OF 2016_________
dailyactivity_df_4_5 <- read_csv("dailyActivity_merged_4_5.csv")
#_____________________FILTER DATASET
activity_4_5 <- dailyactivity_df_4_5 %>%
  select(Id, ActivityDate, TotalSteps, TotalDistance, Calories)
#_____________________CONVERT DATE COLUMN
edited_activity_4_5 <- activity_4_5 %>%
  mutate(mDate = mdy(ActivityDate)) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_activity_4_5)
# OUTPUT: TOTAL OF 940 ROWS
#_____________________READ WEIGHT LOGIN FROM 4 TO 5 OF 2016_________
weightloginfor_df_4_5 <- read_csv("weightLogInfo_merged_4_5.csv")
#_____________________FILTER DATASET AND CONVERT TOGETHER
edited_weighin_4_5 <- weightloginfor_df_4_5 %>%
  select(Id, Date, WeightPounds, BMI) %>%
  mutate(mDate = as_date(mdy_hms(Date))) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_weighin_4_5)
# OUTPUT: TOTAL OF 67 ROWS
#______________________MERGE BOTH APRIL DATASETS
merged_4_5 <- left_join(edited_activity_4_5, edited_weighin_4_5, by = c("Id", "mDate"))
View(merged_4_5)
# OUTPUT: TOTAL OF 940 ROWS
#______________________BIND THE TWO MERGED TO YIELD FILTERED ENTIRE DATASETS
# SHOULD HAVE 457 + 940 = 1397 ROWS
merged_data_3_5 <- rbind(merged_3_4, merged_4_5)
#____________________REVIEW STRUCTURE OF ALL COLUMNS
str(merged_data_3_5)
#______________________FILTER OUT NON-RELEVANT COLUMNS
data_3_5 <- merged_data_3_5 %>%
  select(Id, mDate, TotalSteps, TotalDistance, Calories, WeightPounds, BMI)
#_____________________SAVE AS NEW DATASET _____________________
file.create("bellabeat_cleaned_filtered_3_5.csv")
write_csv(data_3_5, "bellabeat_cleaned_filtered_3_5.csv")
ANALYZE – R
Analyze with R
logins
Let’s see how many users actually manually logged in to the weight app:
#__________________LET'S SEE HOW MANY USERS LOGGED IN TO THE WEIGHT APP
logins <- data_3_5 %>%
  group_by(Id) %>%
  summarize(number_of_logins = n()) %>%
  arrange(desc(Id))
We get 35 users that logged in to the weigh-in app. The table is shown below, next to the weigh-ins table.
weighins
#_______________LET'S SEE HOW MANY OF THE USERS ACTUALLY ENTERED A WEIGHT (WEIGH-INS)
weighins <- data_3_5 %>%
  ungroup() %>%
  filter(!is.na(WeightPounds)) %>%  # missing weights are NA values, not the string "NA"
  group_by(Id) %>%
  summarize(number_of_weighins = n()) %>%
  arrange(desc(Id))
As you can see, only 13 out of the 35 users actually entered their weight in the app.
logged & weighed
#________________LET'S JOIN THE TWO TOGETHER TO SEE HOW MANY LOGGED-IN AND WEIGHED-IN
login_weighin <- left_join(logins, weighins, by = "Id")
#___________I COULD USE SLICE() TO FILTER OUT THE RELEVANT ROWS BUT THE TABLE IS SMALL ENOUGH TO READ
NOTE: it is extremely important not to join logins and weighins after you execute gt() on them. Perform the join and print the joined table first, then go back and pass the previous two tables to gt() if you wish to render them. If you join the gt() outputs instead, you’ll get this error:
Error in UseMethod("left_join") :
no applicable method for 'left_join' applied to an object of class "c('gt_tbl', 'list')"
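In other words, keep the gt() rendering at the end of the pipeline. A minimal sketch of the safe ordering, assuming the gt package is loaded:
library(gt)
# Join the plain tibbles first...
login_weighin <- left_join(logins, weighins, by = "Id")
# ...then render each table with gt() afterwards
gt(login_weighin)
gt(logins)
gt(weighins)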
observations
The data is promising:
- Out of the 30 users that were tracked (the data shows 35 users, which might be due to duplicate accounts?!), 13 actually logged in to the weight app on a regular basis, as displayed in the number_of_logins column
- Over a 60-day period you can see in column 2 that the users logged in on an extremely consistent basis: 30, 40, and over 50 times
- 13 users entered their weight in the app. Once again, I really have no way of knowing if the weigh-in side of the app is any different from the main app, but I have to assume that it is because the login information is included in the weigh-in data, not in the daily activity data.
- 6 out of the 13 entered their weight once over a 60-day period. I need to follow up on these users and inspect their activity levels over the time tracked. Maybe their lone entry could be explained by their activity level.
- 5 of the 13 entered their weight multiple times. Over a 60-day period you really don’t expect much change in weight unless you are actively trying to gain or lose it. So once again we need to investigate their activity level to see if there is a correlation between multiple entries and activity; maybe they started or finished a program?
- 1 user had 33 weigh-ins and another had 44. That seems extreme: someone is either on a hardcore weight-loss program, or maybe pregnant or possibly post-pregnancy. Once again, activity level needs to be investigated.
- Overall, 13 out of 30 users is a great percentage; this appears to be a function that provides a great UX and needs to be added to Wellabeat.
ANALYZE – SQL
Analyze with SQL
unique Ids
Let’s look at how many unique Ids we have; we were told that the data covers 30 users. The data generated is saved in file: unique_Id_total_35.csv
NOTE: I’m using a condensed code spacing and indentation to save on screen size
SELECT COUNT(DISTINCT(Id))
FROM `prime-depot-415622.bellabeat.bellat_merged`
The code gives us 35.
weightLogIn entries
Let’s see how many entries in the weightLogInfo dataset were not NA for both the BMI and WeightPounds columns. In other words, how many times did a user enter a value for both BMI and weight out of the 1397 total entries. Data is saved in file: weight_or_bmi_entries_100.csv.
SELECT Id, mDate
FROM `prime-depot-415622.bellabeat.bellat_merged`
WHERE BMI <> "NA" AND WeightPounds <> "NA"
The answer is 100 rows.
weightLogIn users
Let’s see how many of those weightLogIn entries belonged to unique users. In other words how many users actually logged in their weight. Data is saved in file: users_that_weighedin_13.csv
SELECT Id, COUNT(*) AS number_of_weighins
FROM `prime-depot-415622.bellabeat.bellat_merged`
WHERE BMI <> "NA" AND WeightPounds <> "NA"
GROUP BY Id
ORDER BY number_of_weighins
The answer is 13. So out of the 35 unique users, 13 actually (manually?!) entered their weight. That's 37% (a one-query way to compute that share is sketched below).
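A minimal sketch to compute that share in one query, assuming, as in the queries above, that missing values were imported as the string "NA":
SELECT
  COUNT(DISTINCT IF(BMI <> "NA" AND WeightPounds <> "NA", Id, NULL)) AS users_who_weighed_in,
  COUNT(DISTINCT Id) AS unique_users,
  ROUND(100 * COUNT(DISTINCT IF(BMI <> "NA" AND WeightPounds <> "NA", Id, NULL))
        / COUNT(DISTINCT Id), 1) AS pct_weighed_in
FROM `prime-depot-415622.bellabeat.bellat_merged`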
multiple entries
Let’s see how many users had multiple entries of their weight:
A brief explanation of the code below:
- Start from the inner subquery: select the Id column, then COUNT how many times each Id (which is GROUPed BY) logged in their weight.
- We find that out by filtering the dataset with a WHERE clause that only keeps rows where values for BMI AND WeightPounds were entered, i.e. not NULL or ="NA"
- Then we group the entries by Id, count them, and
- Name the subquery AS subquery
- Now let’s step back out to the outer query, SELECT what we just did in the subquery, pull the Id, and
- Assign the count of number_of_weighins to multiple_logins, because we will use a
- WHERE clause to only display the users who logged in multiple times to enter their weight
- That value comes to 7
- The breakdown of the multiple logins is as follows:
SELECT
subquery.Id,
subquery.number_of_weighins AS multiple_logins
FROM
(
SELECT
Id,
COUNT(*) AS number_of_weighins
FROM
`prime-depot-415622.bellabeat.bellat_merged`
WHERE
BMI <> "NA" AND WeightPounds <> "NA"
GROUP BY
Id
) AS subquery
WHERE
subquery.number_of_weighins > 1
ORDER BY
number_of_weighins
Share
For a period of 62 days (from 3/12/2016 to 5/12/2016), out of the 13 unique users that logged in their weight, how many actually had multiple entries?
[table id=4 /]
observations
- It appears that it is not necessary for the user to enter their weight for the app to make its calculations
- Out of the 30 users provided we find 35 unique Ids
- One user entered their weight 44 times in 62 days.
- 13 out of 35 unique users used the manual weigh-in feature; that’s 37%, which is significant over a 62-day period
- The time frame and sample size are small, so the validity of my findings is highly unreliable
RECOMMENDATIONS
- UX should be the number one priority for Wellabeat. The app needs to be redesigned to include the ability to track users’ use of the app.
- Wellabeat’s mission is to help women achieve a healthy lifestyle; is that actually happening? Have we asked the users? Have we collected data to back up our claims?
- Aside from looking within and improving our existing offerings, my main recommendation is to introduce a scale that syncs with the existing app, if product design, supply, and maintenance are not a deterrent.
- The other option would be to introduce an additional service via the existing app that can link with smart scales via Bluetooth and/or wifi.
- I’d imagine tracking weight before, during, and after pregnancy is extremely important. It’s obvious from the limited data provided that weigh-ins are popular and effective in the marketplace.
- Why did Wellabeat stop offering Spring, the fluid-intake tracking product? Hydration is one of the most important aspects of health and fitness, even more so during or after pregnancy. Was that a product/supply issue? That should be a focus.
- The app should track which page is viewed, how often each page is viewed, and the time of day it is viewed. That data will give us more insightful information (see the sketch after this list).
- Age, gender, and pregnancy status need to be tracked, either via a questionnaire/survey or at account/profile creation.
- My suggestions are easy to implement and should provide more insight as to the direction marketing should take in the near future.
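To illustrate the in-app tracking recommendation above, here is a minimal, hypothetical sketch of an event table the app could populate; the table and column names are my own invention, not anything Wellabeat actually has:
-- Hypothetical page-view event table for the Wellabeat app
CREATE TABLE app_page_views (
  event_id   STRING,    -- unique event identifier
  user_id    STRING,    -- ties views to a user profile
  page_name  STRING,    -- e.g. "weigh-in", "sleep", "cycle"
  viewed_at  TIMESTAMP  -- gives page, frequency, and time of day
);

-- Example question it would answer: which pages are viewed most, and when?
SELECT page_name,
       EXTRACT(HOUR FROM viewed_at) AS hour_of_day,
       COUNT(*) AS views
FROM app_page_views
GROUP BY page_name, hour_of_day
ORDER BY views DESC;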
BIKESHARE – CS2
A bike-share program that features more than 5,800 bicycles and 600 docking stations. Bikeshare sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Bikeshare users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
In 2016, Bikeshare launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Bikeshare’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans:
- single-ride passes
- full-day passes and
- annual memberships
Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are members. Bikeshare’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Bikeshare attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth.
Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Bikeshare program and have chosen Bikeshare for their mobility needs.
Purpose
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Bikeshare bikes differently.
From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Bikeshare executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Moreno has set a clear goal:
Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to know:
- How do annual members and casual riders use bikes differently?
- Why would casual riders buy annual memberships?
- How can Bikeshare use digital media to influence casual riders to become members?
- Moreno and her team are interested in analyzing the historical bike trip data to identify trends.
Stakeholders
- Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
These may include email, social media, and other channels.
- Bikeshare marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide the marketing strategy. You joined this team six months ago and have been busy learning about Bikeshare’s mission and business goals—as well as how you, as a junior data analyst, can help Bikeshare achieve them.
- The executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Deliverables
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top three recommendations based on your analysis
Prepare
data
Bikeshare’s historical trip data is downloadable here.
The data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Bikeshare bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information.
Note: if using Posit’s RStudio, use the Divvy 2019 Q1 and Divvy 2020 Q1 datasets. Choosing other data might lead to errors because the data exceeds the memory available in the free plan.
load data
#______________________________INSTALL PACKAGES
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
install.packages("ggplot2")
install.packages("readr")
install.packages("readxl")
install.packages("gt")
install.packages("webshot2")
#______________________________LOAD LIBRARIES
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(stringr)
library(readr)
library(readxl)
library(dplyr)
library(gt)
library(webshot2)
#______________________READ DATA FROM Q1 2019 AND 2020
trips19 <- read_excel("BikeShare/BikeShareOriginalData/Divvy_Trips_2019_Q1.xlsx")
trips20 <- read_excel("BikeShare/BikeShareOriginalData/Divvy_Trips_2020_Q1.xlsx")
observations
- The two plans offered to casual riders, single-ride passes and full-day passes, are not tracked in the data provided; all that is tracked is whether the user is a “Customer” or a “Subscriber”.
- The company claims that “Bikeshare users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day”; how is that number calculated? Nothing in the data shows how it could possibly be calculated!
- UserId is not tracked!? How are we to guess how many users are using the system? This is a basic data-collection issue which makes the data broad and almost useless.
- Birth year and gender data are collected, which could become helpful if we can figure out how many users we have.
- Up to this point I pretty much know which direction I have to shift my analysis based on what’s provided in the data. This is not bias by any means; it is the simple understanding that the company lacks the knowledge of collecting the right data. It’s like trying to estimate the mpg for a car based on data collected on wiper fluid level.
- At this point I’d have to take my concerns to my manager and advise them of the limitations of the data collected. I will also need to bring up the fact that her hypothesis, “Moreno believes there is a solid opportunity to convert casual riders into members”, will be impossible to justify with this data, since we don’t know the exact number of casual and member users.
- Either additional data needs to be provided or the system needs to be adjusted to collect the userId.
- Since the data is provided for Customers and Subscribers as a whole and not broken down per user, I’ll have to see what insight can still be extracted.
- Column naming took a significant change in 2020 compared to 2019 (a quick way to surface these differences is sketched after this list):
  - trip_id > ride_id
  - start_time > started_at
  - end_time > ended_at
  - bike_id, tripduration, gender, and birthyear were all removed
  - from_station_id & from_station_name > start_station_id & start_station_name
  - to_station_id & to_station_name > end_station_id & end_station_name
  - usertype > member_casual
- end_lat, end_lng, start_lat, and start_lng were added
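As referenced above, a minimal sketch for surfacing these column differences between the two data frames loaded earlier:
# Columns present in 2019 but dropped in 2020
setdiff(names(trips19), names(trips20))
# Columns newly added in 2020
setdiff(names(trips20), names(trips19))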
PROCESS – R
I usually detail my steps in “documents” for record keeping; in this project, “Documentation of any cleaning or manipulation of data” is part of the deliverables, so I’ve included it below.
Note: I will be using R, SQL, and Tableau for processing and analysis, most of the same steps done in R will be repeated in SQL
Cleaning with R
rename
#_____________________RENAME 2019 COLUMNS TO MATCH 2020
trips19 <- rename( trips19,
ride_id = trip_id,
started_at = start_time,
ended_at = end_time,
rideable_type = bikeid,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
change type
trips19 has ride_id as “num” while trips20 has it as “chr”, so let’s change trips19 to “chr” so we can merge/bind both datasets together without any issues!
#____________________CONVERT ride_id & rideable_type FROM num TO char
trips19 <- mutate (trips19,
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
vertical merge datasets
rbind() is similar but REQUIRES the same number of identically named columns in all datasets, so we use bind_rows() instead: trips19 has columns in common with trips20 as well as unique ones (tripduration, gender & birthyear) that aren’t in trips20. On the other hand, trips20 has columns (start_lat, start_lng, end_lat, end_lng) that don’t exist in trips19.
#___________________COMBINE BOTH DATASETS VERTICALLY VIA bind_rows()
trips19_20 <- bind_rows(trips19, trips20)
filter/Exclude Columns
#__________________FILTER NEW DATASETS PRIOR TO MANIPULATION - EXCLUDE CERTAIN COLUMNS
trim_trips19_20 <- subset(trips19_20, select = -c(gender, birthyear))
#__________________OR YOU CAN DO IT THIS WAY AS WELL
trim_trips19_20 <- trips19_20 %>%
  select(-c(gender, birthyear))
replace with case_when()
The member_casual column had values in the 2019 dataset that we need to update to match the values we’ll use from 2020 forward. We can use case_when() or case_match(); recode() has been superseded by case_match() (see the case_match() sketch after the code below):
#_________________RENAME member_casual COLUMN FROM member & casual TO Subscriber & Customer
all_trips19_20 <- trim_trips19_20 %>%
mutate(member_casual = case_when(
member_casual == "member" ~ "Subscriber",
member_casual == "casual" ~ "Customer",
member_casual == "Subscriber" ~ "Subscriber",
TRUE ~ "Customer"
))
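For reference, a minimal sketch of the same recode using case_match(), assuming dplyr 1.1.0 or later:
all_trips19_20 <- trim_trips19_20 %>%
  mutate(member_casual = case_match(member_casual,
    "member" ~ "Subscriber",   # 2020 label -> 2019 label
    "casual" ~ "Customer",
    .default = member_casual   # "Subscriber"/"Customer" pass through unchanged
  ))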
table() to verify
Just to make sure we ended up with ONLY the two desired values and didn’t miss anything, we can use the code below to check which values exist in the member_casual column:
#_________________CHECK TO SEE IF ALL WE HAVE ARE THE TWO DESIRED VALUES
table(all_trips19_20$member_casual)
# OUTPUT
#   Customer Subscriber
#      71643     720313
Observations
- My prior observations regarding the data are validated, as we can only gather the total number of rides taken by “Customers” vs “Subscribers”.
- We can break down each type per station, per day, per hour, per gender, per age, but it’s impossible to know how many actual “riders” we are dealing with. How are we supposed to know how many “Customer” users there are?
- How can we even imagine an entire marketing campaign in an attempt to convert users that we can’t quantify?
- The data shows that Customers account for roughly 9% of all trips (71,643 of 791,956; see the sketch after this list)!
- Once again, not knowing how many actual users generated all these rides, it is NOT ADVISABLE to spend ANY resources trying to analyze nor devise a marketing strategy or campaign to convert the users behind that small share of our rides.
- My recommendation to my manager is that we stop this analysis at this point until additional data is provided, or collection methods are revised.
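The share quoted above can be computed straight from the table() output; a minimal sketch:
# Proportion of rides per user type
prop.table(table(all_trips19_20$member_casual))
# Customer ~0.09, Subscriber ~0.91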
calculate ride_length
difftime() yields a duration (drtn) column, which displays in the df as 390 secs, 441 secs, …, so I use as.numeric() to convert it to a plain number, which drops the "secs" label as well.
#____________________CALCULATE ride_length AND DROP tripduration
all_trips19_20 <- all_trips19_20 %>%
mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = 'secs')))
drop column
#____________________DROP tripduration COLUMN
all_trips19_20 <- all_trips19_20 %>%
select(-c(tripduration))
check for ZERO values
Let’s check to see what kind of values are in ride_length
#___________________CHECK FOR VALUES == 0 IN ride_length
any(all_trips19_20$ride_length == 0)
# [1] TRUE
check for <0 values
#___________________CHECK FOR NEGATIVE VALUES IN ride_length
any(all_trips19_20$ride_length < 0)
# [1] TRUE
#___________________ANOTHER WAY IS USING all()
all(all_trips19_20$ride_length > 0)
# [1] FALSE
check for NA values
#___________________CHECK FOR NA VALUES IN ride_length
any(is.na(all_trips19_20$ride_length))
# [1] FALSE
#__________________OF COURSE WE CAN CHECK FOR NON-MISSING DATA
any(!is.na(all_trips19_20$ride_length))
# [1] TRUE
#__________________AND FOR NON-MISSING DATA THAT IS < 0 OR == 0
any(all_trips19_20$ride_length <= 0, na.rm = TRUE)
# [1] TRUE
count trips of 0 length
#___________________COUNT HOW MANY ride_length VALUES == 0
sum(all_trips19_20$ride_length == 0)
# [1] 93
count <0 values
#___________________COUNT HOW MANY NEGATIVE ride_length VALUES WE HAVE
sum(all_trips19_20$ride_length < 0)
# [1] 117
identify <0 rows
You could install another package if you want to identify the rows that contain the negative ride_length.
#______________________IDENTIFY ROWS WITH NEGATIVE ride_length
install.packages("fixr")
library(fixr)
check_for_negative_values(all_trips19_20$ride_length)
# Data frame contains negative values.
#   [1] 520402 648667 651963 652290 653845 655288 655654 656362 657167 658842
#  [19] 669752 678262 678499 678889 679182 679252 680301 680757 682576 682662
#  [37] 694476 694635 696693 697774 698938 699115 699204 699450 699619 699855
#  [55] 714612 714687 714761 715171 715474 719275 719680 725559 726496 730499
#  [73] 742697 743604 743619 745898 747194 747233 748686 754006 754853 755032
#  [91] 763358 766631 767985 768102 768693 768833 772284 772868 774788 775777
# [109] 779170 779496 781519 781699 786503 786687 788671 788946 789252
As you can see, if we specifically target the ride_length column we get a list of all the rows that contain a negative value in that column.
identify ALL <0 values
Let’s see if any other NEGATIVE values exist in the entire dataset. The output displays all the rows where a NEGATIVE value appears in a specific column. As you can see we have 117 rows, and the only column with negative values is column 10, which happens to be the ride_length column.
#_____________________IDENTIFY ANY NEGATIVE VALUES IN DF
check_for_negative_values(all_trips19_20)
# Data frame contains negative values.
#          row col
#   [1,] 520402  10
#   [2,] 648667  10
#   [3,] 651963  10
#   [4,] 652290  10
# ...
# [115,] 788671  10
# [116,] 788946  10
# [117,] 789252  10
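If you’d rather not install an extra package, a minimal base-R sketch gets the same information:
# Row indices with negative ride_length, no extra package needed
which(all_trips19_20$ride_length < 0)
# Or flag which numeric columns contain any negative values at all
sapply(Filter(is.numeric, all_trips19_20), function(col) any(col < 0, na.rm = TRUE))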
extract ride_length <0 rows
Let’s see what stands out in these rows with NEGATIVE ride_length values. So let’s create a new df with just those rows
#_____________________LET'S CREATE A DF WITH JUST THE NEGATIVE ride_length FOR REVIEW
neg_length <- all_trips19_20[all_trips19_20$ride_length < 0, ]
View(neg_length)
the new df shows 117 rows and:
- all start_station_name and end_station_name values == HQ QR
- of course all start_station_id and end_station_id values are the same == 675
- After further research I realized that these were bikes taken out of circulation for quality control, maintenance…
- So we need to remove these rows from the dataset
extract ride_length = 0 rows
#_____________________LET'S CREATE A DF WITH JUST THE ZERO ride_length FOR REVIEW
zero_length <- all_trips19_20[all_trips19_20$ride_length == 0, ]
View(zero_length)
Similar to the NEGATIVE value rows all observations listed above apply here as well, so we’ll be removing these rows from the dataset as well. There are 93 rows in this df.
check df dimensions
Before we delete any rows, let’s check the dimensions of the df
#_____________________CHECK DF DIMENSIONS
dim(all_trips19_20)
# [1] 791956     10
delete <= 0 rows
I will not delete any rows; instead I’ll create a new df excluding the rows with ride_length <= 0. If you want to go further and include another logical condition (OR |) for station names == HQ QR, you can be my guest. I will not do it; instead I will check the new df afterwards for station names == HQ QR and deal with them then if they exist. I believe including that condition in this step might cause results I am not ready for yet. We know for sure we only have 117 + 93 = 210 rows, so let’s focus on that now.
#____________________CREATE A NEW DF WITH POSITIVE ride_length
all_trips19_20_v2 <- all_trips19_20[all_trips19_20$ride_length > 0, ]
dim(all_trips19_20_v2)
# [1] 791746     10
extract “HQ QR” rows
Now that we’ve taken out all rows that had a ride_length <= 0, let’s go back and see if there are any rows where start_station_name or end_station_name columns had values of “HQ QR”.
#____________________LET'S CHECK TO SEE IF "HQ QR" APPEARS IN ANY *_station_name COLUMNS
hq_anywhere <- all_trips19_20_v2 %>%
  filter(start_station_name == "HQ QR" | end_station_name == "HQ QR")  # plain filter(); filter_all() is unnecessary here
View(hq_anywhere)
It appears that 3558 rows have HQ QR and their ride_length is >0. So we need to remove them as well as anything <=0
delete rows with “HQ QR”
#_________________REMOVE ROWS THAT CONTAIN "HQ QR"
all_trips19_20_v3 <- all_trips19_20_v2[!(all_trips19_20_v2$start_station_name == "HQ QR" | all_trips19_20_v2$end_station_name == "HQ QR"), ]
dim(all_trips19_20_v3)
# [1] 788188     10
add timeframes
The way the data is set up, it will help down the road if we break the started_at time into more detailed columns, such as year, quarter, month, day of the month, and day of the week. The reason is that when we perform our analysis we can look for patterns across weekdays, weekends, and specific months or quarters.
#_______________LET'S BREAK started_at INTO > DATE, YEAR, QUARTER, MONTH(NUM), DAY(NUM), WEEKDAY(STRING)
#_______________ADD TIMEFRAMES
all_trips19_20_v3 <- all_trips19_20_v3 %>%
  mutate(
    date = format(as.Date(started_at), format = "%m%d%Y"),  # monthdayYYYY
    year = format(as.Date(started_at), format = "%Y"),      # %Y > 1111 and %y > 11
    quarter = quarters(started_at),                         # quarter
    month = format(as.Date(started_at), format = "%m"),     # month as a number
    day = day(started_at),                                  # day of the month as a number
    day_of_week = weekdays(started_at)                      # text for the day
  )
unique start stations
#__________________________LET'S SEE HOW MANY UNIQUE STARTING STATIONS WE HAVE
start_id <- all_trips19_20_v3 %>%
  # filter(year == 2020) %>%
  distinct(start_station_id) %>%
  arrange(start_station_id) %>%
  rename(stationID = start_station_id)
# 611 ROWS; if we filter out 2019 we get 606 rows, so 5 stations were removed in 2020
unique end stations
#__________________________LET'S SEE HOW MANY UNIQUE ENDING STATIONS WE HAVE
end_id <- all_trips19_20_v3 %>%
  # filter(year == 2020) %>%
  distinct(end_station_id) %>%
  arrange(end_station_id) %>%
  rename(stationID = end_station_id)
# 611 ROWS
compare stations
#_________________________LET'S SEE IF BOTH STATION LISTS ARE THE SAME
identical(start_id$stationID, end_id$stationID)  # use the renamed column
# FALSE!
all.equal(start_id, end_id)
# "Names: 1 string mismatch" "Component 1: Mean relative difference: 0.002392677"
glimpse(end_id)
# both are dbl
summary(compare_df_cols_same(start_id, end_id))
summary(compare_df_cols(start_id, end_id))
outer join lists
#________________OUTER JOIN THEM BECAUSE WE WANT ALL DISTINCT STATION IDs FROM BOTH
join_station_id <- merge(start_id, end_id, by = 'stationID', all = TRUE)
# 613 ROWS
filter lists
#___________________FILTER OUT STARTING & ENDING STATIONS WITHOUT COORDINATES
station_coord <- all_trips19_20_v3 %>%
  select(start_station_id, start_lat, start_lng) %>%
  filter(!is.na(start_lat), !is.na(start_lng)) %>%
  group_by(start_station_id, start_lat, start_lng) %>%
  summarize(number_of_rides = n()) %>%  # this helper column gets dropped after the join below
  arrange(start_station_id) %>%
  rename(stationID = start_station_id)
# 606 ROWS; fewer because some trips didn't have geo coordinates
uncommon stations
#____________________LET'S SEE WHAT IS UNCOMMON BETWEEN THE TWO
anti_join(join_station_id, station_coord)
# Below we get the 7 stationIDs that are not in station_coord:
# Joining with `by = join_by(stationID)`
#   stationID
# 1       360
# 2       361
# 3       561
# 4       564
# 5       565
# 6       566
# 7       665
review uncommon ids
#____________________LET'S EXTRACT THOSE STATIONS FOR REVIEW
uncommon <- all_trips19_20_v3[all_trips19_20_v3$start_station_id == 360 |
                              all_trips19_20_v3$start_station_id == 361 |
                              all_trips19_20_v3$start_station_id == 561 |
                              all_trips19_20_v3$start_station_id == 564 |
                              all_trips19_20_v3$start_station_id == 565 |
                              all_trips19_20_v3$start_station_id == 566 |
                              all_trips19_20_v3$start_station_id == 665, ]
# OUTPUT
# 40 ROWS of trips without geolocations, all from 2019.
# Those stations are no longer active in 2020, so we can ignore them and proceed with an inner join,
# which will eliminate the uncommon ones.
inner join lists
Here I’ll take the cleaned station list and join them with the stations with coordinates. Since some in the station list are no longer in use, using inner join will automatically eliminate those unwanted stations from the join_station_id list.
#_____________________________INNER JOIN TO COMBINE COORDINATES WITH ACTIVE STATIONS
all_active_stations_coord <- inner_join(join_station_id, station_coord, by = 'stationID')
#____________________________DROP NUMBER OF RIDES COLUMN THAT WAS USED TO GROUP THEM
all_active_stations_coord <- all_active_stations_coord %>%
  select(-c(number_of_rides))
Save data
save cleaned data
Now we have two cleaned files, one with all the pertinent data we need and the other with the coordinates to every station in the first file.
#__________________________________SAVE CLEANED DATA________________________
file.create("rcleaned_bikeshare_q1_19_20.csv")
write_csv(all_trips19_20_v3,"rcleaned_bikeshare_q1_19_20.csv")
file.create("rcleaned_bikeshare_geolocation_20.csv")
write_csv(all_active_stations_coord,"rcleaned_bikeshare_geolocation_20.csv")
ANALYZE – R
Analyze with R
count rides/user
#______________________________LET'S SEE THE COUNT FOR EACH USER TYPE
table(all_trips19_20_v3$member_casual)
#   Customer Subscriber
#      67877     720311
The total of the two user types adds up to 788188, which is what we had a couple of sections above.
#_____________________________COULD BE DONE THESE OTHER WAYS
all_trips19_20_v3 %>%
  group_by(member_casual) %>%
  summarize(number_of_rides = n()) %>%
  gt() %>%
  gtsave("total_per_user.png")
#___________________________________OR THIS WAY
all_trips19_20_v3 %>%
  count(member_casual) %>%
  gt()
observation
- I recommend tracking both with IDs so we can narrow in on how many users are more active than others and gather some insight on their use
- If users are tracked via ID and zip code, then we’ll know how many of them are business commuters, casual users, or actual tourists who just use the bikes while in town. This way we can have a more focused marketing campaign.
summary
#________________MAX, MIN, MEAN, MEDIAN FOR ride_length
min_trip_hours = round(min(all_trips19_20_v3$ride_length)/3600, 2)
max_trip_minutes = round(max(all_trips19_20_v3$ride_length)/60, 2)
max_trip_hours = round(max(all_trips19_20_v3$ride_length)/3600, 2)
median(all_trips19_20_v3$ride_length)
mean(all_trips19_20_v3$ride_length)
#_______________________INSTEAD I PREFER TO DO IT ALL IN ONE STATEMENT
summary(all_trips19_20_v3$ride_length)
#    Min. 1st Qu.  Median    Mean 3rd Qu.     Max.
#       1     331     539    1189     912 10632022
summarise
We’ll get a better idea of the breakdown if we group them by type of user and then calculate the statistics for each category
#______________________GROUP BY TYPE OF RIDER THEN CALCULATE STATISTICS
all_trips19_20_v3 %>%
  group_by(member_casual) %>%
  summarise(
    max = max(ride_length),
    min = min(ride_length),
    median = median(ride_length),
    mean = mean(ride_length)
  ) %>%
  gt()
rides per year
#_________________LOAD LIBRARY scales
#_______________TOTAL TRIPS FROM YEAR TO YEAR
library(scales)
trips_year <- trips_19_20 %>%
  group_by(year) %>%
  summarise(number_of_rides = n()) %>%
  mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
  gt() %>%
  tab_header(title = md("**Daily breakdown for rides/year**"))
gtsave(trips_year, "trips_year.png")
rides per year/user type
#___________________YEARLY CHANGE PER USER TYPE
user_year <- trips_19_20 %>%
group_by(member_casual, year) %>%
summarise(number_of_rides =n()) %>%
mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
gt() %>%
tab_header(title =md("**Rides/year per user type** "))
gtsave( user_year, "user_per_year.png")
observation
- The table above clearly shows that marketing has done a great job in attracting new Customers, with a 93% increase.
- The Subscriber increase is very good as well.
- Once again the disparity between the total rides by each user type is glaring; looking at the previous table, which showed a 16% increase in total rides from 2019 to 2020, we can conclude that the 93% Customer increase only contributed about 5% of the total rides.
- It is imperative that we tie the number of users to the number of rides (which is not done at this time) to gain any meaningful insight from our analysis.
total rides per day
Let’s see the breakdown of total rides per day per user type (member_casual)
#______________USE TABLE FOR BREAKDOWN TOTAL RIDES/DAY/USERTYPE
daily_table <- trips_19_20 %>%
  group_by(member_casual, week_day) %>%
  summarise(max = max(ride_length),
            min = min(ride_length),
            median = median(ride_length),
            mean = mean(ride_length),
            number_of_rides = n()) %>%
#__________________DESIGN TABLE OUTPUT
  gt() %>%
  tab_header(
    title = md("**Daily breakdown for both _user types_**")
  ) %>%
  fmt_number(columns = c(median, mean), decimals = 0)
# save the assigned table AFTER the pipe, not inside it
gtsave(daily_table, "daily_bkdwn.png")
condensed table
#________________CONDENSE TABLE FOR EASY READ
trips_19_20 %>%
group_by(member_casual, week_day) %>%
summarise(number_of_rides =n()) %>%
gt() %>%
tab_header(title =md("**Daily breakdown for both _user types_** "))
total rides per day/year
#_______________________LET'S SEE HOW TRIPS CHANGED PER DAY FROM YEAR TO YEAR
daily_yearly <- trips_19_20 %>%
group_by(member_casual,year, week_day) %>%
summarise(number_of_rides =n()) %>%
gt() %>%
tab_header(title =md("**Daily breakdown for both _user types year vs year** "))
gtsave( daily_yearly, "daily_yearly.png")
percent change rides/day
#_________________LET'S SEE PERCENTAGE CHANGE FROM YEAR TO YEAR PER DAY
daily_percent <- trips_19_20 %>%
group_by(member_casual, week_day, year) %>%
summarise(number_of_rides =n()) %>%
mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
gt() %>%
tab_header(title =md("**Daily percent change for both _user types year vs year** "))
gtsave( daily_percent, "daily_percent.png")
observations
FYI: day 1 = Sunday, day 2 = Monday, and so on.
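If you prefer labels over numbers in these tables, here is a minimal sketch, assuming week_day is stored as a number with 1 = Sunday as described above; the week_day_name column is my own addition:
# Map the numeric week_day (1 = Sunday) to a readable label
trips_19_20 <- trips_19_20 %>%
  mutate(week_day_name = factor(week_day, levels = 1:7,
    labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))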
Customers user type:
- Customer users are more active on Saturdays and Sundays, which would lead one to believe that they could be tourists or visitors from one of the suburbs of Chicago.
- As I noted before, tracking zip code for users or a userId with a profile would quickly focus and clarify our path to a more insightful analysis.
- The most obvious percent increase is on day 1, which is Sunday.
- Saturday only had a 25% increase but still has the second-most rides, after Sunday.
- I am not sure how you can convert weekend users to Subscribers if they are just visitors who travel to the city on occasion. Not knowing the pricing of a daily pass compared to a yearly subscription, I need more information to be able to devise a logical, effective marketing strategy for their conversion.
- We also don’t know if any of the users are repeat users, or if all these rides come from new cyclists being exposed to our bikes.
- If we are generating unique users, it will be worth marketing to them with a different plan than to the regular commuters who make up the weekday usage shown by the Subscriber type.
Subscriber user type:
- Regardless of which year you review, the weekday use by this type of user overwhelms the weekend numbers
- Thursdays and Fridays showed a decline in rides, but those numbers are still much higher than weekend numbers
- A cause of these declines could be a lack of bikes to use; how do we know how many bikes are at each station?
- How do we know if a user proceeds to their regular station to pick up a bike and realizes none are available?
- Does the app show the bikes available at each station at all times?
- Do we track how many times a user had to go to an alternate station due to the lack of availability of bikes?
- Those are numbers that can easily be tracked with a more user-friendly app.
- Once again it comes down to UX: how do we know the frustration level of a user if they have to walk several blocks to another station, or worse yet take a cab or walk instead?
busiest stations
#________________STATIONS WITH THE LARGEST NUMBER OF RIDES STARTED
top_stations <- trips_19_20 %>%
group_by(start_station_id) %>%
summarise(number_of_rides =n()) %>%
arrange(desc(number_of_rides))
top 5 starting stations
#________________TOP 5 STATIONS WHERE RIDES WERE INITIATED
top_5 <- top_stations %>%
slice(1:5)
observations
- Interesting numbers: the top 3 stations individually have about 56% more rides than the 4th-busiest station
- I will zoom in on these top 3 stations and see what insight I can gather
top 3 change/year/day
#_________________LET'S SEE % CHANGE FROM YEAR TO YEAR PER DAY FOR THE TOP 3 STATIONS
rides_top_stations <- trips_19_20 %>%
  filter(start_station_id == 192 | start_station_id == 77 | start_station_id == 91) %>%  # TOP 3 ONLY
  group_by(week_day, start_station_id, year) %>%
  summarise(number_of_rides = n()) %>%
  mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
  arrange(desc(start_station_id)) %>%
  gt() %>%
  tab_header(title = md("**Top 3 stations: rides/day year/year**"))
gtsave(rides_top_stations, "rides_top_stations.png")
observations
- First off, let me explain how to quickly read the table:
- It is divided into 3 vertical sections
- In column 1 you’ll see the day of the week (1 = Sunday) and the starting station id (e.g. 192), followed by two rows for 2019 and 2020; column 2 is the number of rides; column 3 is the percent change from the year before for that same day and station
- The obvious: weekday numbers (days 2-6) are more than 10X the weekend-day numbers
- Station 192: weekday ride numbers have a smaller percent increase from 2019
- Station 91: every day has a decrease in ride numbers except for Sunday, with the decreases being significant, close to a 20% decrease on average
- Station 77: saw decreases from Thursday through Saturday, with Wednesday having a 1% increase, which is not significant
- I’ll dig into the following next to see if it leads somewhere:
- There are about 4000 rides per WEEKDAY initiated from just these 3 stations.
- Is inventory being tracked at these stations, and if so, how? If not, are we relying on users to walk to nearby stations?
- Proximity of these 3 stations to one another
- First thing I worry about: are there enough bikes to rent out from these 3 locations?
- Is UX the cause of the decline in rides?
- I will check how many rides terminate at these top 3 stations to see if inventory is being replenished by migration from other stations
- I will check how many cyclists start and end their rides at these top 3 stations to see if that affects the supply of bikes
top 3 rides/2020
#________________DAILY RIDES STARTED FROM THE TOP 3 STATIONS FOR 2020
started_top3 <- trips_19_20 %>%
filter((start_station_id == 192 | start_station_id == 77 | start_station_id == 91) & year == 2020) %>%  # parentheses matter: apply the year filter to all three stations
group_by(week_day,start_station_id) %>%
summarise(number_of_rides =n()) %>%
arrange(desc(start_station_id)) %>%
gt()%>%
tab_header(
title =md("**Top 3 stations: rides/day for 2020** "),
subtitle = md("STARTING stations")
)
gtsave( started_top3, "started_top3s.png")
top 5 ending stations
#_____________________________FIGURE OUT THE TOP ENDING STATIONS
ending_trips <- trips_19_20 %>%
  group_by(end_station_id) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides))
#____________________________FILTER TO TOP 5 ENDING STATIONS AND PRINT A TABLE
end_5 <- ending_trips %>%
  slice(1:5) %>%
  gt() %>%
  tab_header(title = md("**Top 5 ENDING STATIONS**"))
gtsave(end_5, "end5.png")
chart rides per day
#_______________________________PLOT_________________________
trips_19_20 %>%
group_by(member_casual, week_day) %>%
summarise(number_of_rides =n()) %>%
ggplot(aes(x= week_day, y= number_of_rides, fill = member_casual, label = number_of_rides)) +
geom_bar(stat = "identity") +
geom_text(size = 3, position = position_stack(vjust = 0.5))+
labs(title="Total rides per day per user type")
PROCESS – SQL
We already changed column names to match in both tables earlier in R, so let’s continue from there with SQL.
filter tables
join tables
SELECT
  CAST(ride_id AS STRING) AS ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
  ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips20`
rename member_casual
SELECT
CAST(ride_id AS STRING) AS ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
CASE
WHEN member_casual = 'member' THEN 'Subscriber'
WHEN member_casual = 'casual' THEN 'Customer'
ELSE member_casual
END
FROM `prime-depot-415622.BikeShare.trips20`
verify trip count
Same as we did with R above in the table() to verify section, let’s check how many trips we have for each type of member_casual. The new code (compared to the above) is the outer query wrapped around what we had. The results displayed in the table are the same as what we had before with R up to this point.
- We’ll create a subquery to enclose what we’ve done so far
- Name the subquery -> subquery
- Group and count the number of occurrences for each type of rider
SELECT
  subquery.member_casual,
  COUNT(*) AS trip_count
FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
GROUP BY subquery.member_casual
OUTPUT
Subscriber: 720313, Customer: 71643, Total: 791956
calculate ride_length
SELECT
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
(
SELECT
CAST(ride_id AS STRING) AS ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
CASE
WHEN member_casual = 'member' THEN 'Subscriber'
WHEN member_casual = 'casual' THEN 'Customer'
ELSE member_casual
END
FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
filter out <=0 ride_length
SELECT
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;
OUTPUT
Total rows: 791746
look for “HQ QR”
- It appears that 3557 rows have a start_ or end_ station name of "HQ QR" AND a ride_length > 0, e.g.:
WHERE start_station_name = "HQ QR" AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
- It appears that 3768 rows match when we filter on the station names alone (the full count query is sketched below):
WHERE start_station_name = "HQ QR" OR end_station_name = "HQ QR";
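A minimal sketch of the full count query behind those numbers, wrapping a slimmed-down version of the same UNION ALL subquery used throughout (only the WHERE clause changes between the two counts):
SELECT COUNT(*) AS hq_rows
FROM
  ( -- same UNION ALL pattern as in the blocks above, trimmed to the needed columns
    SELECT started_at, ended_at, start_station_name, end_station_name
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT started_at, ended_at, start_station_name, end_station_name
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE (start_station_name = "HQ QR" OR end_station_name = "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;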
filter out “HQ QR”
SELECT
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
(
  SELECT
    CAST(ride_id AS STRING) AS ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    end_station_name,
    end_station_id,
    member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    end_station_name,
    end_station_id,
    CASE
      WHEN member_casual = 'member' THEN 'Subscriber'
      WHEN member_casual = 'casual' THEN 'Customer'
      ELSE member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;
OUTPUT
Total rows: 788188
add timeframes
Let's add the following columns so we have the ability to expand our analysis. The way the data is set up now, all we have are ride_id and the TIMESTAMP columns for the start and finish of each ride.
SELECT
EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
EXTRACT(MONTH FROM subquery.started_at) AS month,
EXTRACT(YEAR FROM subquery.started_at) AS year,
EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
EXTRACT(DATE FROM subquery.started_at) AS date,
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
(
SELECT
CAST(ride_id AS STRING) AS ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
end_station_name,
end_station_id,
member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
end_station_name,
end_station_id,
CASE
WHEN member_casual = 'member' THEN 'Subscriber'
WHEN member_casual = 'casual' THEN 'Customer'
ELSE member_casual
END
FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
(start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
AND
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;
ANALYZE – SQL
Analyze with SQL
count rides/User
We create another subquery that wraps the previous ones; note the code added to the previous block:
SELECT
  subquery2.member_casual,
  COUNT(*) AS trip_count
FROM
(
  SELECT
    subquery.member_casual,
    EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
    EXTRACT(MONTH FROM subquery.started_at) AS month,
    EXTRACT(YEAR FROM subquery.started_at) AS year,
    EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
    EXTRACT(DATE FROM subquery.started_at) AS date,
    TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
GROUP BY subquery2.member_casual;
OUTPUT
member_casual | trip_count
Subscriber | 720311
Customer | 67877
group & sort
Let's group the number of trips by user type (member_casual) and by day of the week. (In BigQuery, EXTRACT(DAYOFWEEK FROM ...) returns 1 for Sunday through 7 for Saturday.)
We just add the new code to the previous block:
SELECT
  subquery2.member_casual,
  subquery2.day_of_week,
  COUNT(*) AS trip_count
FROM
(
  SELECT
    subquery.member_casual,
    EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
    EXTRACT(MONTH FROM subquery.started_at) AS month,
    EXTRACT(YEAR FROM subquery.started_at) AS year,
    EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
    EXTRACT(DATE FROM subquery.started_at) AS date,
    TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
GROUP BY subquery2.member_casual, subquery2.day_of_week
ORDER BY subquery2.day_of_week, subquery2.member_casual;
OUTPUT: I created the same output in various charts with Tableau, as shown below.
ANALYZE – TABLEAU
Chart analyses
total rides per day
rides per day/year
Let's see what has happened from year to year for the same quarter, Q1. It appears that the number of rides has gone up on all days EXCEPT Thursdays and Fridays, where it has decreased.
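For reference, here is a minimal sketch of the aggregation behind this year-over-year chart, reusing the unioned tables and filters from the SQL section (only the columns the grouping needs are kept):
SELECT
  EXTRACT(YEAR FROM subquery.started_at) AS year,
  EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
  COUNT(*) AS trip_count
FROM
(
  SELECT started_at, ended_at, start_station_name, end_station_name
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT started_at, ended_at, start_station_name, end_station_name
  FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
GROUP BY year, day_of_week
ORDER BY year, day_of_week;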
total rides per year
location of top 5
top 5 start stations
What's interesting about the chart that follows:
top 5 v2
top 5 end stations
top5 start=end
breakdown by quarters
The results show an obvious discrepancy: Subscribers' trip count is much higher than Customers' on weekdays, while on weekend days it's the opposite.
Let's see if the results change in different quarters... WAIT, the data provided covers only Q1 of 2019 and 2020, so there is nothing to compare yet.
That's one thing we'd need to do: load up all months, or create a separate dataframe for each quarter, and then do the comparison.
We just add the new code to the previous block (note that member_casual has to be carried into subquery2's SELECT list, and quarter added to the GROUP BY, for this to run):
SELECT
  subquery2.quarter,
  subquery2.member_casual,
  subquery2.day_of_week,
  COUNT(*) AS trip_count
FROM
(
  SELECT
    subquery.member_casual,
    EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
    EXTRACT(MONTH FROM subquery.started_at) AS month,
    EXTRACT(YEAR FROM subquery.started_at) AS year,
    EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
    EXTRACT(DATE FROM subquery.started_at) AS date,
    TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
GROUP BY subquery2.quarter, subquery2.member_casual, subquery2.day_of_week
ORDER BY subquery2.quarter, subquery2.day_of_week, subquery2.member_casual;
same station rides
Let’s see how many rides ended at the location where they started:
WHERE
  subquery2.start_station_id = subquery2.end_station_id
OUTPUT
Total rows: 23626
(For this filter to work, start_station_id and end_station_id have to be carried through subquery2's SELECT list; in the blocks above they were only selected in the inner subquery.)
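Here is a minimal sketch of that full query under that assumption, with the intermediate columns trimmed to what the filter needs:
SELECT COUNT(*) AS same_station_rides
FROM
(
  SELECT
    subquery.start_station_id,
    subquery.end_station_id
  FROM
  (
    SELECT started_at, ended_at, start_station_id, start_station_name,
           end_station_name, end_station_id
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT started_at, ended_at, start_station_id, start_station_name,
           end_station_name, end_station_id
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
WHERE subquery2.start_station_id = subquery2.end_station_id;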
Observations
If we group by station and sort by trip_count, a glaring figure jumps out at us: the top 3 stations by ride count (each close to 50% more than fourth place) are, in order: stations 192, 91, and 77.
rank | member_casual | station_id | trip_count
1 | Subscriber | 192 | 13799
2 | Subscriber | 91 | 13434
3 | Subscriber | 77 | 12891
4 | Subscriber | 133 | 8720
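The figures above come from grouping the filtered trips by station. Here is a minimal sketch of that grouping, assuming the counts are by start station and reusing the same filters (drop member_casual from the SELECT and GROUP BY to get the all-users version shown next):
SELECT
  subquery.member_casual,
  subquery.start_station_id,
  COUNT(*) AS trip_count
FROM
(
  SELECT started_at, ended_at, start_station_id, start_station_name,
         end_station_name, member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT started_at, ended_at, start_station_id, start_station_name,
         end_station_name,
    CASE
      WHEN member_casual = 'member' THEN 'Subscriber'
      WHEN member_casual = 'casual' THEN 'Customer'
      ELSE member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
GROUP BY subquery.member_casual, subquery.start_station_id
ORDER BY trip_count DESC;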
If we add the other user type and recalculate for all users, we get:
rank | station_id | trip_count
1 | 192 | 14155
2 | 91 | 13640
3 | 77 | 13362
As you can see, Customer rides (the all-user total minus the Subscriber count) account for:
- 356 of the 14155 total trips at station 192 (14155 - 13799)
- 206 of the 13640 total trips at station 91 (13640 - 13434)
- 471 of the 13362 total trips at station 77 (13362 - 12891)
observations
- The company claims that "Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day." How is that number calculated?
- The two plans offered to casual riders (single-ride passes and full-day passes) are not tracked in the data provided; all we have is the user type, "Customer" or "Subscriber".
- The data provided doesn't link any of the rides to an actual account. Why don't we track by userId? Don't Subscribers have to create an account? Why are we sorting the data only by member_casual?
- We could gather more insight if each rideId were linked to a userId; is that data available but simply not provided for analysis? (See the sketch after this list.)
- How are docking station locations related to subway, train, or bus stations?
- Are docking stations near parking structures or major tourist, business centers?
- Are users arriving in the city via other modes of transportation and using the bikes for commuting within the city?
- The objective of the project is to find ways to convert "Customers" to "Subscribers", which is logical, except the data provided doesn't support the theory proposed by Mrs. Moreno. The data shows a large discrepancy between the two user types: Customer rides amount to about 9% of Subscriber rides. See the table below.
- The data does not show how many actual users each type represents, and it would be illogical to treat rideId as a userId. As provided, the data does not support an insightful analysis of the particular hypothesis presented.
- Converting 9% of rides might improve the bottom line, sure, but at what cost? How many users would we actually be converting? How many Subscribers do we already have?
- Considering that I was not provided with data relevant to the proposed hypothesis, I would shift my focus to other issues that the data has exposed.
- The fact that weekday ride numbers have been decreasing at the busiest stations is alarming (see the last table below). The top stations are very close to one another, so it is possible that users are not getting good service there, or that a competitor has entered the arena and is targeting that specific small area. Maybe the inventory at those stations doesn't support the volume of rides initiated there?
- The fact that inventory at those stations is not tracked needs to be addressed.
- With the data provided, the top stations matter far more to the bottom line than the proposed hypothesis. We cannot worry about converting 9% of rides while the top 3 stations are losing rides at a faster pace, and in larger numbers, than that 9%.
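To make the userId point above concrete: if each ride carried a user_id (a hypothetical column, not present in the data we were given), a sketch like the following would show how many actual riders each segment represents and how heavily each rider uses the service:
SELECT
  member_casual,
  COUNT(DISTINCT user_id) AS distinct_users,                 -- user_id is hypothetical
  COUNT(*) AS trip_count,
  ROUND(COUNT(*) / COUNT(DISTINCT user_id), 1) AS rides_per_user
FROM `prime-depot-415622.BikeShare.trips20`                  -- assumes the column were added upstream
GROUP BY member_casual;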
RECOMMENDATIONS
- Modify the data we track.
- Implement userId so we can focus our analysis and be more insightful.
- Focus on UX at the top stations.
- Stabilize ride numbers at the top stations.
- Reassess the situation in the near future after modifications have been implemented.