What follows are some case studies that include “Documentation of any cleaning or manipulation of data” as part of the deliverables, so I’ve included all of my step-by-step details.
I’ll start off with R, then I’ll duplicate the process using SQL, and finally Tableau. Other analyses done in Python are not included here; they are covered in other posts.
WELLABEAT – CS1
Founded in 2014, Wellabeat developed one of the first wearables specifically designed for women and has since gone on to create a portfolio of digital products for tracking and improving women’s health.
Focusing on innovative health and wellness products for women, its mission is to empower women to take control of their health by providing technology-driven solutions that blend design and function, giving women the tools to reach their fullest potential through personalized wellness solutions aligned with their cycles. Here are the products that Wellabeat offers:
- Wellabeat app: The Wellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Wellabeat app connects to their line of smart wellness products.
- Leaf: Wellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Wellabeat app to track activity, sleep, and stress.
- Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Wellabeat app to provide you with insights into your daily wellness.
- Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Wellabeat app to track your hydration levels.
- Wellabeat membership: Wellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Here are some of the metrics that the app tracks based on information found on their site:
- meditation rate
- starts tracking at “first step”
- starts tracking at “first shut eye”
- ovulation cycle, temperature, pregnancy term….
- cardiac coherence!!??
- stress tolerance, which once again is based on heart rate and temperature levels?
- activity level
Purpose
- The goal of this project is to analyze Wellabeat’s provided data to reveal more opportunities for growth.
- Focus on one Wellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart non-Wellabeat devices.
- Then, using this information, make high-level recommendations for how these trends can inform Wellabeat marketing strategy.
- My report will be directed to the executive team:
- Urška Sršen: Wellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Wellabeat’s cofounder, key member of the Wellabeat executive team
Deliverables
I will produce a report with the following deliverables:
- A clear summary of the business task.
- A description of all data sources used
- Documentation of any cleaning or manipulation of data
- A summary of my analysis
- Supporting visualizations and key findings
- Top high-level content recommendations based on the analysis
Business task
The executive team believes that analyzing smart device fitness data from other products on the market could help unlock new growth opportunities for the company. I have been asked to focus on one of Wellabeat’s products and analyze external smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company.
Scope/major activity
data collection
- FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
- It is believed that the provided dataset has limitations.
identify one product
After reviewing Wellabeat’s site I realized that the information provided by the company differs from what is actually displayed on the current site. I’ll use one of the following products to focus the analysis. All of the listed products provide the same data to the user; what varies is the type of device:
- Ivy+ – wrist wearable tracker
- Leaf Urban – can be worn as a necklace or in a wrist band
- Leaf Chakra – can be pinned to a collar or blouse, or placed in a wrist band
create recommendations
- What are some trends in smart device usage?
- How could these trends apply to Wellabeat customers?
- How could these trends help influence Wellabeat marketing strategy?
Ask
There are many questions I have that went unanswered:
- Why are we using data from 2016?
- The data provided appears to target the fitness/workout community: steps per day, workout type, workout duration… while Wellabeat appears to be more about mental wellness for pregnant women!?
- Is the company thinking about branching into that highly competitive domain?
- Why doesn’t Wellabeat provide its own data for analysis?
- Does Wellabeat collect data?
- Does Wellabeat know how to track the metric most used or viewed by its users?
- Does Wellabeat know how to track the pages in the app that are most viewed by its users?
- Why isn’t the science behind the claims described on the site more transparent? Being open with the science behind the metrics might create a loyal customer base!
- Wellabeat claims it targets women; does Wellabeat know exactly how many of its users are women?
- Does Wellabeat know the age range or age concentration of its users?
- The site makes it obvious that the target market is pregnant women; does Wellabeat have any idea how many of their users are/were/will be pregnant?
Prepare
data
- Data is old: 3/2016 to 5/2016
- Covers 30 users of FitBit Fitness tracker
- Period of 62 days
- After reviewing Wellabeat’s site it appears that the data provided is geared more towards fitness rather than what Wellabeat monitors
- The workouts metric used by Wellabeat is in common with the data provided
- Activity level and calories burned are additional common metrics
- One very important metric in the data is the manual weigh-in, which is not tracked by Wellabeat
- The data provided shows activity levels, distance walked, calories burned… are those manually triggered by the user, or does the product continuously collect data? What I need to know is whether the user turns on the data tracking manually or the app tracks continually; I’d like to know how invested the users are. If the app tracks all that information automatically, then how do we know whether the user pays attention, reviews the data, or makes life decisions based on what’s been tracked? In other words, is the data provided for analysis helping any of the users, or is it just metrics that the user never reviews? And if users do review it, how many of them actually do, and how helpful are these metrics to them? Are we solving a problem? I have a Nike app on my phone that I used 2-3 times to track my running path, length, and distance. I haven’t used it in years and yet it still tracks my everyday movements! That’s data it collects that I never look at, and I don't even use the app.
- The data doesn’t mention the gender, race, or age of the 30 users
- Activity level is measured in the app? How is that actually calculated? Is the app linked to an exercise device?
- Does the app track the distance traveled by the phone and, if the movement is greater than a specific speed, assume it’s a high-intensity workout?! ….
- With all these questions, it’s obvious that there is a clear bias in the data
PROCESS – R
Note: I will be using R, SQL, and Tableau for processing and analysis, most of the same steps done in R will be repeated in SQL
Cleaning with R
read data
#______________________________INSTALL PACKAGES
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
install.packages("ggplot2")
#______________________________LOAD PACKAGES
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(stringr)
#______________________READ DAILY ACTIVITY DATA FROM MONTHS 3 TO 4 OF 2016
dailyactivity_df_3_4 <- read_csv("dailyActivity_merged_3_4.csv")
head(dailyactivity_df_3_4)
#_______________________FILTER TO RELEVANT COLUMNS
activity_3_4 <- dailyactivity_df_3_4 %>%
  select(Id, ActivityDate, TotalSteps, TotalDistance, Calories)
#________________________CONVERT FROM <char> TO <date> AND GROUP BY Id
edited_activity_3_4 <- activity_3_4 %>%
  mutate(mDate = mdy(ActivityDate)) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_activity_3_4)
# OUTPUT: TOTAL 457 ROWS
observations
The obvious path would be to visualize the correlation between all the metrics collected in the data, such as calories burned vs workout type, workout length, workout load type… but in my opinion that would be a waste of time and is irrelevant to our business task. The relationship between working out, workout load, time under stress, and calories burned has long been established.
Each variable contributes directly to calories burned; we are not doctors, nor would 30 users monitored for 62 days change the science, so I will not spend time on the majority of the data provided.
My focus will be on UX
- Does the product provide a solution to the users?
- Is there a benefit to the user for using the product?
- There is one metric that’s very important and displays active interaction with the users, and that’s the manual weigh-in
- For the user to take the time to get weighed and then manually enter the information in the app is extremely significant. It’s possible that the app used to collect the data could be linked to a Bluetooth- or wifi-enabled scale that automatically logs the weight once a value is read!?
- That would be a great service/device that could be a focus for Wellabeat.
read data
#_____________________READ WEIGHT LOGIN FROM 3 TO 4 OF 2016
weightloginfor_df_3_4 <- read_csv("weightLogInfo_merged_3_4.csv")
# Convert the Date column from <chr> to <date> and group by Id just like we did earlier.
# It appears that most of the Dates are displayed as 11:59:59 PM, so we have to assume that maybe the system
# purges at midnight; let's edit the date to drop the hms.
# It appears that when we drop the hms we get NA, so we'll keep it as
# mdy_hms and convert from date-time to date with as_date() after we convert the <chr> to datetime.
edited_weighin_3_4 <- weightloginfor_df_3_4 %>%
  select(Id, Date, WeightPounds, BMI) %>%
  mutate(mDate = as_date(mdy_hms(Date))) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_weighin_3_4)
# OUTPUT: TOTAL 33 ROWS
observations
- What’s interesting is that we get 33 weigh-in rows for only 30 users in March. Do we assume some are duplicates?
- Maybe they created multiple accounts?
- Is it a requirement for all users to log in to the weigh-in page before the app starts tracking and making calculations?
- Are the calories-burned calculations based on the weight of the user?
- Is it possible that once the user installs the app and creates an account, it automatically shows that they logged in to the weigh-in page?
- I need to dive into this data more and see how many of the users actually entered their weight and how many just logged in and never entered their weight (see the sketch after this list).
- Is this data from newly enrolled users? or
- Was the data extracted from a random timeframe?
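As a quick check on the duplicate question, here is a minimal sketch (using the column names from the weigh-in file read above) that counts distinct Ids and rows per Id in the March weigh-in data:
# Hypothetical check: how many distinct Ids produced the 33 weigh-in rows?
n_distinct(edited_weighin_3_4$Id)
# Rows per Id — several rows for one Id would hint at repeat entries rather than duplicate accounts
edited_weighin_3_4 %>%
  group_by(Id) %>%
  summarize(rows = n()) %>%
  arrange(desc(rows))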
merge data
#___________________________MERGE DAILY ACTIVITY AND WEIGHT-IN FOR MARCH _______________________
merged_3_4 <- left_join(edited_activity_3_4, edited_weighin_3_4, by = c("Id", "mDate"))
View(merged_3_4)
# OUTPUT: TOTAL 457 ROWS
#_____________________REPEAT ABOVE PROCESSES FOR DAILY ACTIVITY DATA FROM 4 TO 5 OF 2016_________
dailyactivity_df_4_5 <- read_csv("dailyActivity_merged_4_5.csv")
#_____________________FILTER DATASET
activity_4_5 <- dailyactivity_df_4_5 %>%
  select(Id, ActivityDate, TotalSteps, TotalDistance, Calories)
#_____________________CONVERT DATE COLUMN
edited_activity_4_5 <- activity_4_5 %>%
  mutate(mDate = mdy(ActivityDate)) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_activity_4_5)
# OUTPUT: TOTAL OF 940 ROWS
#_____________________READ WEIGHT LOGIN FROM 4 TO 5 OF 2016_________
weightloginfor_df_4_5 <- read_csv("weightLogInfo_merged_4_5.csv")
#_____________________FILTER DATASET AND CONVERT TOGETHER
edited_weighin_4_5 <- weightloginfor_df_4_5 %>%
  select(Id, Date, WeightPounds, BMI) %>%
  mutate(mDate = as_date(mdy_hms(Date))) %>%
  group_by(Id) %>%
  arrange(mDate, .by_group = TRUE)
View(edited_weighin_4_5)
# OUTPUT: TOTAL OF 67 ROWS
#______________________MERGE BOTH APRIL DATASETS
merged_4_5 <- left_join(edited_activity_4_5, edited_weighin_4_5, by = c("Id", "mDate"))
View(merged_4_5)
# OUTPUT: TOTAL OF 940 ROWS
#______________________BIND THE TWO MERGED TO YIELD FILTERED ENTIRE DATASETS
# SHOULD HAVE 457 + 940 = 1397 ROWS
merged_data_3_5 <- rbind(merged_3_4, merged_4_5)
#____________________REVIEW STRUCTURE OF ALL COLUMNS
str(merged_data_3_5)
#______________________FILTER OUT NON-RELEVANT COLUMNS
data_3_5 <- merged_data_3_5 %>%
  select(Id, mDate, TotalSteps, TotalDistance, Calories, WeightPounds, BMI)
#_____________________SAVE AS NEW DATASET _____________________
file.create("bellabeat_cleaned_filtered_3_5.csv")
write_csv(data_3_5, "bellabeat_cleaned_filtered_3_5.csv")
ANALYZE – R
Analyze with R
logins
Let’s see how many users actually manually logged in to the weight app:
#__________________LET'S SEE HOW MANY USERS LOGGED IN TO THE WEIGHT APP
logins <- data_3_5 %>%
  group_by(Id) %>%
  summarize(number_of_logins = n()) %>%
  arrange(desc(Id))
We get 35 users that logged in to the weigh-in app. The table is shown below, next to the weigh-ins table.
weighins
#_______________LET'S SEE HOW MANY OF THE USERS ACTUALLY ENTERED A WEIGHT (WEIGH-INS)
weighins <- data_3_5 %>%
  ungroup() %>%
  filter(!is.na(WeightPounds)) %>%  # missing weights are NA values, not the string "NA"
  group_by(Id) %>%
  summarize(number_of_weighins = n()) %>%
  arrange(desc(Id))
As you can see, only 13 out of the 35 users actually entered their weight in the app.
logged & weighed
#________________LET'S JOIN THE TWO TOGETHER TO SEE HOW MANY LOGGED-IN AND WEIGHED-IN
login_weighin <- left_join(logins, weighins, by = "Id")
#___________I COULD USE SLICE() TO FILTER OUT THE RELEVANT ROWS BUT THE TABLE IS SMALL ENOUGH TO READ
NOTE: it is extremely important not to join logins and weighins after you execute gt() on them. Perform the join and print the joined table first, then go back and pass the previous two tables to gt() if you wish to render them. If you join the gt() outputs instead, you’ll get this error:
Error in UseMethod("left_join") :
no applicable method for 'left_join' applied to an object of class "c('gt_tbl', 'list')"
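In other words, keep the gt() rendering at the end of the pipeline. A minimal sketch of the safe ordering, assuming the gt package is loaded:
library(gt)
# Join the plain tibbles first...
login_weighin <- left_join(logins, weighins, by = "Id")
# ...then render each table with gt() afterwards
gt(login_weighin)
gt(logins)
gt(weighins)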
observations
The data is promising:
- Out of the 30 users that were tracked (the data shows 35 users, which might be due to duplicate accounts?!), 13 actually logged in to the weight app on a regular basis, as displayed in the number_of_logins column
- Over a 60-day period you can see in column 2 that the users logged in on an extremely consistent basis: 30, 40, and over 50 times
- 13 users entered their weight in the app. Once again, I really have no way of knowing if the weigh-in side of the app is any different from the main app, but I have to assume that it is because the login information is included in the weigh-in data, not in the daily activity data.
- 6 out of the 13 entered their weight once over a 60-day period. I need to follow up on these users and inspect their activity levels over the time tracked. Maybe their lone entry could be explained by their activity level.
- 5 of the 13 entered their weight multiple times. Over a 60-day period you really don’t expect much change in weight unless you are actively trying to gain or lose it. So once again we need to investigate their activity level to see if there is a correlation between multiple entries and activity; maybe they started or finished a program?
- 1 user had 33 weigh-ins and another had 44. That seems extreme: someone is either on a hardcore weight-loss program, or maybe pregnant or possibly post-pregnancy. Once again, activity level needs to be investigated.
- Overall, 13 out of 30 users is a great percentage; this appears to be a function that provides a great UX and needs to be added to Wellabeat.
ANALYZE – SQL
Analyze with SQL
unique Ids
Let’s look at how many unique Ids we have; we were told that the data covers 30 users. The data generated is saved in file: unique_Id_total_35.csv
NOTE: I’m using a condensed code spacing and indentation to save on screen size
SELECT COUNT(DISTINCT(Id))
FROM `prime-depot-415622.bellabeat.bellat_merged`
The code gives us 35.
weightLogIn entries
Let’s see how many entries in the weightLogInfo dataset were not NA for both the BMI and WeightPounds columns. In other words, how many times did a user enter a value for both BMI and weight out of the 1397 total entries. Data is saved in file: weight_or_bmi_entries_100.csv.
SELECT Id, mDate
FROM `prime-depot-415622.bellabeat.bellat_merged`
WHERE BMI <> "NA" AND WeightPounds <> "NA"
The answer is 100 rows.
weightLogIn users
Let’s see how many of those weightLogIn entries belonged to unique users. In other words how many users actually logged in their weight. Data is saved in file: users_that_weighedin_13.csv
SELECT Id, COUNT(*) AS number_of_weighins
FROM `prime-depot-415622.bellabeat.bellat_merged`
WHERE BMI <> "NA" AND WeightPounds <> "NA"
GROUP BY Id
ORDER BY number_of_weighins
The answer is 13. So out of the 35 unique users, 13 actually (manually?!) entered their weight. That's 37% (a one-query way to compute that share is sketched below).
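A minimal sketch to compute that share in one query, assuming, as in the queries above, that missing values were imported as the string "NA":
SELECT
  COUNT(DISTINCT IF(BMI <> "NA" AND WeightPounds <> "NA", Id, NULL)) AS users_who_weighed_in,
  COUNT(DISTINCT Id) AS unique_users,
  ROUND(100 * COUNT(DISTINCT IF(BMI <> "NA" AND WeightPounds <> "NA", Id, NULL))
        / COUNT(DISTINCT Id), 1) AS pct_weighed_in
FROM `prime-depot-415622.bellabeat.bellat_merged`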
multiple entries
Let’s see how many users had multiple entries of their weight:
A brief explanation of the code below:
- Start from the inner subquery: select the Id column, then COUNT how many times each Id (which is GROUPed BY) logged in their weight.
- We find that out by filtering the dataset with a WHERE clause that only keeps rows where values for BMI AND WeightPounds were entered, i.e. not NULL or ="NA"
- Then we group the entries by Id, count them, and
- Name the subquery AS subquery
- Now let’s step back out to the outer query, SELECT what we just did in the subquery, pull the Id, and
- Assign the count of number_of_weighins to multiple_logins, because we will use a
- WHERE clause to only display the users who logged in multiple times to enter their weight
- That value comes to 7
- The breakdown of the multiple logins is as follows:
SELECT
subquery.Id,
subquery.number_of_weighins AS multiple_logins
FROM
(
SELECT
Id,
COUNT(*) AS number_of_weighins
FROM
`prime-depot-415622.bellabeat.bellat_merged`
WHERE
BMI <> "NA" AND WeightPounds <> "NA"
GROUP BY
Id
) AS subquery
WHERE
subquery.number_of_weighins > 1
ORDER BY
number_of_weighins
Share
For a period of 62 days (from 3/12/2016 to 5/12/2016), out of the 13 unique users that logged in their weight, how many actually had multiple entries?
[table id=4 /]
observations
- It appears that it is not necessary for the user to enter their weight for the app to make its calculations
- Out of the 30 users provided we find 35 unique Ids
- One user entered their weight 44 times in 62 days.
- 13 out of 35 unique users used the manual weigh-in feature; that’s 37%, which is significant over a 62-day period
- The time frame and sample size are small, so the validity of my findings is highly unreliable
RECOMMENDATIONS
- UX should be the number one priority for Wellabeat. The app needs to be redesigned to include the ability to track users’ use of the app.
- Wellabeat’s mission is to help women achieve a healthy lifestyle; is that actually happening? Have we asked the users? Have we collected data to back up our claims?
- Aside from looking within and improving our existing offerings, my main recommendation is to introduce a scale that syncs with the existing app, if product design, supply, and maintenance are not a deterrent.
- The other option would be to introduce an additional service via the existing app that can link with smart scales via Bluetooth and/or wifi.
- I’d imagine tracking weight before, during, and after pregnancy is extremely important. It’s obvious from the limited data provided that weigh-ins are popular and effective in the marketplace.
- Why did Wellabeat stop offering Spring, the fluid-intake tracking product? Hydration is one of the most important aspects of health and fitness, even more so during or after pregnancy. Was that a product/supply issue? That should be a focus.
- The app should track which page is viewed, how often each page is viewed, and the time of day it is viewed. That data will give us more insightful information (see the sketch after this list).
- Age, gender, and pregnancy status need to be tracked, either via a questionnaire/survey or at account/profile creation.
- My suggestions are easy to implement and should provide more insight as to the direction marketing should take in the near future.
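To illustrate the in-app tracking recommendation above, here is a minimal, hypothetical sketch of an event table the app could populate; the table and column names are my own invention, not anything Wellabeat actually has:
-- Hypothetical page-view event table for the Wellabeat app
CREATE TABLE app_page_views (
  event_id   STRING,    -- unique event identifier
  user_id    STRING,    -- ties views to a user profile
  page_name  STRING,    -- e.g. "weigh-in", "sleep", "cycle"
  viewed_at  TIMESTAMP  -- gives page, frequency, and time of day
);

-- Example question it would answer: which pages are viewed most, and when?
SELECT page_name,
       EXTRACT(HOUR FROM viewed_at) AS hour_of_day,
       COUNT(*) AS views
FROM app_page_views
GROUP BY page_name, hour_of_day
ORDER BY views DESC;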
BIKESHARE – CS2
A bike-share program that features more than 5,800 bicycles and 600 docking stations. Bikeshare sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Bikeshare users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
In 2016, Bikeshare launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Bikeshare’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans:
- single-ride passes
- full-day passes and
- annual memberships
Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are members. Bikeshare’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Bikeshare attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth.
Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Bikeshare program and have chosen Bikeshare for their mobility needs.
Purpose
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Bikeshare bikes differently.
From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Bikeshare executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Moreno has set a clear goal:
Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to know:
- How do annual members and casual riders use bikes differently?
- Why would casual riders buy annual memberships?
- How can Bikeshare use digital media to influence casual riders to become members?
- Moreno and her team are interested in analyzing the historical bike trip data to identify trends.
Stakeholders
- Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
These may include email, social media, and other channels.
- Bikeshare marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide the marketing strategy. You joined this team six months ago and have been busy learning about Bikeshare’s mission and business goals—as well as how you, as a junior data analyst, can help Bikeshare achieve them.
- The executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Deliverables
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top three recommendations based on your analysis
Prepare
data
Bikeshare’s historical trip data is downloadable here.
The data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Bikeshare bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information.
Note: if using Posit’s RStudio, use the Divvy 2019 Q1 and Divvy 2020 Q1 datasets. Choosing other data might lead to errors because the data exceeds the memory available in the free plan.
load data
#______________________________INSTALL PACKAGES
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
install.packages("ggplot2")
install.packages("readr")
install.packages("readxl")
install.packages("gt")
install.packages("webshot2")
#______________________________LOAD LIBRARIES
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(stringr)
library(readr)
library(readxl)
library(dplyr)
library(gt)
library(webshot2)
#______________________READ DATA FROM Q1 2019 AND 2020
trips19 <- read_excel("BikeShare/BikeShareOriginalData/Divvy_Trips_2019_Q1.xlsx")
trips20 <- read_excel("BikeShare/BikeShareOriginalData/Divvy_Trips_2020_Q1.xlsx")
observations
- The two plans offered to casual riders, single-ride passes and full-day passes, are not tracked in the data provided; all that is tracked is whether the user is a “Customer” or a “Subscriber”.
- The company claims that “Bikeshare users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day”; how is that number calculated? Nothing in the data shows how it could possibly be calculated!
- UserId is not tracked!? How are we to guess how many users are using the system? This is a basic data-collection issue which makes the data broad and almost useless.
- Birth year and gender data are collected, which could become helpful if we can figure out how many users we have.
- Up to this point I pretty much know which direction I have to shift my analysis based on what’s provided in the data. This is not bias by any means; it is the simple understanding that the company lacks the knowledge of collecting the right data. It’s like trying to estimate the mpg for a car based on data collected on wiper fluid level.
- At this point I’d have to take my concerns to my manager and advise them of the limitations of the data collected. I will also need to bring up the fact that her hypothesis, “Moreno believes there is a solid opportunity to convert casual riders into members”, will be impossible to justify with this data, since we don’t know the exact number of casual and member users.
- Either additional data needs to be provided or the system needs to be adjusted to collect the userId.
- Since the data is provided for Customers and Subscribers as a whole and not broken down per user, I’ll have to see what insight can still be extracted.
- Column naming took a significant change in 2020 compared to 2019 (a quick way to surface these differences is sketched after this list):
  - trip_id > ride_id
  - start_time > started_at
  - end_time > ended_at
  - bike_id, tripduration, gender, and birthyear were all removed
  - from_station_id & from_station_name > start_station_id & start_station_name
  - to_station_id & to_station_name > end_station_id & end_station_name
  - usertype > member_casual
- end_lat, end_lng, start_lat, and start_lng were added
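As referenced above, a minimal sketch for surfacing these column differences between the two data frames loaded earlier:
# Columns present in 2019 but dropped in 2020
setdiff(names(trips19), names(trips20))
# Columns newly added in 2020
setdiff(names(trips20), names(trips19))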
PROCESS – R
I usually detail my steps in “documents” for record keeping; in this project, “Documentation of any cleaning or manipulation of data” is part of the deliverables, so I’ve included it below.
Note: I will be using R, SQL, and Tableau for processing and analysis, most of the same steps done in R will be repeated in SQL
Cleaning with R
rename
#_____________________RENAME 2019 COLUMNS TO MATCH 2020
trips19 <- rename( trips19,
ride_id = trip_id,
started_at = start_time,
ended_at = end_time,
rideable_type = bikeid,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
change type
trips19 has ride_id as “num” while trips20 has it as “chr”, so let’s change trips19 to “chr” so we can merge/bind both datasets together without any issues!
#____________________CONVERT ride_id & rideable_type FROM num TO char
trips19 <- mutate (trips19,
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
vertical merge datasets
rbind() is similar but REQUIRES the same number of identically named columns in all datasets, so we use bind_rows() instead: trips19 has columns in common with trips20 as well as unique ones (tripduration, gender & birthyear) that aren’t in trips20. On the other hand, trips20 has columns (start_lat, start_lng, end_lat, end_lng) that don’t exist in trips19.
#___________________COMBINE BOTH DATASETS VERTICALLY VIA bind_rows()
trips19_20 <- bind_rows(trips19, trips20)
filter/Exclude Columns
#__________________FILTER NEW DATASETS PRIOR TO MANIPULATION - EXCLUDE CERTAIN COLUMNS
trim_trips19_20 <- subset(trips19_20, select = -c(gender, birthyear))
#__________________OR YOU CAN DO IT THIS WAY AS WELL
trim_trips19_20 <- trips19_20 %>%
  select(-c(gender, birthyear))
replace with case_when()
The member_casual column had values in the 2019 dataset that we need to update to match the values we’ll use from 2020 forward. We can use case_when() or case_match(); recode() has been superseded by case_match() (see the case_match() sketch after the code below):
#_________________RENAME member_casual COLUMN FROM member & casual TO Subscriber & Customer
all_trips19_20 <- trim_trips19_20 %>%
mutate(member_casual = case_when(
member_casual == "member" ~ "Subscriber",
member_casual == "casual" ~ "Customer",
member_casual == "Subscriber" ~ "Subscriber",
TRUE ~ "Customer"
))
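For reference, a minimal sketch of the same recode using case_match(), assuming dplyr 1.1.0 or later:
all_trips19_20 <- trim_trips19_20 %>%
  mutate(member_casual = case_match(member_casual,
    "member" ~ "Subscriber",   # 2020 label -> 2019 label
    "casual" ~ "Customer",
    .default = member_casual   # "Subscriber"/"Customer" pass through unchanged
  ))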
table() to verify
Just to make sure we ended up with ONLY the two desired values and didn’t miss anything, we can use the code below to check which values exist in the member_casual column:
#_________________CHECK TO SEE IF ALL WE HAVE ARE THE TWO DESIRED VALUES
table(all_trips19_20$member_casual)
# OUTPUT
#   Customer Subscriber
#      71643     720313
Observations
- My prior observations regarding the data are validated, as we can only gather the total number of rides taken by “Customers” vs “Subscribers”.
- We can break down each type per station, per day, per hour, per gender, per age, but it’s impossible to know how many actual “riders” we are dealing with. How are we supposed to know how many “Customer” users there are?
- How can we even imagine an entire marketing campaign in an attempt to convert users that we can’t quantify?
- The data shows that Customers account for roughly 9% of all trips (71,643 of 791,956; see the sketch after this list)!
- Once again, not knowing how many actual users generated all these rides, it is NOT ADVISABLE to spend ANY resources trying to analyze nor devise a marketing strategy or campaign to convert the users behind that small share of our rides.
- My recommendation to my manager is that we stop this analysis at this point until additional data is provided, or collection methods are revised.
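The share quoted above can be computed straight from the table() output; a minimal sketch:
# Proportion of rides per user type
prop.table(table(all_trips19_20$member_casual))
# Customer ~0.09, Subscriber ~0.91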
calculate ride_length
difftime() yields a duration (drtn) column, which displays in the df as 390 secs, 441 secs, …, so I use as.numeric() to convert it to a plain number, which drops the "secs" label as well.
#____________________CALCULATE ride_length AND DROP tripduration
all_trips19_20 <- all_trips19_20 %>%
mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = 'secs')))
drop column
#____________________DROP tripduration COLUMN
all_trips19_20 <- all_trips19_20 %>%
select(-c(tripduration))
check for ZERO values
Let’s check to see what kind of values are in ride_length
#___________________CHECK FOR VALUES == 0 IN ride_length
any(all_trips19_20$ride_length == 0)
# [1] TRUE
check for <0 values
#___________________CHECK FOR NEGATIVE VALUES IN ride_length
any(all_trips19_20$ride_length < 0)
# [1] TRUE
#___________________ANOTHER WAY IS USING all()
all(all_trips19_20$ride_length > 0)
# [1] FALSE
check for NA values
#___________________CHECK FOR NA VALUES IN ride_length
any(is.na(all_trips19_20$ride_length))
# [1] FALSE
#__________________OF COURSE WE CAN CHECK FOR NON-MISSING DATA
any(!is.na(all_trips19_20$ride_length))
# [1] TRUE
#__________________AND FOR NON-MISSING DATA THAT IS < 0 OR == 0
any(all_trips19_20$ride_length <= 0, na.rm = TRUE)
# [1] TRUE
count trips of 0 length
#___________________COUNT HOW MANY ride_length VALUES == 0
sum(all_trips19_20$ride_length == 0)
# [1] 93
count <0 values
#___________________COUNT HOW MANY NEGATIVE ride_length VALUES WE HAVE
sum(all_trips19_20$ride_length < 0)
# [1] 117
identify <0 rows
You could install another package if you want to identify the rows that contain the negative ride_length.
#______________________IDENTIFY ROWS WITH NEGATIVE ride_length
install.packages("fixr")
library(fixr)
check_for_negative_values(all_trips19_20$ride_length)
# Data frame contains negative values.
#   [1] 520402 648667 651963 652290 653845 655288 655654 656362 657167 658842
#  [19] 669752 678262 678499 678889 679182 679252 680301 680757 682576 682662
#  [37] 694476 694635 696693 697774 698938 699115 699204 699450 699619 699855
#  [55] 714612 714687 714761 715171 715474 719275 719680 725559 726496 730499
#  [73] 742697 743604 743619 745898 747194 747233 748686 754006 754853 755032
#  [91] 763358 766631 767985 768102 768693 768833 772284 772868 774788 775777
# [109] 779170 779496 781519 781699 786503 786687 788671 788946 789252
As you can see, if we specifically target the ride_length column we get a list of all the rows that contain a negative value in that column.
identify ALL <0 values
Let’s see if any other NEGATIVE values exist in the entire dataset. The output displays all the rows where a NEGATIVE value appears in a specific column. As you can see we have 117 rows, and the only column with negative values is column 10, which happens to be the ride_length column.
#_____________________IDENTIFY ANY NEGATIVE VALUES IN DF
check_for_negative_values(all_trips19_20)
# Data frame contains negative values.
#          row col
#   [1,] 520402  10
#   [2,] 648667  10
#   [3,] 651963  10
#   [4,] 652290  10
# ...
# [115,] 788671  10
# [116,] 788946  10
# [117,] 789252  10
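If you’d rather not install an extra package, a minimal base-R sketch gets the same information:
# Row indices with negative ride_length, no extra package needed
which(all_trips19_20$ride_length < 0)
# Or flag which numeric columns contain any negative values at all
sapply(Filter(is.numeric, all_trips19_20), function(col) any(col < 0, na.rm = TRUE))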
extract ride_length <0 rows
Let’s see what stands out in these rows with NEGATIVE ride_length values. So let’s create a new df with just those rows
#_____________________LET'S CREATE A DF WITH JUST THE NEGATIVE ride_length FOR REVIEW
neg_length <- all_trips19_20[all_trips19_20$ride_length < 0, ]
View(neg_length)
the new df shows 117 rows and:
- all start_station_name and end_station_name values == HQ QR
- of course all start_station_id and end_station_id values are the same == 675
- After further research I realized that these were bikes taken out of circulation for quality control, maintenance…
- So we need to remove these rows from the dataset
extract ride_length = 0 rows
#_____________________LET'S CREATE A DF WITH JUST THE ZERO ride_length FOR REVIEW
zero_length <- all_trips19_20[all_trips19_20$ride_length == 0, ]
View(zero_length)
Similar to the NEGATIVE value rows all observations listed above apply here as well, so we’ll be removing these rows from the dataset as well. There are 93 rows in this df.
check df dimensions
Before we delete any rows, let’s check the dimensions of the df
#_____________________CHECK DF DIMENSIONS
dim(all_trips19_20)
# [1] 791956     10
delete <= 0 rows
I will not delete any rows; instead I’ll create a new df excluding the rows with ride_length <= 0. If you want to go further and include another logical condition (OR |) for station names == HQ QR, you can be my guest. I will not do it; instead I will check the new df afterwards for station names == HQ QR and deal with them then if they exist. I believe including that condition in this step might cause results I am not ready for yet. We know for sure we only have 117 + 93 = 210 rows, so let’s focus on that now.
#____________________CREATE A NEW DF WITH POSITIVE ride_length
all_trips19_20_v2 <- all_trips19_20[all_trips19_20$ride_length > 0, ]
dim(all_trips19_20_v2)
# [1] 791746     10
extract “HQ QR” rows
Now that we’ve taken out all rows that had a ride_length <= 0, let’s go back and see if there are any rows where start_station_name or end_station_name columns had values of “HQ QR”.
#____________________LET'S CHECK TO SEE IF "HQ QR" APPEARS IN ANY *_station_name COLUMNS
hq_anywhere <- all_trips19_20_v2 %>%
  filter(start_station_name == "HQ QR" | end_station_name == "HQ QR")  # plain filter(); filter_all() is unnecessary here
View(hq_anywhere)
It appears that 3558 rows have HQ QR and their ride_length is >0. So we need to remove them as well as anything <=0
delete rows with “HQ QR”
#_________________REMOVE ROWS THAT CONTAIN "HQ QR"
all_trips19_20_v3 <- all_trips19_20_v2[!(all_trips19_20_v2$start_station_name == "HQ QR" | all_trips19_20_v2$end_station_name == "HQ QR"), ]
dim(all_trips19_20_v3)
# [1] 788188     10
add timeframes
The way the data is set up, it will help down the road if we break the started_at time into more detailed columns, such as year, quarter, month, day of the month, and day of the week. The reason is that when we perform our analysis we can look for patterns across weekdays, weekends, and specific months or quarters.
#_______________LET'S BREAK started_at INTO > DATE, YEAR, QUARTER, MONTH(NUM), DAY(NUM), WEEKDAY(STRING)
#_______________ADD TIMEFRAMES
all_trips19_20_v3 <- all_trips19_20_v3 %>%
  mutate(
    date = format(as.Date(started_at), format = "%m%d%Y"),  # monthdayYYYY
    year = format(as.Date(started_at), format = "%Y"),      # %Y > 1111 and %y > 11
    quarter = quarters(started_at),                         # quarter
    month = format(as.Date(started_at), format = "%m"),     # month as a number
    day = day(started_at),                                  # day of the month as a number
    day_of_week = weekdays(started_at)                      # text for the day
  )
unique start stations
#__________________________LET'S SEE HOW MANY UNIQUE STARTING STATIONS WE HAVE
start_id <- all_trips19_20_v3 %>%
  # filter(year == 2020) %>%
  distinct(start_station_id) %>%
  arrange(start_station_id) %>%
  rename(stationID = start_station_id)
# 611 ROWS; if we filter out 2019 we get 606 rows, so 5 stations were removed in 2020
unique end stations
#__________________________LET'S SEE HOW MANY UNIQUE ENDING STATIONS WE HAVE
end_id <- all_trips19_20_v3 %>%
  # filter(year == 2020) %>%
  distinct(end_station_id) %>%
  arrange(end_station_id) %>%
  rename(stationID = end_station_id)
# 611 ROWS
compare stations
#_________________________LET'S SEE IF BOTH STATION LISTS ARE THE SAME
identical(start_id$stationID, end_id$stationID)  # use the renamed column
# FALSE!
all.equal(start_id, end_id)
# "Names: 1 string mismatch" "Component 1: Mean relative difference: 0.002392677"
glimpse(end_id)
# both are dbl
summary(compare_df_cols_same(start_id, end_id))
summary(compare_df_cols(start_id, end_id))
outer join lists
#________________OUTER JOIN THEM BECAUSE WE WANT ALL DISTINCT STATION IDs FROM BOTH
join_station_id <- merge(start_id, end_id, by = 'stationID', all = TRUE)
# 613 ROWS
filter lists
#___________________FILTER OUT STARTING & ENDING STATIONS WITHOUT COORDINATES
station_coord <- all_trips19_20_v3 %>%
  select(start_station_id, start_lat, start_lng) %>%
  filter(!is.na(start_lat), !is.na(start_lng)) %>%
  group_by(start_station_id, start_lat, start_lng) %>%
  summarize(number_of_rides = n()) %>%  # this helper column gets dropped after the join below
  arrange(start_station_id) %>%
  rename(stationID = start_station_id)
# 606 ROWS; fewer because some trips didn't have geo coordinates
uncommon stations
#____________________LET'S SEE WHAT IS UNCOMMON BETWEEN THE TWO
anti_join(join_station_id, station_coord)
# Below we get the 7 stationIDs that are not in station_coord:
# Joining with `by = join_by(stationID)`
#   stationID
# 1       360
# 2       361
# 3       561
# 4       564
# 5       565
# 6       566
# 7       665
review uncommon ids
#____________________LET'S EXTRACT THOSE STATIONS FOR REVIEW
uncommon <- all_trips19_20_v3[all_trips19_20_v3$start_station_id == 360 |
                              all_trips19_20_v3$start_station_id == 361 |
                              all_trips19_20_v3$start_station_id == 561 |
                              all_trips19_20_v3$start_station_id == 564 |
                              all_trips19_20_v3$start_station_id == 565 |
                              all_trips19_20_v3$start_station_id == 566 |
                              all_trips19_20_v3$start_station_id == 665, ]
# OUTPUT
# 40 ROWS of trips without geolocations, all from 2019.
# Those stations are no longer active in 2020, so we can ignore them and proceed with an inner join,
# which will eliminate the uncommon ones.
inner join lists
Here I’ll take the cleaned station list and join them with the stations with coordinates. Since some in the station list are no longer in use, using inner join will automatically eliminate those unwanted stations from the join_station_id list.
#_____________________________INNER JOIN TO COMBINE COORDINATES WITH ACTIVE STATIONS
all_active_stations_coord <- inner_join(join_station_id, station_coord, by = 'stationID')
#____________________________DROP NUMBER OF RIDES COLUMN THAT WAS USED TO GROUP THEM
all_active_stations_coord <- all_active_stations_coord %>%
  select(-c(number_of_rides))
Save data
save cleaned data
Now we have two cleaned files, one with all the pertinent data we need and the other with the coordinates to every station in the first file.
#__________________________________SAVE CLEANED DATA________________________
file.create("rcleaned_bikeshare_q1_19_20.csv")
write_csv(all_trips19_20_v3,"rcleaned_bikeshare_q1_19_20.csv")
file.create("rcleaned_bikeshare_geolocation_20.csv")
write_csv(all_active_stations_coord,"rcleaned_bikeshare_geolocation_20.csv")
ANALYZE – R
Analyze with R
count rides/user
#______________________________LET'S SEE THE COUNT FOR EACH USER TYPE
table(all_trips19_20_v3$member_casual)
#   Customer Subscriber
#      67877     720311
The total of the two user types adds up to 788188, which is what we had a couple of sections above.
#_____________________________COULD BE DONE THESE OTHER WAYS
all_trips19_20_v3 %>%
  group_by(member_casual) %>%
  summarize(number_of_rides = n()) %>%
  gt() %>%
  gtsave("total_per_user.png")
#___________________________________OR THIS WAY
all_trips19_20_v3 %>%
  count(member_casual) %>%
  gt()
observation
- I recommend tracking both with IDs so we can narrow in on how many users are more active than others and gather some insight on their use
- If users are tracked via ID and zip code, then we’ll know how many of them are business commuters, casual users, or actual tourists who just use the bikes while in town. This way we can have a more focused marketing campaign.
summary
#________________MAX, MIN, MEAN, MEDIAN FOR ride_length
min_trip_hours = round(min(all_trips19_20_v3$ride_length)/3600, 2)
max_trip_minutes = round(max(all_trips19_20_v3$ride_length)/60, 2)
max_trip_hours = round(max(all_trips19_20_v3$ride_length)/3600, 2)
median(all_trips19_20_v3$ride_length)
mean(all_trips19_20_v3$ride_length)
#_______________________INSTEAD I PREFER TO DO IT ALL IN ONE STATEMENT
summary(all_trips19_20_v3$ride_length)
#    Min. 1st Qu.  Median    Mean 3rd Qu.     Max.
#       1     331     539    1189     912 10632022
summarise
We’ll get a better idea of the breakdown if we group them by type of user and then calculate the statistics for each category
#______________________GROUP BY TYPE OF RIDER THEN CALCULATE STATISTICS
all_trips19_20_v3 %>%
  group_by(member_casual) %>%
  summarise(
    max = max(ride_length),
    min = min(ride_length),
    median = median(ride_length),
    mean = mean(ride_length)
  ) %>%
  gt()
rides per year
#_________________LOAD LIBRARY scales
#_______________TOTAL TRIPS FROM YEAR TO YEAR
library(scales)
trips_year <- trips_19_20 %>%
  group_by(year) %>%
  summarise(number_of_rides = n()) %>%
  mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
  gt() %>%
  tab_header(title = md("**Daily breakdown for rides/year**"))
gtsave(trips_year, "trips_year.png")
rides per year/user type
#___________________YEARLY CHANGE PER USER TYPE
user_year <- trips_19_20 %>%
group_by(member_casual, year) %>%
summarise(number_of_rides =n()) %>%
mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
gt() %>%
tab_header(title =md("**Rides/year per user type** "))
gtsave( user_year, "user_per_year.png")
observation
- The table above clearly shows that marketing has done a great job in attracting new Customers, with a 93% increase.
- The Subscriber increase is very good as well.
- Once again the disparity between the total rides by each user type is glaring; looking at the previous table, which showed a 16% increase in total rides from 2019 to 2020, we can conclude that the 93% Customer increase only contributed about 5% of the total rides.
- It is imperative that we tie the number of users to the number of rides (which is not done at this time) to gain any meaningful insight from our analysis.
total rides per day
Let’s see the breakdown of total rides per day per user type (member_casual)
#______________USE TABLE FOR BREAKDOWN TOTAL RIDES/DAY/USERTYPE
daily_table <- trips_19_20 %>%
  group_by(member_casual, week_day) %>%
  summarise(max = max(ride_length),
            min = min(ride_length),
            median = median(ride_length),
            mean = mean(ride_length),
            number_of_rides = n()) %>%
#__________________DESIGN TABLE OUTPUT
  gt() %>%
  tab_header(
    title = md("**Daily breakdown for both _user types_**")
  ) %>%
  fmt_number(columns = c(median, mean), decimals = 0)
# save the assigned table AFTER the pipe, not inside it
gtsave(daily_table, "daily_bkdwn.png")
condensed table
#________________CONDENSE TABLE FOR EASY READ
trips_19_20 %>%
group_by(member_casual, week_day) %>%
summarise(number_of_rides =n()) %>%
gt() %>%
tab_header(title =md("**Daily breakdown for both _user types_** "))
total rides per day/year
#_______________________LET'S SEE HOW TRIPS CHANGED PER DAY FROM YEAR TO YEAR
daily_yearly <- trips_19_20 %>%
group_by(member_casual,year, week_day) %>%
summarise(number_of_rides =n()) %>%
gt() %>%
tab_header(title =md("**Daily breakdown for both _user types year vs year** "))
gtsave( daily_yearly, "daily_yearly.png")
percent change rides/day
#_________________LET'S SEE PERCENTAGE CHANGE FROM YEAR TO YEAR PER DAY
daily_percent <- trips_19_20 %>%
group_by(member_casual, week_day, year) %>%
summarise(number_of_rides =n()) %>%
mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
gt() %>%
tab_header(title =md("**Daily percent change for both _user types year vs year** "))
gtsave( daily_percent, "daily_percent.png")
observations
FYI: day 1 = Sunday, day 2 = Monday, and so on.
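If you prefer labels over numbers in these tables, here is a minimal sketch, assuming week_day is stored as a number with 1 = Sunday as described above; the week_day_name column is my own addition:
# Map the numeric week_day (1 = Sunday) to a readable label
trips_19_20 <- trips_19_20 %>%
  mutate(week_day_name = factor(week_day, levels = 1:7,
    labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))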
Customers user type:
- Customer users are more active on Saturdays and Sundays, which would lead one to believe that they could be tourists or visitors from one of the suburbs of Chicago.
- As I noted before, tracking zip code for users or a userId with a profile would quickly focus and clarify our path to a more insightful analysis.
- The most obvious percent increase is on day 1, which is Sunday.
- Saturday only had a 25% increase but still has the second-most rides, after Sunday.
- I am not sure how you can convert weekend users to Subscribers if they are just visitors who travel to the city on occasion. Not knowing the pricing of a daily pass compared to a yearly subscription, I need more information to be able to devise a logical, effective marketing strategy for their conversion.
- We also don’t know if any of the users are repeat users, or if all these rides come from new cyclists being exposed to our bikes.
- If we are generating unique users, it will be worth marketing to them with a different plan than to the regular commuters who make up the weekday usage shown by the Subscriber type.
Subscriber user type:
- Regardless of which year you review, the weekday use by this type of user overwhelms the weekend numbers
- Thursdays and Fridays showed a decline in rides, but those numbers are still much higher than weekend numbers
- A cause of these declines could be a lack of bikes to use; how do we know how many bikes are at each station?
- How do we know if a user proceeds to their regular station to pick up a bike and realizes none are available?
- Does the app show the bikes available at each station at all times?
- Do we track how many times a user had to go to an alternate station due to the lack of availability of bikes?
- Those are numbers that can easily be tracked with a more user-friendly app.
- Once again it comes down to UX: how do we know the frustration level of a user if they have to walk several blocks to another station, or worse yet take a cab or walk instead?
busiest stations
#________________STATIONS WITH THE LARGEST NUMBER OF RIDES STARTED
top_stations <- trips_19_20 %>%
group_by(start_station_id) %>%
summarise(number_of_rides =n()) %>%
arrange(desc(number_of_rides))
top 5 starting stations
#________________TOP 5 STATIONS WHERE RIDES WERE INITIATED
top_5 <- top_stations %>%
slice(1:5)
observations
- Interesting numbers: the top 3 stations individually have about 56% more rides than the 4th-busiest station
- I will zoom in on these top 3 stations and see what insight I can gather
top 3 change/year/day
#_________________LET'S SEE % CHANGE FROM YEAR TO YEAR PER DAY FOR THE TOP 3 STATIONS
rides_top_stations <- trips_19_20 %>%
  filter(start_station_id == 192 | start_station_id == 77 | start_station_id == 91) %>%  # TOP 3 ONLY
  group_by(week_day, start_station_id, year) %>%
  summarise(number_of_rides = n()) %>%
  mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
  arrange(desc(start_station_id)) %>%
  gt() %>%
  tab_header(title = md("**Top 3 stations: rides/day year/year**"))
gtsave(rides_top_stations, "rides_top_stations.png")
observations
- First off, let me explain how to quickly read the table:
- It is divided into 3 vertical sections
- In column 1 you’ll see the day of the week (1 = Sunday) and the starting station id (e.g. 192), followed by two rows for 2019 and 2020; column 2 is the number of rides; column 3 is the percent change from the year before for that same day and station
- The obvious: weekday numbers (days 2-6) are more than 10X the weekend-day numbers
- Station 192: weekday ride numbers have a smaller percent increase from 2019
- Station 91: every day has a decrease in ride numbers except for Sunday, with the decreases being significant, close to a 20% decrease on average
- Station 77: saw decreases from Thursday through Saturday, with Wednesday having a 1% increase, which is not significant
- I’ll dig into the following next to see if it leads somewhere:
- There are about 4000 rides per WEEKDAY initiated from just these 3 stations.
- Is inventory being tracked at these stations, and if so, how? If not, are we relying on users to walk to nearby stations?
- Proximity of these 3 stations to one another
- First thing I worry about: are there enough bikes to rent out from these 3 locations?
- Is UX the cause of the decline in rides?
- I will check how many rides terminate at these top 3 stations to see if inventory is being replenished by migration from other stations
- I will check how many cyclists start and end their rides at these top 3 stations to see if that affects the supply of bikes
top 3 rides/2020
#________________DAILY RIDES STARTED FROM THE TOP 3 STATIONS FOR 2020
started_top3 <- trips_19_20 %>%
filter((start_station_id == 192 | start_station_id == 77 | start_station_id == 91) & year == 2020) %>%  # parentheses matter: apply the year filter to all three stations
group_by(week_day,start_station_id) %>%
summarise(number_of_rides =n()) %>%
arrange(desc(start_station_id)) %>%
gt()%>%
tab_header(
title =md("**Top 3 stations: rides/day for 2020** "),
subtitle = md("STARTING stations")
)
gtsave( started_top3, "started_top3s.png")
top 5 ending stations
#_____________________________FIGURE OUT THE TOP ENDING STATIONS
ending_trips <- trips_19_20 %>%
  group_by(end_station_id) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides))
#____________________________FILTER TO TOP 5 ENDING STATIONS AND PRINT A TABLE
end_5 <- ending_trips %>%
  slice(1:5) %>%
  gt() %>%
  tab_header(title = md("**Top 5 ENDING STATIONS**"))
gtsave(end_5, "end5.png")
chart rides per day
#_______________________________PLOT_________________________
trips_19_20 %>%
group_by(member_casual, week_day) %>%
summarise(number_of_rides =n()) %>%
ggplot(aes(x= week_day, y= number_of_rides, fill = member_casual, label = number_of_rides)) +
geom_bar(stat = "identity") +
geom_text(size = 3, position = position_stack(vjust = 0.5))+
labs(title="Total rides per day per user type")
PROCESS – SQL
We already changed column names to match in both tables earlier in R, so let’s continue from there with SQL.
filter tables
join tables
SELECT
  CAST(ride_id AS STRING) AS ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
  ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips20`
rename member_casual
SELECT
CAST(ride_id AS STRING) AS ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
CASE
WHEN member_casual = 'member' THEN 'Subscriber'
WHEN member_casual = 'casual' THEN 'Customer'
ELSE member_casual
END
FROM `prime-depot-415622.BikeShare.trips20`
verify trip count
Same as we did with R above in the table() to verify section, let’s check how many trips we have for each type of member_casual. The new code (compared to the above) is the outer query wrapped around what we had. The results displayed in the table are the same as what we had before with R up to this point.
- We’ll create a subquery to enclose what we’ve done so far
- Name the subquery -> subquery
- Group and count the number of occurrences for each type of rider
SELECT
  subquery.member_casual,
  COUNT(*) AS trip_count
FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
GROUP BY subquery.member_casual
OUTPUT
Subscriber: 720313, Customer: 71643, Total: 791956
calculate ride_length
SELECT
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
(
SELECT
CAST(ride_id AS STRING) AS ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
CASE
WHEN member_casual = 'member' THEN 'Subscriber'
WHEN member_casual = 'casual' THEN 'Customer'
ELSE member_casual
END
FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
filter out <=0 ride_length
SELECT
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;
OUTPUT
Total rows: 791746
look for “HQ QR”
- It appears that 3557 rows have a start_ or end_ station name of "HQ QR" AND a ride_length > 0, e.g.:
WHERE start_station_name = "HQ QR" AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
- It appears that 3768 rows match when we filter on the station names alone (the full count query is sketched below):
WHERE start_station_name = "HQ QR" OR end_station_name = "HQ QR";
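A minimal sketch of the full count query behind those numbers, wrapping a slimmed-down version of the same UNION ALL subquery used throughout (only the WHERE clause changes between the two counts):
SELECT COUNT(*) AS hq_rows
FROM
  ( -- same UNION ALL pattern as in the blocks above, trimmed to the needed columns
    SELECT started_at, ended_at, start_station_name, end_station_name
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT started_at, ended_at, start_station_name, end_station_name
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE (start_station_name = "HQ QR" OR end_station_name = "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;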
filter out “HQ QR”
SELECT
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
(
  SELECT
    CAST(ride_id AS STRING) AS ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    end_station_name,
    end_station_id,
    member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    end_station_name,
    end_station_id,
    CASE
      WHEN member_casual = 'member' THEN 'Subscriber'
      WHEN member_casual = 'casual' THEN 'Customer'
      ELSE member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;
OUTPUT
Total rows: 788188
add timeframes
Let's add the following columns so we have the ability to expand our analysis. The way the data is set up now, all we have are ride_id and the TIMESTAMP columns for the start and finish of each ride.
SELECT
EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
EXTRACT(MONTH FROM subquery.started_at) AS month,
EXTRACT(YEAR FROM subquery.started_at) AS year,
EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
EXTRACT(DATE FROM subquery.started_at) AS date,
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
(
SELECT
CAST(ride_id AS STRING) AS ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
end_station_name,
end_station_id,
member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
ride_id,
started_at,
ended_at,
start_station_id,
start_station_name,
end_station_name,
end_station_id,
CASE
WHEN member_casual = 'member' THEN 'Subscriber'
WHEN member_casual = 'casual' THEN 'Customer'
ELSE member_casual
END
FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
(start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
AND
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;
ANALYZE – SQL
Analyze with SQL
count rides/User
We create another subquery that wraps the previous ones; note the code added to the previous block:
SELECT
  subquery2.member_casual,
  COUNT(*) AS trip_count
FROM
(
  SELECT
    subquery.member_casual,
    EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
    EXTRACT(MONTH FROM subquery.started_at) AS month,
    EXTRACT(YEAR FROM subquery.started_at) AS year,
    EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
    EXTRACT(DATE FROM subquery.started_at) AS date,
    TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
GROUP BY subquery2.member_casual;
OUTPUT
member_casual | trip_count
Subscriber | 720311
Customer | 67877
group & sort
Let's group the number of trips by user type (member_casual) and by day of the week. (In BigQuery, EXTRACT(DAYOFWEEK FROM ...) returns 1 for Sunday through 7 for Saturday.)
We just add the new code to the previous block:
SELECT
  subquery2.member_casual,
  subquery2.day_of_week,
  COUNT(*) AS trip_count
FROM
(
  SELECT
    subquery.member_casual,
    EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
    EXTRACT(MONTH FROM subquery.started_at) AS month,
    EXTRACT(YEAR FROM subquery.started_at) AS year,
    EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
    EXTRACT(DATE FROM subquery.started_at) AS date,
    TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
GROUP BY subquery2.member_casual, subquery2.day_of_week
ORDER BY subquery2.day_of_week, subquery2.member_casual;
OUTPUT: I created the same output in various charts with Tableau, as shown below.
ANALYZE – TABLEAU
Chart analyses
total rides per day
rides per day/year
Let's see what has happened from year to year for the same quarter, Q1. It appears that the number of rides has gone up on all days EXCEPT Thursdays and Fridays, where it has decreased.
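For reference, here is a minimal sketch of the aggregation behind this year-over-year chart, reusing the unioned tables and filters from the SQL section (only the columns the grouping needs are kept):
SELECT
  EXTRACT(YEAR FROM subquery.started_at) AS year,
  EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
  COUNT(*) AS trip_count
FROM
(
  SELECT started_at, ended_at, start_station_name, end_station_name
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT started_at, ended_at, start_station_name, end_station_name
  FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
GROUP BY year, day_of_week
ORDER BY year, day_of_week;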
total rides per year
location of top 5
top 5 start stations
What's interesting about the chart that follows:
top 5 v2
top 5 end stations
top5 start=end
breakdown by quarters
The results show an obvious discrepancy: Subscribers' trip count is much higher than Customers' on weekdays, while on weekend days it's the opposite.
Let's see if the results change in different quarters... WAIT, the data provided covers only Q1 of 2019 and 2020, so there is nothing to compare yet.
That's one thing we'd need to do: load up all months, or create a separate dataframe for each quarter, and then do the comparison.
We just add the new code to the previous block (note that member_casual has to be carried into subquery2's SELECT list, and quarter added to the GROUP BY, for this to run):
SELECT
  subquery2.quarter,
  subquery2.member_casual,
  subquery2.day_of_week,
  COUNT(*) AS trip_count
FROM
(
  SELECT
    subquery.member_casual,
    EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
    EXTRACT(MONTH FROM subquery.started_at) AS month,
    EXTRACT(YEAR FROM subquery.started_at) AS year,
    EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
    EXTRACT(DATE FROM subquery.started_at) AS date,
    TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
  (
    SELECT
      CAST(ride_id AS STRING) AS ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      member_casual
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT
      ride_id,
      started_at,
      ended_at,
      start_station_id,
      start_station_name,
      end_station_name,
      end_station_id,
      CASE
        WHEN member_casual = 'member' THEN 'Subscriber'
        WHEN member_casual = 'casual' THEN 'Customer'
        ELSE member_casual
      END
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
GROUP BY subquery2.quarter, subquery2.member_casual, subquery2.day_of_week
ORDER BY subquery2.quarter, subquery2.day_of_week, subquery2.member_casual;
same station rides
Let’s see how many rides ended at the location where they started:
WHERE
  subquery2.start_station_id = subquery2.end_station_id
OUTPUT
Total rows: 23626
(For this filter to work, start_station_id and end_station_id have to be carried through subquery2's SELECT list; in the blocks above they were only selected in the inner subquery.)
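Here is a minimal sketch of that full query under that assumption, with the intermediate columns trimmed to what the filter needs:
SELECT COUNT(*) AS same_station_rides
FROM
(
  SELECT
    subquery.start_station_id,
    subquery.end_station_id
  FROM
  (
    SELECT started_at, ended_at, start_station_id, start_station_name,
           end_station_name, end_station_id
    FROM `prime-depot-415622.BikeShare.trips19v1`
    UNION ALL
    SELECT started_at, ended_at, start_station_id, start_station_name,
           end_station_name, end_station_id
    FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
  WHERE
    (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
    AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
) AS subquery2
WHERE subquery2.start_station_id = subquery2.end_station_id;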
Observations
If we group by station and sort by trip_count, a glaring figure jumps out at us: the top 3 stations by ride count (each close to 50% more than fourth place) are, in order: stations 192, 91, and 77.
rank | member_casual | station_id | trip_count
1 | Subscriber | 192 | 13799
2 | Subscriber | 91 | 13434
3 | Subscriber | 77 | 12891
4 | Subscriber | 133 | 8720
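The figures above come from grouping the filtered trips by station. Here is a minimal sketch of that grouping, assuming the counts are by start station and reusing the same filters (drop member_casual from the SELECT and GROUP BY to get the all-users version shown next):
SELECT
  subquery.member_casual,
  subquery.start_station_id,
  COUNT(*) AS trip_count
FROM
(
  SELECT started_at, ended_at, start_station_id, start_station_name,
         end_station_name, member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT started_at, ended_at, start_station_id, start_station_name,
         end_station_name,
    CASE
      WHEN member_casual = 'member' THEN 'Subscriber'
      WHEN member_casual = 'casual' THEN 'Customer'
      ELSE member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
) AS subquery
WHERE
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
GROUP BY subquery.member_casual, subquery.start_station_id
ORDER BY trip_count DESC;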
If we add the other user type and recalculate for all users, we get:
rank | station_id | trip_count
1 | 192 | 14155
2 | 91 | 13640
3 | 77 | 13362
As you can see, Customer rides (the all-user total minus the Subscriber count) account for:
- 356 of the 14155 total trips at station 192 (14155 - 13799)
- 206 of the 13640 total trips at station 91 (13640 - 13434)
- 471 of the 13362 total trips at station 77 (13362 - 12891)
observations
- The company claims that "Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day." How is that number calculated?
- The two plans offered to casual riders (single-ride passes and full-day passes) are not tracked in the data provided; all we have is the user type, "Customer" or "Subscriber".
- The data provided doesn't link any of the rides to an actual account. Why don't we track by userId? Don't Subscribers have to create an account? Why are we sorting the data only by member_casual?
- We could gather more insight if each rideId were linked to a userId; is that data available but simply not provided for analysis? (See the sketch after this list.)
- How are docking station locations related to subway, train, or bus stations?
- Are docking stations near parking structures or major tourist, business centers?
- Are users arriving in the city via other modes of transportation and using the bikes for commuting within the city?
- The objective of the project is to find ways to convert "Customers" to "Subscribers", which is logical, except the data provided doesn't support the theory proposed by Mrs. Moreno. The data shows a large discrepancy between the two user types: Customer rides amount to about 9% of Subscriber rides. See the table below.
- The data does not show how many actual users each type represents, and it would be illogical to treat rideId as a userId. As provided, the data does not support an insightful analysis of the particular hypothesis presented.
- Converting 9% of rides might improve the bottom line, sure, but at what cost? How many users would we actually be converting? How many Subscribers do we already have?
- Considering that I was not provided with data relevant to the proposed hypothesis, I would shift my focus to other issues that the data has exposed.
- The fact that weekday ride numbers have been decreasing at the busiest stations is alarming (see the last table below). The top stations are very close to one another, so it is possible that users are not getting good service there, or that a competitor has entered the arena and is targeting that specific small area. Maybe the inventory at those stations doesn't support the volume of rides initiated there?
- The fact that inventory at those stations is not tracked needs to be addressed.
- With the data provided, the top stations matter far more to the bottom line than the proposed hypothesis. We cannot worry about converting 9% of rides while the top 3 stations are losing rides at a faster pace, and in larger numbers, than that 9%.
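To make the userId point above concrete: if each ride carried a user_id (a hypothetical column, not present in the data we were given), a sketch like the following would show how many actual riders each segment represents and how heavily each rider uses the service:
SELECT
  member_casual,
  COUNT(DISTINCT user_id) AS distinct_users,                 -- user_id is hypothetical
  COUNT(*) AS trip_count,
  ROUND(COUNT(*) / COUNT(DISTINCT user_id), 1) AS rides_per_user
FROM `prime-depot-415622.BikeShare.trips20`                  -- assumes the column were added upstream
GROUP BY member_casual;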
RECOMMENDATIONS
- Modify the data we track.
- Implement userId so we can focus our analysis and be more insightful.
- Focus on UX at the top stations.
- Stabilize ride numbers at the top stations.
- Reassess the situation in the near future after modifications have been implemented.