Contents

What follows is a case study that includes “Documentation of any cleaning or manipulation of data” as part of the deliverables, so I’ve included all my step-by-step details.

I’ll start off with R, then I’ll duplicate the process using SQL, and finally Tableau. Other analysis done in Python is not included here; it is covered in other posts.

BIKESHARE CASE STUDY

Bikeshare is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Bikeshare sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive for people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Bikeshare users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.

In 2016, Bikeshare launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Bikeshare’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans:

  • single-ride passes
  • full-day passes and
  • annual memberships

Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are members. Bikeshare’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Bikeshare attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth.
Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Bikeshare program and have chosen Bikeshare for their mobility needs.

Purpose

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Bikeshare bikes differently.

From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Bikeshare executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Moreno has set a clear goal:

Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to know:

  • How do annual members and casual riders use bikes differently?
  • Why would casual riders buy annual memberships?
  • How can Bikeshare use digital media to influence casual riders to become members?

Moreno and her team are interested in analyzing the historical bike trip data to identify trends.

Stakeholders

  • Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
    These may include email, social media, and other channels.
  • Bikeshare marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide the marketing strategy.
    You joined this team six months ago and have been busy learning about Bikeshare’s mission and business goals—as well as how you, as a junior data analyst, can help Bikeshare achieve them.
  • The executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

Deliverables

1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of your analysis
5. Supporting visualizations and key findings
6. Your top three recommendations based on your analysis

Prepare

data

Bikeshare’s historical trip data is downloadable here.

The data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Bikeshare bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information.

Note: if using Posit’s RStudio (free plan), use the Divvy 2019 Q1 and Divvy 2020 Q1 datasets. Choosing other data might lead to errors because the data exceeds the memory available in the free plan.

load data
#______________________________INSTALL PACKAGES
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
install.packages("ggplot2")
install.packages("readr")
install.packages("readxl")
install.packages("gt")
install.packages("webshot2")
#______________________________LOAD LIBRARIES
library(tidyverse)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(stringr)
library(readr)
library(readxl)
library(dplyr)
library(gt)
library(webshot2)
#______________________READ DATA FROM Q1 2019 AND 2020
trips19 <- read_excel("BikeShare/BikeShareOriginalData/Divvy_Trips_2019_Q1.xlsx")
trips20 <- read_excel("BikeShare/BikeShareOriginalData/Divvy_Trips_2020_Q1.xlsx")
observations
  • The two plans offered to casual riders (single-ride passes and full-day passes) are not distinguished in the data provided; all that is tracked is whether the user is a “Customer” or a “Subscriber”.
  • The company claims that “Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day”. How is that number calculated? Nothing in the data shows how it could possibly be derived!
  • UserId is not tracked!? How are we to guess how many users are using the system? This is a basic data-collection issue which makes the data broad and almost useless.
  • Birthyear and gender data are collected, which could become helpful if we can figure out how many users we have.
  • Up to this point I pretty much know which direction I have to shift my analysis based on what’s provided in the data. This is not bias by any means; it is the simple recognition that the company lacks the knowledge to collect the right data. It’s like trying to estimate the mpg of a car from data collected on its wiper-fluid level.
  • At this point I’d have to take my concerns to my manager and advise her of the limitations of the data collected. I will also need to bring up the fact that her hypothesis (“Moreno believes there is a solid opportunity to convert casual riders into members”) will be impossible to justify with this data, since we don’t know the exact number of casual and member users.
  • Either additional data needs to be provided or the system needs to be adjusted to collect a userId.
  • Since the data is provided for Customers and Subscribers as a whole and not broken down per user, I’ll have to see if any useful insight can still be drawn at the aggregate level.
  • Column naming changed significantly from 2019 to 2020:
    • trip_id > ride_id, start_time > started_at, end_time > ended_at, usertype > member_casual
    • from_station_id & from_station_name > start_station_id & start_station_name; to_station_id & to_station_name > end_station_id & end_station_name
    • bikeid, tripduration, gender, and birthyear were removed
    • end_lat, end_lng, start_lat, and start_lng were added

PROCESS WITH R

I usually detail my steps in “documents” for record keeping; in this project, “Documentation of any cleaning or manipulation of data” is part of the deliverables, so I’ve included it below.

Note: I will be using R, SQL, and Tableau for processing and analysis; most of the steps done in R will be repeated in SQL.

Cleaning with R

rename

#_____________________RENAME 2019 COLUMNS TO MATCH 2020
trips19 <- rename( trips19,
            ride_id = trip_id,
            started_at = start_time,
            ended_at = end_time,
            rideable_type = bikeid,
            start_station_name = from_station_name,
            start_station_id = from_station_id,
            end_station_name = to_station_name,
            end_station_id = to_station_id,
            member_casual = usertype)

change type

trips19 has ride_id as “num” while trips20 has it as “chr”, so let’s convert it in trips19 to character so we can merge/bind both datasets together without any issues!

#____________________CONVERT ride_id & rideable_type FROM num TO char
trips19 <- mutate(trips19,
            ride_id = as.character(ride_id),
            rideable_type = as.character(rideable_type))

vertical merge datasets

rbind() is similar but REQUIRES all datasets to have exactly the same columns. We use bind_rows() instead because each dataset has columns the other lacks: trips19 has tripduration, gender, and birthyear, which aren’t in trips20; trips20 has start_lat, start_lng, end_lat, and end_lng, which don’t exist in trips19. bind_rows() fills the missing columns with NA.

#___________________COMBINE BOTH DATASETS VERTICALLY VIA  bind_rows()
trips19_20 <- bind_rows(trips19, trips20)

filter/Exclude Columns

#__________________FILTER NEW DATASETS PRIOR TO MANIPULATION - EXCLUDE CERTAIN COLUMNS
trim_trips19_20 <- subset(trips19_20,
                   select = -c( gender, birthyear))

#__________________OR YOU CAN DO IT THIS WAY AS WELL
trim_trips19_20 <- trips19_20 %>% 
                   select(-c( gender, birthyear))

replace with case_when()

The member_casual values differ between the two years: trips20 uses “member”/“casual” while trips19 uses “Subscriber”/“Customer”, so we need to standardize on one set of labels (I’ll keep the 2019 ones). We can use case_when() or case_match(); recode() has been superseded by case_match():

#_________________RENAME member_casual COLUMN FROM member & casual TO Subscriber & Customer
all_trips19_20 <- trim_trips19_20 %>% 
     mutate(member_casual = case_when(
        member_casual == "member" ~ "Subscriber",
        member_casual == "casual" ~ "Customer",
        member_casual == "Subscriber" ~ "Subscriber",
        TRUE ~ "Customer"
     ))
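
For reference, here’s the same recoding with case_match(), the modern replacement for recode() (a sketch; requires dplyr 1.1.0 or newer):

#_________________EQUIVALENT RECODE USING case_match()
all_trips19_20 <- trim_trips19_20 %>% 
     mutate(member_casual = case_match(
        member_casual,
        "member" ~ "Subscriber",
        "casual" ~ "Customer",
        .default = member_casual   # leave the 2019 values untouched
     ))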

table() to verify

Just to make sure we ended up with ONLY the two desired values and didn’t miss anything, we can use the code below to check which values exist in the member_casual column:

#_________________CHECK TO SEE IF ALL WE HAVE ARE THE TWO DESIRED VALUES
table(all_trips19_20$member_casual)
OUTPUT
  Customer Subscriber 
     71643     720313 
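
Since janitor is already loaded, tabyl() gives the same counts with percentages thrown in; a quick sketch:

#_________________SAME CHECK WITH PERCENTAGES VIA janitor
tabyl(all_trips19_20, member_casual)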

Observations

  • My prior observations regarding the data are validated, as we can only gather the total number of rides taken by “Customers” vs “Subscribers”.
  • We can break down each type per station, per day, per hour, per gender, per age, but it’s impossible to know how many actual “riders” we are dealing with. How are we supposed to know how many “Customer” users are there?
  • How can we even imagine an entire marketing campaign in an attempt to convert users that we can’t quantify?
  • The data shows that Customer trips amount to only about 9.95% of Subscriber trips!
  • Once again, not knowing how many actual users generated all these rides, it is NOT ADVISABLE to spend ANY resources trying to analyze or devise a marketing strategy or campaign around that small slice of our rides.
  • My recommendation to my manager is that we stop this analysis at this point until additional data is provided or collection methods are revised.

calculate ride_length

difftime() yields a result of class difftime (shown as drtn in a tibble), which looks like this: 390 secs, 441 secs, …; so I used as.numeric() to convert it to a plain number, which drops the “secs” label as well.

#____________________CALCULATE ride_length AND DROP tripduration
all_trips19_20 <- all_trips19_20 %>% 
     mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = 'secs')))

drop column

#____________________DROP tripduration COLUMN
all_trips19_20 <- all_trips19_20 %>% 
     select(-c(tripduration))

check for ZERO values

Let’s check to see what kind of values are in ride_length

#___________________CHECK FOR  VALUES==0 IN ride_length
any(all_trips19_20$ride_length == 0)
[1] TRUE

check for <0 values

#___________________CHECK FOR NEGATIVE VALUES IN ride_length
any(all_trips19_20$ride_length < 0)
[1] TRUE
#___________________ANOTHER WAY IS USING ALL
all(all_trips19_20$ride_length > 0)
[1] FALSE

check for NA values

#___________________CHECK FOR NA VALUES IN ride_length
any(is.na(all_trips19_20$ride_length))
[1] FALSE
#__________________OF COURSE WE CAN CHECK FOR MISSING DATA 
any(!is.na(all_trips19_20$ride_length))
[1] TRUE
#__________________AND CHECK FOR NON-MISSING VALUES THAT ARE < 0 OR == 0
any(all_trips19_20$ride_length <= 0, na.rm = TRUE)
[1] TRUE

count trips of 0 length

#___________________COUNT HOW MANY ride_length VALUES == 0
sum((all_trips19_20$ride_length == 0))
[1] 93

count <0 values

#___________________COUNT HOW MANY NEGATIVE ride_length VALUES WE HAVE
sum((all_trips19_20$ride_length < 0))
[1] 117

identify <0 rows

You could install another package if you want to identify the rows that contain the negative ride_length values.

#______________________IDENTIFY ROWS WITH NEGATIVE ride_length
install.packages("fixr")
library(fixr)

check_for_negative_values(all_trips19_20$ride_length)
Data frame contains negative values.
  [1] 520402 648667 651963 652290 653845 655288 655654 656362 657167 658842 
  ...
[109] 779170 779496 781519 781699 786503 786687 788671 788946 789252

As you can see, when we specifically target the ride_length column we get a list of all the row indices that contain a negative value in that column.
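
If you’d rather not install an extra package, base R’s which() returns the same row indices; a quick sketch:

#______________________BASE R ALTERNATIVE: SAME ROW INDICES, NO EXTRA PACKAGE
neg_rows <- which(all_trips19_20$ride_length < 0)
length(neg_rows)   # 117
head(neg_rows)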

identify ALL <0 values

Let’s see if any other NEGATIVE values exist in the entire dataset. The output displays every row where a NEGATIVE value appears, along with its column. As you can see we have 117 rows, and the only column with negative values is column 10, which happens to be the ride_length column.

#_____________________IDENTIFY ANY NEGATIVE VALUES IN DF
check_for_negative_values(all_trips19_20)
Data frame contains negative values.
          row col
  [1,] 520402  10
  [2,] 648667  10
  [3,] 651963  10
  [4,] 652290  10
  ...
[115,] 788671  10
[116,] 788946  10
[117,] 789252  10

extract ride_length <0 rows

Let’s see what stands out in these rows with NEGATIVE ride_length values by creating a new df with just those rows:

#_____________________LET'S CREATE A DF WITH JUST THE NEGATIVE ride_length FOR REVIEW
neg_length <- all_trips19_20[all_trips19_20$ride_length < 0, ]
View(neg_length)

the new df shows 117 rows and:

  • all start_station_name and end_station_name values == HQ QR
  • consequently, all start_station_id and end_station_id values == 675
  • after further research I realized these were bikes taken out of circulation for quality control, maintenance…
  • so we need to remove these rows from the dataset

extract ride_length = 0  rows

#_____________________LET'S CREATE A DF WITH JUST THE ZERO ride_length FOR REVIEW
zero_length <- all_trips19_20[all_trips19_20$ride_length == 0, ]
View(zero_length)

Similar to the NEGATIVE value rows, all the observations listed above apply here as well, so we’ll be removing these rows from the dataset too. There are 93 rows in this df.

check df dimensions

Before we delete any rows, let’s check the dimensions of the df

#_____________________CHECK DF DIMENSIONS
dim(all_trips19_20)
[1] 791956     10

delete <= 0 rows

I will not delete any rows; instead I’ll create a new df excluding the rows with ride_length <= 0. If you want to go further and add another logical condition (OR |) for station names == HQ QR, be my guest; I will not do it. Instead I will check the new df afterwards for station names == HQ QR and deal with them then if they exist. Including that condition in this step might produce results I am not ready for yet. We know for sure we only have 117 + 93 = 210 rows to drop, so let’s focus on that now.

#____________________CREATE A NEW DF WITH POSITIVE ride_length
all_trips19_20_v2 <- all_trips19_20[all_trips19_20$ride_length > 0,]
dim(all_trips19_20_v2)
[1] 791746     10
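
Quick sanity check: the drop should equal the 117 negative plus 93 zero-length rides:

#____________________SANITY CHECK: 791956 - 791746 = 210 = 117 + 93
nrow(all_trips19_20) - nrow(all_trips19_20_v2)
[1] 210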

extract “HQ QR” rows

Now that we’ve taken out all rows that had a ride_length <= 0, let’s go back and see if there are any rows where the start_station_name or end_station_name columns have the value “HQ QR”.

#____________________LET'S CHECK TO SEE IF "HQ QR" APPEARS IN ANY *_station_name COLUMNS
hq_anywhere <- all_trips19_20_v2 %>% 
     filter(start_station_name == "HQ QR" | end_station_name == "HQ QR")
View(hq_anywhere)

It appears that 3558 rows have HQ QR even though their ride_length is > 0, so we need to remove them as well as anything <= 0.

delete rows with “HQ QR”

#_________________REMOVE ROWS THAT CONTAIN "HQ QR"
all_trips19_20_v3 <- all_trips19_20_v2[
        !(all_trips19_20_v2$start_station_name == "HQ QR" |
          all_trips19_20_v2$end_station_name == "HQ QR"), ]
dim(all_trips19_20_v3)
[1] 788188     10
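
And another quick check: the rows removed here should match the 3558 “HQ QR” rows found above (791746 - 788188 = 3558):

#_________________SANITY CHECK: ROWS DROPPED == "HQ QR" ROWS FOUND ABOVE
nrow(all_trips19_20_v2) - nrow(all_trips19_20_v3)
[1] 3558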

add timeframes

Given how the data is set up, it will help down the road if we break the started_at timestamp into more detailed columns, such as year, quarter, month, day of the month, and day of the week. The reason is that when we perform our analysis, we can then look for patterns across weekdays, weekends, and specific months or quarters.

#_______________LET'S BREAK started_at INTO >DATE,YEAR,QUARTER,MONTH(NUM),DAY(NUM),WEEKDAY(STRING)
#_______________ADD TIMEFRAMES

all_trips19_20_v3 <- all_trips19_20_v3 %>% 
  mutate(
     date = format(as.Date(started_at), format = "%m%d%Y"),   #mmddYYYY
     year = format(as.Date(started_at), format = "%Y"),       #%Y > 4-digit year, %y > 2-digit
     quarter = quarters(started_at),                          #quarter, "Q1"-"Q4"
     month = format(as.Date(started_at), format = "%m"),      #month as number
     day = day(started_at),                                   #day of the month as number
     day_of_week = weekdays(started_at),                      #weekday name as text
  )
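
For what it’s worth, lubridate’s accessors (already loaded with the tidyverse) return typed values instead of character strings; a sketch of the same breakdown:

#_______________ALTERNATIVE SKETCH WITH lubridate ACCESSORS (TYPED RESULTS)
all_trips19_20_v3 %>% 
  mutate(
     year        = year(started_at),                 #integer
     month       = month(started_at),                #1-12
     day_of_week = wday(started_at, label = TRUE)    #ordered factor Sun-Sat
  ) %>% 
  select(started_at, year, month, day_of_week) %>% 
  head()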

unique start stations

#__________________________LET'S SEE HOW MANY UNIQUE STARTING STATIONS WE HAVE
start_id <- all_trips19_20_v3 %>%
            #filter(year == 2020) %>% 
            distinct(start_station_id) %>% 
            arrange(start_station_id) %>% 
            rename(stationID = start_station_id)
#611 ROWS; if we filter to 2020 only we get 606 rows, so 5 stations were dropped in 2020
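
If all we want is the count itself, dplyr’s n_distinct() gets it in one line:

#__________________________QUICK COUNT OF UNIQUE STARTING STATIONS
n_distinct(all_trips19_20_v3$start_station_id)   #611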

unique end stations

#__________________________LET'S SEE HOW MANY UNIQUE ENDING STATIONS WE HAVE
end_id <-all_trips19_20_v3 %>%
         #filter(year == 2020) %>% 
         distinct(end_station_id) %>% 
         arrange(end_station_id) %>% 
         rename(stationID = end_station_id) 
#611 ROWS

compare stations

#_________________________LET'S SEE IF BOTH STATION LISTS ARE THE SAME
identical(start_id$stationID, end_id$stationID) #FALSE!
all.equal(start_id, end_id) #"Component 1: Mean relative difference: 0.002392677" 
glimpse(end_id) #both are dbl
compare_df_cols_same(start_id, end_id)  #janitor: do both have the same columns/types?
compare_df_cols(start_id, end_id)

outer join lists

#________________OUTER JOIN THEM BECAUSE WE WANT ALL DISTINCT STATION ID FROM BOTH
join_station_id <- merge(start_id, end_id, by = 'stationID', all = TRUE) 
#613 ROWS
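
dplyr’s full_join() produces the same outer join, if you prefer staying in the tidyverse:

#________________TIDYVERSE EQUIVALENT OF THE OUTER JOIN
join_station_id <- full_join(start_id, end_id, by = "stationID")
#613 ROWS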

filter lists

#___________________FILTER OUT STARTING & ENDING STATIONS WITHOUT COORDINATES
station_coord <- all_trips19_20_v3 %>%
                 select(start_station_id, start_lat, start_lng) %>% 
                 filter(!is.na(start_lat), !is.na(start_lng)) %>% 
                 group_by(start_station_id, start_lat, start_lng) %>% 
                 summarize(number_of_rides = n()) %>%   #collapse to one row per station
                 arrange(start_station_id) %>% 
                 rename(stationID = start_station_id)

#606 ROWS; fewer because some trips didn't have geo coordinates

uncommon stations

#____________________LET'S SEE WHAT'S UNCOMMON BETWEEN THE TWO
anti_join(join_station_id, station_coord)
# Joining with `by = join_by(stationID)`
# below are the 7 stationID values that are not in station_coord
# stationID
# 1 360
# 2 361
# 3 561
# 4 564
# 5 565
# 6 566
# 7 665

review uncommon ids

#____________________LET'S EXTRACT THOSE STATIONS FOR REVIEW
uncommon <- all_trips19_20_v3[all_trips19_20_v3$start_station_id %in%
                              c(360, 361, 561, 564, 565, 566, 665), ]
#OUTPUT
# 40 ROWS of trips without geolocations, all from 2019.
# Those stations are no longer active in 2020, so we can ignore them and proceed
# with an inner join, which will eliminate the uncommon ones.

inner join lists

Here I’ll take the cleaned station list and join it with the stations that have coordinates. Since some stations in the list are no longer in use, an inner join will automatically drop those unwanted stations from the join_station_id list.

#_____________________________INNER JOIN TO COMBINE COORDINATES WITH ACTIVE STATIONS
all_active_stations_coord <- inner_join(join_station_id, station_coord, by = 'stationID')
#____________________________DROP NUMBER OF RIDES COLUMN THAT WAS USED TO GROUP THEM
all_active_stations_coord <- all_active_stations_coord %>% 
     select(-c(number_of_rides))

Save data

save cleaned data

Now we have two cleaned files: one with all the pertinent data we need, and the other with the coordinates of every station in the first file.

#__________________________________SAVE CLEANED DATA________________________
file.create("rcleaned_bikeshare_q1_19_20.csv")
write_csv(all_trips19_20_v3,"rcleaned_bikeshare_q1_19_20.csv")
file.create("rcleaned_bikeshare_geolocation_20.csv")
write_csv(all_active_stations_coord,"rcleaned_bikeshare_geolocation_20.csv")

ANALYZE WITH R

Analyze with R

count rides/user

#______________________________LET'S SEE THE COUNT FOR EACH USER TYPE
table(all_trips19_20_v3$member_casual)
  Customer Subscriber 
     67877     720311 

The two user types add up to 788188 rides, which matches the df dimensions from a couple of sections above.
#_____________________________COULD BE DONE THESE OTHER WAYS
all_trips19_20_v3 %>% 
    group_by(member_casual) %>% 
    summarize(number_of_rides = n()) %>% 
    gt() %>% 
    gtsave( "total_per_user.png")
#___________________________________OR THIS WAY
all_trips19_20_v3 %>% 
    count(member_casual) %>% 
    gt()

observation
  • I recommend tracking rides with a userId so we can see which users are more active than others and gather insight into their usage.
  • If users are tracked via ID and zip code, we’ll know how many of them are business commuters, casual local users, or actual tourists who just ride while in town. That way we can run a more focused marketing campaign.

summary

#________________MAX, MIN, MEAN, MEDIAN FOR ride_length
min_trip_hours = round(min(all_trips19_20_v3$ride_length)/3600, 2) 
max_trip_minutes = round(max(all_trips19_20_v3$ride_length)/60, 2)
max_trip_hours = round(max(all_trips19_20_v3$ride_length)/3600, 2)
median(all_trips19_20_v3$ride_length)
mean(all_trips19_20_v3$ride_length)
#_______________________INSTEAD I PREFER TO DO IT ALL IN ONE STATEMENT
summary(all_trips19_20_v3$ride_length)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       1      331      539     1189      912 10632022 
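
Those values are in seconds; dividing by 60 gives more readable minutes (and that max of 10632022 seconds, roughly 123 days, screams outlier):

#_______________________SAME SUMMARY IN MINUTES FOR READABILITY
summary(all_trips19_20_v3$ride_length / 60)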

summarise

We’ll get a better idea of the breakdown if we group by type of user and then calculate the statistics for each category:

#______________________GROUP BY TYPE OF RIDER THEN CALCULATE STATISTICS
all_trips19_20_v3 %>% 
   group_by(member_casual) %>% 
      summarise(
           max    = max(ride_length),
           min    = min(ride_length),
           median = median(ride_length),
           mean   = mean(ride_length)
           ) %>% 
   gt()

rides per year

#_________________LOAD LIBRARY SCALES
#_______________TOTAL TRIPS FROM YEAR TO YEAR

library(scales)

#_______________NOTE: the analysis blocks below use a trips_19_20 df with a numeric
#_______________week_day column (1 = Sunday); I create it here from the cleaned df
#_______________so the code runs as shown
trips_19_20 <- all_trips19_20_v3 %>% 
          mutate(week_day = wday(started_at))

trips_year <- trips_19_20 %>% 
          group_by(year) %>% 
          summarise(number_of_rides = n()) %>%
          mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>% 
          gt() %>% 
          tab_header(title = md("**Total rides/year** "))
gtsave(trips_year, "trips_year.png")

rides per year/user type

#___________________YEARLY CHANGE PER USER TYPE
user_year <- trips_19_20 %>% 
          group_by(member_casual, year) %>% 
          summarise(number_of_rides =n()) %>% 
          mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
          gt() %>% 
          tab_header(title =md("**Rides/year per user type** "))
gtsave( user_year, "user_per_year.png")

observation
  • The table above clearly shows that marketing has done a great job attracting new Customers, with a 93% increase in Customer rides.
  • The Subscriber increase is very good as well.
  • Once again the disparity between the totals for each user type is glaring; given the previous table showed a 16% increase in total rides from 2019 to 2020, the 93% Customer growth still contributed only about 5% of total rides.
  • It is imperative that we tie the number of users to the number of rides (which is not done at this time) to gain any meaningful insight from our analysis.

total rides per day

Let’s see the breakdown of total rides per day per user type (member_casual)

#______________USE TABLE FOR BREAKDOWN TOTAL RIDES/DAY/USERTYPE
daily_table <- trips_19_20 %>% 
   group_by(member_casual, week_day) %>% 
      summarise(max = max(ride_length),
      min = min(ride_length),
      median = median(ride_length),
      mean = mean(ride_length),
      number_of_rides = n()) %>% 
#__________________DESIGN TABLE OUTPUT
   gt() %>% 
   tab_header(
              title = md("**Daily breakdown for both _user types_** ")
             ) %>% 
   fmt_number(columns = c(median, mean), decimals = 0)
gtsave(daily_table, "daily_bkdwn.png")

condensed table

#________________CONDENSE TABLE FOR EASY READ
trips_19_20 %>%
          group_by(member_casual, week_day) %>%
          summarise(number_of_rides =n()) %>%
          gt() %>%
          tab_header(title =md("**Daily breakdown for both _user types_** "))

total rides per day/year

#_______________________LET'S SEE HOW TRIPS CHANGED PER DAY FROM YEAR TO YEAR
daily_yearly <- trips_19_20 %>% 
            group_by(member_casual,year, week_day) %>% 
            summarise(number_of_rides =n()) %>% 
            gt() %>% 
            tab_header(title =md("**Daily breakdown for both _user types year vs year** "))
gtsave( daily_yearly, "daily_yearly.png")

percent change rides/day

#_________________LET'S SEE PERCENTAGE CHANGE FROM YEAR TO YEAR PER DAY

daily_percent <- trips_19_20 %>% 
              group_by(member_casual, week_day, year) %>%
              summarise(number_of_rides =n()) %>% 
              mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
              gt() %>% 
              tab_header(title =md("**Daily percent change for both _user types year vs year** "))
gtsave( daily_percent, "daily_percent.png")

observations

FYI day 1 = Sunday, day 2 = Monday and so on

Customers user type:

  • Customer users are more active on Saturdays and Sundays, which leads one to believe those could be tourists or visitors from one of the suburbs of Chicago.
  • As I noted before, tracking zip code per user, or a userId with a profile, would quickly focus and clarify our path to a more insightful analysis.
  • The most obvious percent increase is on day 1, which is Sunday.
  • Saturday only had a 25% increase, yet still has the second-most rides, behind only Sunday.
  • I am not sure how you can convert weekend users to Subscribers if they are just occasional visitors to the city. Without knowing the pricing of a daily pass compared to a yearly subscription, I need more information to devise a logical, effective marketing strategy for their conversion.
  • We also don’t know whether any of the users are repeat users or whether these are all new cyclists being exposed to our bikes.
  • If we are attracting unique users, it will be worth marketing to them with a different plan than the regular commuters who make up the weekday usage, as shown by the Subscriber type.

Subscriber user type:

  • Regardless of which year you review, weekday use by this type of user overwhelms the weekend numbers.
  • Thursdays and Fridays showed a decline in rides, but those numbers are still much higher than weekend numbers.
  • The cause of these declines could be a lack of bikes to use; how do we know how many bikes are at each station?
  • How do we know if a user walks to their regular station to pick up a bike and finds none available?
  • Does the app show bikes available at each station at all times?
  • Do we track how many times a user had to go to an alternate station due to the lack of available bikes?
  • Those are numbers that could easily be tracked with a more user-friendly app.
  • Once again it comes down to UX: how do we know the frustration level of a user who has to walk several blocks to another station, or worse yet take a cab or walk instead?

busiest stations

#________________STATIONS WITH THE LARGEST NUMBER OF RIDES STARTED
top_stations <- trips_19_20 %>% 
    group_by(start_station_id) %>% 
    summarise(number_of_rides = n()) %>% 
    arrange(desc(number_of_rides))

top 5 starting stations

#________________TOP 5 STATIONS WHERE RIDES WERE INITIATED 
top_5 <- top_stations %>% 
    slice(1:5)

observations
  • Interesting numbers: the top 3 stations individually have roughly 48-58% more rides than the 4th-busiest station
  • I will zoom in on these top 3 stations and see what insight I can gather

top 3 change/year/day

#_________________LET'S SEE % CHANGE FROM YEAR TO YEAR PER DAY FOR THE TOP 3 STATIONS

rides_top_stations <- trips_19_20 %>%
                filter(start_station_id %in% c(192, 77, 91)) %>% #TOP 3 ONLY
                group_by(week_day, start_station_id, year) %>%
                summarise(number_of_rides = n()) %>% 
                mutate(pct_change = percent((number_of_rides - lag(number_of_rides))/lag(number_of_rides))) %>%
                arrange(desc(start_station_id)) %>% 
                gt() %>% 
                tab_header(title = md("**Top 3 stations: rides/day year/year** "))
gtsave(rides_top_stations, "rides_top_stations.png")

observations
  • First off, let me explain how to quickly read the table:
    • It is divided into 3 vertical sections.
    • In column 1 you’ll see the day of the week (1 = Sunday), then the starting station id (e.g. 192), followed by two rows for 2019 and 2020; column 2 is the number of rides; column 3 is the percent change from the year before for that same day and station.
  • The obvious: weekday numbers (days 2-6) are more than 10X the weekend numbers.
  • Station 192: weekday ride numbers show a smaller percent increase from 2019.
  • Station 91: every day shows a decrease in rides except Sunday, and the decreases are significant, close to 20% on average.
  • Station 77: saw decreases from Thursday through Saturday, with Wednesday having a 1% increase, which is not significant.
  • I’ll dig into the following next to see if it leads somewhere:
    • There are about 4000 rides per WEEKDAY initiated from just these 3 stations.
    • Is inventory being tracked at these stations, and if so how? If not, are we relying on users to walk to nearby stations?
    • The proximity of these 3 stations to one another.
    • First thing I worry about: are there enough bikes to rent out from these 3 locations?
    • Is UX the cause of the decline in rides?
    • I will check how many rides terminate at these top 3 stations, to see whether inventory is being replenished by migration from other stations.
    • I will check how many cyclists start and end their rides at these top 3 stations, to see whether that affects the supply of bikes.

top 3 rides/2020

#________________DAILY RIDES STARTED FROM THE TOP 3 STATIONS FOR 2020
started_top3 <- trips_19_20 %>%
             #%in% plus a separate year condition; & binds tighter than |, so a
             #one-line | filter would have applied year == 2020 to station 91 only
             filter(start_station_id %in% c(192, 77, 91), year == 2020) %>% 
             group_by(week_day, start_station_id) %>%
             summarise(number_of_rides = n()) %>% 
             arrange(desc(start_station_id)) %>% 
             gt() %>% 
             tab_header(
                       title = md("**Top 3 stations: rides/day for 2020** "),
                       subtitle = md("STARTING stations")
                       )
gtsave(started_top3, "started_top3s.png")

top 5 ending stations

#_____________________________ FIGURE OUT THE TOP ENDING STATIONS
ending_trips<- trips_19_20 %>%
               group_by(end_station_id) %>%
               summarise(number_of_rides =n()) %>%
               arrange(desc(number_of_rides))

#____________________________FILTER TO TOP 5 ENDING STATIONS AND PRINT A TABLE
end_5 <- ending_trips %>%
         slice(1:5) %>%
         gt() %>%
         tab_header(title =md("**Top 5 ENDING STATIONS** "))
gtsave( end_5, "end5.png")

chart rides per day

#_______________________________PLOT_________________________
trips_19_20 %>% 
   group_by(member_casual, week_day) %>% 
   summarise(number_of_rides =n()) %>% 
   ggplot(aes(x= week_day, y= number_of_rides, fill = member_casual, label = number_of_rides)) + 
        geom_bar(stat = "identity") + 
        geom_text(size = 3, position = position_stack(vjust = 0.5))+ 
        labs(title="Total rides per day per user type")
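
A dodged version is a matter of taste; a sketch using geom_col(), which is shorthand for geom_bar(stat = "identity"):

#_______________________________ALTERNATIVE: DODGED BARS INSTEAD OF STACKED
trips_19_20 %>% 
   group_by(member_casual, week_day) %>% 
   summarise(number_of_rides = n(), .groups = "drop") %>% 
   ggplot(aes(x = week_day, y = number_of_rides, fill = member_casual)) + 
        geom_col(position = "dodge") + 
        labs(title = "Total rides per day per user type",
             x = "day of week", y = "number of rides")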

PROCESS WITH SQL

We already changed the column names to match in both tables earlier in R, so let’s continue from there with SQL.

filter columns & union tables

SELECT
  CAST(ride_id AS STRING) AS ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
  ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips20`

rename member_casual

SELECT
  CAST(ride_id AS STRING) AS ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  member_casual
FROM `prime-depot-415622.BikeShare.trips19v1`
UNION ALL
SELECT
  ride_id,
  started_at,
  ended_at,
  start_station_id,
  start_station_name,
  CASE
    WHEN  member_casual = 'member' THEN 'Subscriber'
    WHEN  member_casual = 'casual' THEN 'Customer'
    ELSE  member_casual
  END
FROM `prime-depot-415622.BikeShare.trips20`

verify trip count

Same as we did with R in the table() to verify section above, let’s check how many trips we have for each member_casual type; the additions relative to the previous query are the outer SELECT and GROUP BY. The results are the same as what we had with R up to this point.

  • We’ll create a subquery to enclose what we’ve done so far
  • Name the subquery -> subquery
  • Group and count the number of occurrences for each type of rider
SELECT
  subquery.member_casual,
  COUNT(*) AS trip_count
FROM
  ( 
  SELECT
    CAST(ride_id AS STRING) AS ride_id, 
    started_at,
    ended_at,
    start_station_id, 
    start_station_name,
     member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    CASE
      WHEN  member_casual = 'member' THEN 'Subscriber'
      WHEN  member_casual = 'casual' THEN 'Customer'
      ELSE  member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
GROUP BY
  subquery.member_casual
OUTPUT
member_casual   trip_count
Subscriber          720313
Customer             71643
Total               791956

calculate ride_length

SELECT
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
  ( 
  SELECT
    CAST(ride_id AS STRING) AS ride_id, 
    started_at,
    ended_at,
    start_station_id, 
    start_station_name,
     member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    CASE
      WHEN  member_casual = 'member' THEN 'Subscriber'
      WHEN  member_casual = 'casual' THEN 'Customer'
      ELSE  member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery

filter out <=0 ride_length

SELECT
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
  ( 
  SELECT
    CAST(ride_id AS STRING) AS ride_id, 
    started_at,
    ended_at,
    start_station_id, 
    start_station_name,
     member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    CASE
      WHEN  member_casual = 'member' THEN 'Subscriber'
      WHEN  member_casual = 'casual' THEN 'Customer'
      ELSE  member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND)>0;

OUTPUT Total rows: 791746

look for “HQ QR”

  • It appears that 3557 rows have “HQ QR” as the start or end station AND ride_length > 0:
start_station_name = "HQ QR" 
AND 
TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
  • And 3768 rows have “HQ QR” regardless of ride_length:
WHERE
  start_station_name = "HQ QR"
OR
  end_station_name = "HQ QR";

filter out “HQ QR”

SELECT
   TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
  ( 
  SELECT
    CAST(ride_id AS STRING) AS ride_id, 
    started_at,
    ended_at,
    start_station_id, 
    start_station_name,
    end_station_name,
    end_station_id,
    member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    end_station_name,
    end_station_id,
    CASE
      WHEN  member_casual = 'member' THEN 'Subscriber'
      WHEN  member_casual = 'casual' THEN 'Customer'
      ELSE  member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE 
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;

OUTPUT Total rows: 788188

add timeframes

Let’s add the following columns so we can expand our analysis. The way the data is set up now, all we have are the trip id and the start and end TIMESTAMPs.

SELECT
  EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
  EXTRACT(MONTH FROM subquery.started_at) AS month,
  EXTRACT(YEAR FROM subquery.started_at) AS year,
  EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
  EXTRACT(DATE FROM subquery.started_at) AS date,
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
FROM
  ( 
  SELECT
    CAST(ride_id AS STRING) AS ride_id, 
    started_at,
    ended_at,
    start_station_id, 
    start_station_name,
    end_station_name,
    end_station_id,
    member_casual
  FROM `prime-depot-415622.BikeShare.trips19v1`
  UNION ALL
  SELECT
    ride_id,
    started_at,
    ended_at,
    start_station_id,
    start_station_name,
    end_station_name,
    end_station_id,
    CASE
      WHEN  member_casual = 'member' THEN 'Subscriber'
      WHEN  member_casual = 'casual' THEN 'Customer'
      ELSE  member_casual
    END
  FROM `prime-depot-415622.BikeShare.trips20`
  ) AS subquery
WHERE 
  (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
  AND
  TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0;

ANALYZE WITH SQL

Analyze with SQL

count rides/User

We wrap what we have so far in another subquery (subquery2) and add an outer SELECT plus GROUP BY to the previous block:

SELECT
   subquery2.member_casual,
   COUNT(*) AS trip_count
FROM
  (
  SELECT
     subquery.member_casual,
     EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
     EXTRACT(MONTH FROM subquery.started_at) AS month,
     EXTRACT(YEAR FROM subquery.started_at) AS year,
     EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
     EXTRACT(DATE FROM subquery.started_at) AS date,
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
     ( 
     SELECT
       CAST(ride_id AS STRING) AS ride_id, 
       started_at,
       ended_at,
       start_station_id, 
       start_station_name,
       end_station_name,
       end_station_id,
       member_casual
     FROM `prime-depot-415622.BikeShare.trips19v1`
     UNION ALL
     SELECT
       ride_id,
       started_at,
       ended_at,
       start_station_id,
       start_station_name,
       end_station_name,
       end_station_id,
       CASE
         WHEN  member_casual = 'member' THEN 'Subscriber'
         WHEN  member_casual = 'casual' THEN 'Customer'
         ELSE  member_casual
       END
     FROM `prime-depot-415622.BikeShare.trips20`
     ) AS subquery
   WHERE 
     (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
     AND
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
   ) AS subquery2
GROUP BY
subquery2.member_casual;

OUTPUT
member_casual      trip_count
Subscriber             720311 
Customer                67877 

group & sort

Let’s group the number of trips by user type (member_casual) and by day of the week.

#___________________________Add day_of_week to the outer SELECT, GROUP BY, and ORDER BY of the previous block:
SELECT
   subquery2.member_casual,
   subquery2.day_of_week,
   COUNT(*) AS trip_count
FROM
  (
  SELECT
     subquery.member_casual,
     EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
     EXTRACT(MONTH FROM subquery.started_at) AS month,
     EXTRACT(YEAR FROM subquery.started_at) AS year,
     EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
     EXTRACT(DATE FROM subquery.started_at) AS date,
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
     ( 
     SELECT
       CAST(ride_id AS STRING) AS ride_id, 
       started_at,
       ended_at,
       start_station_id, 
       start_station_name,
       end_station_name,
       end_station_id,
       member_casual
     FROM `prime-depot-415622.BikeShare.trips19v1`
     UNION ALL
     SELECT
       ride_id,
       started_at,
       ended_at,
       start_station_id,
       start_station_name,
       end_station_name,
       end_station_id,
       CASE
         WHEN  member_casual = 'member' THEN 'Subscriber'
         WHEN  member_casual = 'casual' THEN 'Customer'
         ELSE  member_casual
       END
     FROM `prime-depot-415622.BikeShare.trips20`
     ) AS subquery
   WHERE 
     (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
     AND
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
   ) AS subquery2

GROUP BY
subquery2.member_casual,
subquery2.day_of_week

ORDER BY
subquery2.day_of_week,
subquery2.member_casual;

OUTPUT: I created the same output in various charts with Tableau, as shown below

ANALYZE WITH TABLEAU

Chart analyses

total rides per day

rides per day/year

Let’s see what has happened from year to year for the same quarter, Q1. It appears that the number of rides has gone up on all days EXCEPT Thursdays and Fridays, where the number of rides has decreased.

total rides per year

location of top 5

top 5 start stations

What’s interesting about the chart that follows:

top 5 v2

top 5 end stations

top5 start=end

breakdown by quarters

The results show an obvious discrepancy between the user types: Subscribers’ trip counts are much higher on weekdays, while on weekend days it’s the opposite.

Let’s see if the results change in different quarters. WAIT, the data provided is for Q1 of both 2019 and 2020, so there is no need to check that.

That’s one thing we’d need to do later: load all months, or create a separate df for each quarter, and then do the comparison.

#________Add quarter to the SELECT, GROUP BY, and ORDER BY of the previous block:
SELECT
   subquery2.quarter,
   subquery2.member_casual,
   subquery2.day_of_week,
   COUNT(*) AS trip_count
FROM
  (
  SELECT
     subquery.member_casual,
     EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
     EXTRACT(MONTH FROM subquery.started_at) AS month,
     EXTRACT(YEAR FROM subquery.started_at) AS year,
     EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
     EXTRACT(DATE FROM subquery.started_at) AS date,
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
     ( 
     SELECT
       CAST(ride_id AS STRING) AS ride_id, 
       started_at,
       ended_at,
       start_station_id, 
       start_station_name,
       end_station_name,
       end_station_id,
       member_casual
     FROM `prime-depot-415622.BikeShare.trips19v1`
     UNION ALL
     SELECT
       ride_id,
       started_at,
       ended_at,
       start_station_id,
       start_station_name,
       end_station_name, 
       end_station_id,
       CASE
         WHEN  member_casual = 'member' THEN 'Subscriber'
         WHEN  member_casual = 'casual' THEN 'Customer'
         ELSE  member_casual
       END
     FROM `prime-depot-415622.BikeShare.trips20`
     ) AS subquery
   WHERE 
     (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
     AND
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
   ) AS subquery2

GROUP BY
subquery2.quarter,
subquery2.member_casual,
subquery2.day_of_week

ORDER BY
subquery2.quarter,
subquery2.day_of_week,
subquery2.member_casual;

same station rides

Let’s see how many rides ended at the station where they started. Sketching just the extra WHERE clause (start_station_id and end_station_id would need to be carried through the subqueries for this to run):

WHERE
  subquery2.start_station_id = subquery2.end_station_id
OUTPUT   23626
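
In R, the same check is a one-liner on the cleaned data frame; a sketch that should reproduce the 23626 figure:

#___________________RIDES THAT ENDED WHERE THEY STARTED
sum(all_trips19_20_v3$start_station_id == all_trips19_20_v3$end_station_id,
    na.rm = TRUE)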

Observations

If we group by station, count trips, and sort, a glaring figure jumps out at us: the top 3 stations lead the fourth-place station by roughly 50% in ride count. In order: stations 192, 91, and 77.

rank   member_casual   start_station_id   trip_count
1      Subscriber      192                13799
2      Subscriber      91                 13434
3      Subscriber      77                 12891
4      Subscriber      133                 8720

If we add the Customer rides and recalculate for all users, we get:

rank   start_station_id   trip_count
1      192                14155
2      91                 13640
3      77                 13362

As you can see, the Customer users account for (a quick way to compute this is sketched right after this list):

  • 356 of the 14155 total trips at station 192
  • 206 of the 13640 total trips at station 91
  • 471 of the 13362 total trips at station 77
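
A sketch of that computation in R, using the trips_19_20 frame built earlier:

#___________________CUSTOMER vs SUBSCRIBER COUNTS AT THE TOP 3 STATIONS
trips_19_20 %>% 
    filter(start_station_id %in% c(192, 91, 77)) %>% 
    count(start_station_id, member_casual)
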
observations
  • The company claims that “Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day”; how is that number calculated?
  • The two plans offered to casual riders, single-ride passes and full-day passes, are not tracked in the data provided; all we have is the user type: “Customer” or “Subscriber”.
  • The data provided doesn’t link any of the rides to an actual account. Why don’t we track by userId? Don’t Subscribers have to create an account? Why are we just sorting the data by member_casual?
  • We could gather more insight if the rideId were linked to a userId; is that data available but not provided for analysis?
  • How are docking station locations related to subway, train, or bus stations?
  • Are docking stations near parking structures or major tourist and business centers?
  • Are users coming to the city via other modes of transportation and using bikes within the city for commuting?
  • The objective of the project is to find ways to convert “Customers” to “Subscribers”, which is logical, except the data provided doesn’t support the theory proposed by Mrs. Moreno. The data shows a large discrepancy between the two user types, as Customer rides amount to about 9% of Subscriber rides. See the table below.
  • The data does not show how many actual users are represented by each type, and it is illogical to treat rideId as a userId. The data provided does not support an insightful analysis of the particular hypothesis presented.
  • Would converting 9% of rides improve the bottom line? Sure, but at what cost? How many users would we be converting? How many Subscribers do we already have?
  • Considering that I have not been provided with data relevant to the proposed hypothesis, I would shift my focus to other issues that the data has exposed.
  • The fact that weekday ride numbers have been decreasing at the busiest stations is alarming (see the last table below). The top stations are very close to one another, so it is possible that users are not getting good service, or that a competitor has entered the arena and is targeting the specific small area where the top stations are located. Maybe inventory at those stations doesn’t support the volume of rides initiated from there?
  • The fact that inventory at those stations is not tracked needs to be addressed.
  • The top stations are far more important to the bottom line than the proposed hypothesis; we cannot worry about converting 9% of rides while the top 3 stations are losing rides at a faster pace and in larger numbers than that 9%.

RECOMMENDATIONS

    • Modify the data we track.
    • Implement userId so we can focus our analysis and be more insightful.
    • Focus on UX at the top stations.
    • Stabilize ride numbers at the top stations.
    • Reassess the situation in the near future after modifications have been implemented.