Contents
COMPARE CARE
This document is part of a long, detailed, real business case that took months to complete. My analysis and recommendations led to the creation of a care comparison mobile application months after the release of the data. The information I gathered from this study opened my eyes to many hidden procedures, that the industry wanted to remain buried, and still are. As the healthcare industry took many sharp turns since the release of this data, it’s only the tip of what we’ll never know about the inner workings of our healthcare industry.
This wasn’t the first time for me in the healthcare domain, as I was lead analyst for a complete build of an EHR system for a large hospital in the Midwest, long before the government demanded that all hospitals implement the technology.
Years after these two projects, I was lead analyst for another privately funded EHR dedicated solely for the Homeless population.
Code Explanation
The first part of this document is geared towards case studies and results. For coding, analysis details and step by step instructions, I saved it all for the last section of this document titled RANK ALL STATES. Please, review that section at the bottom of this document for all the details.
Government Background
The Centers for Medicare & Medicaid Services (CMS) and the nation’s hospitals work collaboratively to publicly report hospital quality performance information on Care Compare on Medicare.gov and the Provider Data Catalog.
Hospital Care Compare displays hospital performance data in a consistent, unified manner to ensure the availability of credible information about the care delivered in the nation’s hospitals. The hospitals displayed on Care Compare are generally limited to Acute Care Hospitals, Acute Care Veteran’s Hospitals, Department of Defense Hospitals, Critical Access Hospitals, and Children’s Hospitals. Only data from Medicare-certified hospitals are included on Care Compare. Most of the participants are short-term acute care hospitals that will receive a reduction to the annual update of their Medicare fee-for-service payment rate if they do not participate by submitting data or meet other requirements of the Hospital Inpatient Quality Reporting (IQR) Program and the Hospital Outpatient Reporting (OQR) Program. The Hospital IQR Program was established by Section 501(b) of the Medicare Modernization Act of 2003 and extended and expanded by Section 5001(a) of the Deficit Reduction Act of 2005. The Hospital OQR Program was mandated by the Tax Relief and Health Care Act of 2006.
Finally, in the early 2010’s the president ordered the government to release the data (that has been collected for over a decade) to the public. That’s when my employer requested, I review the data and provide a business plan for the company to capitalize on the information.
As part of any project, it is extremely important to document every little detail so everyone can understand how we arrived at the final answer.
What follows is a quick walk through several case studies leading to a very detailed breakdown in the final case study. Sort of a “lesson learned” document so anyone reviewing the code will follow my train of thought and can easily retrace and duplicate the steps.
I’ve omitted all the work I did using SQL and duplicated it with R.
DATA
The data was downloaded as a large zip file which contained hundreds of .csv files going back for over 10 years. The file was downloaded locally and then the process started.
Note: this project was done over 10 years ago.
data input
# select() is in tidyverse so load it library(tidyverse) #_______________Let's copy the function from before, that reads our datain getfile <- function(file){ fileDir <- setwd("~/CompareCare/") wantedFile = file.path(fileDir,file,fsep="/") return(read.csv(wantedFile)) } datain <- getfile("outcome-of-care-measures.csv")
INSPECT
class
Let’s look at the classes of all the columns in the dataset, first let’s see what type of object it is:
class(datain)
[1] "data.frame"
lapply
Let’s use lapply to find out what the class of each column is
class_list <- lapply(datain,class) class_list list[46] List of length 46 $Provider.Number [1] "character" $Hospital.Name [1] "character" $Address.1 [1].....
MORTALITY RATES
Case Study
This is more of exploratory rather than a case study. Since we’re inspecting the data let’s take a quick look at some of the mortality rates for all hospitals across one condition, just to get an idea of how the values are spread. I’ll do that by plotting a quick histogram of the mortality rates for “heart attacks”. I’m curious as to how wide the data varies across states and outcome.
hist
- Review the attached pdf provided by the government
- Study all the variables in the dataset
- Plot a histogram of the 30 day death rates from heart attacks (col #11) as described in the provided pdf
outcome[ ,11] <- as.numeric(outcome[ ,11]) Warning message: NAs introduced by coercion hist(outcome[ ,11])
Observations
I didn’t expect the rate to vary by 10% (or double) across hospitals. It might not sound like much for the rates to vary by 10% but when that’s double the rate it does raise my curiosity level. This has become extremely interesting; I wonder if there will be a correlation between the extremes. This is when a shark smells blood in the water!
BEST IN STATE
Case Study
Now that I have an idea of what these rates look like, let’s:
- Write a function(best) that accepts 2 arguments: 2 char abbreviated state name, and an outcome name; in order to
- Find the best hospital in each state, names can be found in column 2 –
Hospital.Name
- Mortality rates are tracked in Column 11 –
Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack
- Column 17 –
Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
- Column 23 –
Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia
- Hospitals with no data on that specific outcome should be excluded from that set
- If a tie exists in that set, then sort Hospital names in alphabetical order and the first on that list will be the winner of the tie
- If an invalid state is passed, function should throw an error via the stop function “invalid state”. Column 7 –
State
- If an invalid outcome is passed to best, function should throw an error via the stop function “invalid outcome”
getfile
This will be a duplicate of how we already read the data above but I’ll put it in a function so we can simplify the process for other studies planned during the process.
Let’s start by creating a function that reads the data in:
#_______________Let's create a function that reads our datain
getfile <- function(file){
fileDir <- setwd("D:/~/CompareCare/")
wantedFile = file.path(fileDir,file,fsep="/")
return(read.csv(wantedFile))
}
datain <- getfile("outcome-of-care-measures.csv")
best
Create the function that satisfy the case study:
best <- function(state, outcome) { #read data in using the function getfile() datain <- getfile("outcome-of-care-measures.csv") ## Check that state and outcome are valid if not stop and issue a warning if (!(state %in% datain[,7])){ stop("invalid state")} if (!((outcome == "heart attack") | (outcome == "heart failure") | (outcome == "pneumonia"))) {stop("invalid outcome")} # Create a subset for the only state given by the user # we could use stateonly <- filter(datain, State == state) note state is the argument passed into the function stateonly <- subset(datain, State == state) # Assign outcome input to appropriate column so we can filter that out if (outcome == "heart attack") {usethis = 11} if (outcome == "heart failure") {usethis = 17} if (outcome == "pneumonia") {usethis = 23} # Let's filter/subset the stateonly subset to include Hospital Name, and the outcome given by user subdata <- select(stateonly, c("State", "Hospital.Name", all_of(usethis))) # "Not Applicable" in the 3rd column is not detected by any of the NA functions so I'll do it manually gooddata <- subset(subdata, subdata[3] != "Not Available") #rate in df is a char, so unlist it and coerce it to a numeric gooddata[3] <- as.numeric(unlist(gooddata[3])) # Arrange in ascending order of rate=col[3] first and name=col[2] second to break a tie if it exists #and assign the row number to the new rank column blah <- gooddata |> arrange((gooddata[3]), gooddata[2]) |> mutate(rank = row_number()) #return the hospital name with the lowest rank or row 1 col 2 blah <- (blah[1,2]) }
Sample output
b <- best("TX", "heart attack") [1] "CYPRESS FAIRBANKS MEDICAL CENTER" b <- best("TX", "heart failure") [1] "FORT DUNCAN MEDICAL CENTER" b <- best("MD", "heart attack") [1] "JOHNS HOPKINS HOSPITAL, THE" b <- best("MD", "pneumonia") [1] "GREATER BALTIMORE MEDICAL CENTER" b <- best("BB", "heart attack") Error in best("BB", "heart attack") : invalid state b <- best("NY", "hert attack") Error in best("NY", "hert attack") : invalid outcome
RANK BY STATE
Case Study
Continuing with our exploratory analysis, let’s create a function that returns the hospital for a ranking that we provide, in the state we provide, and for the mortality rate that we provide. For example, let’s say the user wants to know which hospital(s) ranked as top 3, or worst 6 for heart failure in the state of TX?
- The function rankhospital() will take 3 arguments: State, outcome, rank
- This time we want to pull that specific hospital that meets that rank for that specific outcome
- Just as before if state or outcome are invalid execute a stop() with message
- Rank argument accepts “best”, “worst”, or an integer indicating the desired rank, the lower the number the better the rank
- If rank argument is greater than the ranks for the outcome, return NA
- As before, a tie is decided by the alphabetical order of hospitals
- Output is a character vector of the hospital name meeting the condition
getfile
I’ll use getfile() created earlier to load the data.
rankhospital
#______________Function to return the hospital with the desired ranking based on outcome given rankhospital <- function(state, outcome, num = "best") { #read data in using the function getfile() datain <- getfile("outcome-of-care-measures.csv") ## Check that state and outcome are valid if (!(state %in% datain[,7])){ stop("invalid state")} if (!((outcome == "heart attack") | (outcome == "heart failure") | (outcome == "pneumonia"))) {stop("invalid outcome")} # Create a subset for the only state given by the user # we could use stateonly <- filter(datain, State == state) note state is the argument passed into the function stateonly <- subset(datain, State == state) # Assign outcome to appropriate column so we can filter that out if (outcome == "heart attack") {usethis = 11} if (outcome == "heart failure") {usethis = 17} if (outcome == "pneumonia") {usethis = 23} # Let's filter the stateonly subset to include Hospital Name, and the outcome given by user subdata <- select(stateonly, c("State", "Hospital.Name", all_of(usethis))) # Not Applicable in the 3rd column is not detected by any of the NA functions so I'll do it manually gooddata <- subset(subdata, subdata[3] != "Not Available") #now that we've removed empty rows we can test num input and set variables if (num == "best") {num = 1} if (num == "worst") {num = nrow(gooddata)} #rate in df is a char, so unlist it and coerce it to a numeric gooddata[3] <- as.numeric(unlist(gooddata[3])) # Arrange in ascending order of rate=col[3] first and name=col[2] second to break a tie if it exists # Create rank and create new col blah <- gooddata |> arrange((gooddata[3]), gooddata[2]) |> mutate(rank = row_number()) #return the hospital name with the desired rank in blah if (num > nrow(gooddata)) { blah = "NA"} else {blah <- (blah[num,2])} } yep <- rankhospital("TX", "heart attack", 2) [1] "HOUSTON NORTHWEST MEDICAL CENTER" yep <- rankhospital("TX", "heart attack", 1) [1] "CYPRESS FAIRBANKS MEDICAL CENTER" yep <- rankhospital("TX", "heart attack", "best") [1] "CYPRESS FAIRBANKS MEDICAL CENTER" yep <- rankhospital("TX", "heart attack", 102) [1] "LONGVIEW REGIONAL MEDICAL CENTER" yep <- rankhospital("TX", "heart attack", "worst") [1] "LAREDO MEDICAL CENTER" yep <- rankhospital("TX", "heart attack", 2000) [1] "NA" yep <- rankhospital("TX", "heart failure", 4) [1] "DETAR HOSPITAL NAVARRO" > yep <- rankhospital("TX", "heart attack", "worst") [1] "LAREDO MEDICAL CENTER" yep <- rankhospital("MD", "heart attack", "worst") [1] "HARFORD MEMORIAL HOSPITAL" yep <- rankhospital("MN", "heart attack", 5000) [1] "NA"
RANK ALL STATES
Ok, so let’s back up a bit. Let’s say the user wants to know which state has the best rates for each outcome. Let’s say we want to narrow down the states with the best mortality rates for heart attacks. Or maybe the worst states for pneumonia?
Case Study
Create a function rankall() that:
- Takes 2 args: (outcome) and hospital ranking (num)
- Outcome is identical to what we used before
- (num)
- Output is a 2 column df containing the hospital in each state that satisfies the rank=num specified by user
- First column is named: (hospital) and second is (state) containing 2-char abbreviation
- As before if a hospital doesn’t have data for that outcome it should be excluded
- num just as before could be “bet”, “worst”, an integer within the range or out of range of the ranks
- Some states might have NA values as no hospital might meet the ranking requirements, it’s ok. This part could be tricky if you misread it. To deal with it:
- I’ll create a list of all states in the dataset
- Once user enters a num, and the dataset is filtered to that value, I’ll compare this subset to the list of all states and enter NA for any rows in the list of all states that are not in the new subset (filtered) dataset
- Ranking ties are handled as before
- Input validity is as before
PROCESS
Before we start: This section goes through many steps that are more of a teaching moment for others on the team. Even though I already knew which function or code I was planning to use for the function, I took extra steps to outline how other methods could be used to achieve the same outcome. At this point I had many members on the team that had ideas and ways that they wanted to test, so I included in the document as a “lesson learned” document for future reference.
List Groups
Let’s keep inspecting the data, we know we have to extract data by state eventually, so how many states/groups are in the dataframe?
There are so many ways to find that out let me see if I can hit on a few of them here, we can start with lapply, we just used it above so let’s start with it.
lapply
extract column
Here is the final statement, but let’s break it down piece by piece:
state_list <- sort(unique(unlist(lapply(datain,function(santa){datain[7]} ))))
I’ll create an anonymous function to extract the seventh column (State column). This will give us a list of 46 dfs one for each column with all the data grouped by datain[7] for all rows, that’s 4706 rows per column the total rows in the entire dataset. I’ve included a part of the output:
- lapply always puts out a list
- each column is grouped according to the santa() the function based on column [7] the state column
- As you can see, it starts off with AL, well the reason being is that happens to be the first state in the datain dataset.
- I know there is a state AK which would be the first state after the data is sorted.
- So be aware that the data is not sorted, just grouped
Here is how it breaks down:
howmanystates <- lapply(datain,function(santa){datain[7]})
str(howmanystates)
List of 46
$ Provider.Number :'data.frame':
4706 obs. of 1 variable:
..$ State: chr [1:4706] "AL" "AL" "AL" "AL" ...
$ Hospital.Name :'data.frame':
4706 obs. of 1 variable:
..$ State: chr [1:4706] "AL" "AL" "AL" "AL" ...
$ Address.1 :'data.frame':
4706 obs. of 1 variable:
..$ State: chr [1:4706] "AL" "AL" "AL" "AL" ...
$ Address.2 :'data.frame':
4706 obs. of 1 variable:
..$ State:
alter function
Instead of extracting State column[7], let’s look for unique values in column[7] using lapply again. Well the results are not close but close!
- We end up with a df (46,54), that’s 46 columns for the original columns and 54 rows for all the unique State values column[7].
- The funny part is we start with row numbered 1 with every value in that row being AL
- Second row is numbered 99, with every value being AK for all 46 columns
- Third row is numbered 116, with every value being AZ in every column
- What’s interesting is that the total count of occurrences for AL is 98 as you’ll see down below when we count the occurrence of each state
- The number for row 3 is 116 which is 17 higher than row 2 which corresponds to
- The total number of occurences for AK is 17
- So in a way it does provide us with a count for occurrences for each state in a roundabout way
lapply_list <- lapply(datain, function(ho){unique(datain[7])})
unlist
What happens if we unlist howmanystates from above? Let’s try it. We get a list a list of values xxxx long. What happened is unlist will flatten the previous list of dataframes into one long list and we end up flattening a df that had 46 columns and thousands of rows into one long list of all the values.
howmanystates2 <- unlist(lapply(datain,function(santa){datain[7]}))
length(howmanystates2)
[1] 216476
That’s exactly what we want! Let’s now pull in the unique values for State!
unique
As mentioned above let’s see how many unique states are in the dataset. We get a list of characters 54 elements long.
howmanystates2 <- unique(unlist(lapply(datain,function(santa){datain[7]})))
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME"
"MD" "MA" "MI" "MN" "MS" "MO" "MT"
[28] "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "PR" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
"VI" "VA" "WA" "WV" "WI" "WY" "GU"
length(howmanystates2)
[1] 54
sort
As you shown above, the list is not sorted in order so let’s do that with the sort(). Now we have the complete statement –
I changed the name to state_list. We’ll use this later. SAME AS unique_list see below
state_list <- sort(unique(unlist(lapply(datain,function(santa){datain[7]} ))))
[1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS"
"KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS"
[28] "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "PR" "RI" "SC" "SD" "TN"
"TX" "UT" "VA" "VI" "VT" "WA" "WI" "WV" "WY"
split
split() will split the data by State. It will give us a list of 54 dataframes one for each state in the dataset. It is a few steps ahead of using lapply.
Each df is summarized with dimensions of rows and columns which gives us the length of each df, which is the count of rows in each group. The list is sorted in ascending order of State as well.
sgroup <- split(datain,datain$State)
sapply
Instead of using lapply above, we can use sapply, which counts each group of States and we avoid all the other steps. What if we want to pull the length of each out into it’s own list
Now we get count_sgroup which a list of all the states and a column of count/nrow for each state. We used the split data sgroup from above.
count_sgroup <- sapply(sgroup, count)
unique
We could’ve just used unique without lapply() but we still have to sort it. Unique() starts extracting the first element in the State/ column[7] regardless of where in the order it happens to be. It is looking for a unique value in the column not meant to sort the result. So, with sort() we get unique_list, which is the same as state_list of length 54.
unique_list <- sort(unique(datain$State))
SAME AS state_list see above in lapply/sort
n_distinct
Here is the definition of n_distinct in case you missed it from Process page. NA is irrelevant at this stage.
n_distinct() counts the number of unique/distinct combinations in a set of one or more vectors. It’s a faster and more concise equivalent to nrow(unique(data.frame(…))). Usage: n_distinct(…, na.rm = FALSE)
dis_list <- n_distinct(datain$State)
[1] 54
count
Use this one for many reasons. Will copy down to the section below because it provides answers to both sections. So what does this count() do?
- It created a df with two columns
- One column is State
- Second count is (n) the count of occurrence for each state
- sort=TRUE will sort the df in (n) ascending order
- sort=FALSE or omitted will sort the df in ascending order based on State
count_list <- datain |>
count(State)
State n
1 AK 17
2 AL 98
3 AR 77
4 AZ 77
5 CA 341
6 CO 72
7 CT 32
8 DC 8
9 DE 6
10 FL 180
11 GA 132
12 GU 1
13 HI 19
14 IA 109
15 ID 30
16 IL 179
17 IN 124
18 KS 118
19 KY 96
20 LA 114
21 MA 68
22 MD 45
23 ME 37
24 MI 134
25 MN 133
26 MO 108
27 MS 83
28 MT 54
29 NC 112
30 ND 36
31 NE 90
32 NH 26
33 NJ 65
34 NM 40
35 NV 28
36 NY 185
37 OH 170
38 OK 126
39 OR 59
40 PA 175
41 PR 51
42 RI 12
43 SC 63
44 SD 48
45 TN 116
46 TX 370
47 UT 42
48 VA 87
49 VI 2
50 VT 15
51 WA 88
52 WI 125
53 WV 54
54 WY 29
rename
Rename (n) to howmany by adding the last line
count_list <- datain |> count(State) |> rename( howmany = n)
group_by
Of course, group_by can be used. This chunk of code
- Creates a df with the rows grouped by states with the first group being for AL because that happens to be the first row in the datain dataset.
- Number of rows remains the same as the original datain set because all we’ve done is group
- A column labeled n displays the count of occurrences for each group/state, repeated for every row in that group till the next group starts
- So for example AL will have 98 in that column and the next 97 rows will all have 98 till the next group start
- Also another column that’s the sort column is created, in this case that column is labeled datain$State
- Continue the next phase down in filter below
groupdata <- datain |> group_by(datain$State) |> mutate(found = n())
Count per Group
count per state
Let’s see how many entries per state. We could save the information if we need it
table(datain$State)
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114
MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV
68 45 37 134 133 108 83 54 112 36 90 26 65 40 28
NY OH OK OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
185 170 126 59 175 51 12 63 48 116 370 42 87 2 15 88 125 54 29
count & list of groups
This is a very handy way of creating a list as well as the count of each group/state saved in a df. I prefer to use this if I needed a separate list. See List Groups duplicate section explaining this block of code as well as the output for it
count_list <- datain |> count(State)
Observations
- 54 states have data
- Not all states have the same count which means that some states will not not have values depending on what the user inputs in num=XX. For example if the user enters num=25, then right off the bat we know AK, DC, DE, GU, HI, RI, VI, VT will not have any entries at that ranking and the other 48 MIGHT, I say might
- Just because a state has a specific count it doesn’t mean that every column will have a value up to the nrow, so we have to know how to handle those missing values
- All mortality rate columns are of “character” class and we need to output a value related to the rank of each hospital, so we need to coerce these columns to numeric
- We have 46 columns and we only need a handful of columns for this function
CLEAN
Arguments
Since we are going to be writing a function, we want to guard against bad input from the user (wrong values as arguments)
- The function rankall() we are about to create takes 2 arguments
- outcome – which could be any of these three: heart attack, heart failure, pneumonia
- num – takes an integer, or “best” or “worst” with best meaning the lowest rating and worst meaning the highest rating
- If outcome is wrong we want to issue a warning and stop the execution
- Look in the documentation for more information as to which column corresponds to each outcome.
- As listed above, the corresponding columns are: 11, 17, 23
- We already know the State column is column 7
- Hospital Name column is column 2 or Hospital.Name
Here is how we’ll deal with validating the arguments:
# check validity of (outcome input) if (!((outcome == "heart attack") | (outcome == "heart failure") | (outcome == "pneumonia"))) {stop("invalid outcome")} # Assign outcome to appropriate column so we can filter that out if (outcome == "heart attack") {usethis = 11} if (outcome == "heart failure") {usethis = 17} if (outcome == "pneumonia") {usethis = 23}
Filter
Let’s filter out the columns we need based on the argument. Based on what I created above we can now use (usethis) as the filter
select
Let’s narrow down the dataset to the needed columns
# Let's filter out columns not needed
subdata <- select(datain, c("State", "Hospital.Name", all_of(usethis)))
remove na
Many rows have Not Applicable as value, and all na fuctions failed to remove them, so I’ll just remove them manually
gooddata <- subset(subdata, subdata[3] != "Not Available")
coerce values
Remember that all the rate columns are of type “char”, so instead of coercing the entire dataset, I chose to just work with the column that’s needed based on the input to the function. Since we already created a subset of the 3 columns we extracted and named it gooddata we’ll use that and coerce column 3
gooddata[3] <- as.numeric(unlist(gooddata[3]))
group_by
See group_by section above.
rank
arrange
So far, we’ve
- Loaded the data
- Inspected the data
- Validate arguments coming into the function
- Removed “Not Available” from rows
- Filtered the data to the “outcome” specified by the arguments
- Coerced the outcome column from char to numeric
- So now let’s group the data into states and sort in order
- Within each state we want sort mortality rate “outcome” in ascending order
- We want to create a ranking for each row within that group/state
- Basically, we want to know which the rank for each hospital within its state
- If two hospitals are tied with the same rate, we need to choose the one that occurs first alphabetically
Let’s group the data by state, count the occurrences of observations in each state
groupdata <- gooddata |> group_by(gooddata$State) |> mutate(found = n())
Arrange within each group/state the rate (column 2) and if a tie exists, we need to choose the one that occurs first alphabetically
rankedgroups <- groupdata |> arrange((groupdata[3]), groupdata[2], .by_group = TRUE) |> mutate(rank = row_number())
Here is what I did in the code above:
- arranged first by [3] which is the rate, so now it’ll all be sorted in ascending order with the best(lowest value) on top
- then we arrange it by [2] which is Hospital.Name so in case of a tie the one that occured first will have the higher rank
- create a new column where we assign the rank of each hospital within that state
Conditions
If you remember above, I tested for the arguments but not all of them. I didn’t test for “best”, “worst” and how to extract that data. Let’s look at the code first and then I’ll explain
if (num == "best") {num = 1} if (num == "worst") {num = rankedgroups$found} #filter rankedgroups to show rows with rank = input rankedout <- subset(rankedgroups, rankedgroups$rank == num)
If you remember from before:
- I had created a column labeled found in the first group_by function. That value is the most rows that State had for that outcome, which would make it the max, in other words it’s the “worst” rank.
- So if the user inputs “worst” we assign num = rankedgroups$found
- We create a df with only the ranked value num so we use subset() with the condition rankedgroups$rank == num
extract group by rank
So now let’s create that df I mentioned a line up. We want to take the num argument and filter out the data to only show the hospitals that meet that rank. For example, if we want the top 5 hospitals for that outcome in each state we set num=5
rankedout <- subset(rankedgroups, rankedgroups$rank == num)
group list
Remember earlier we used several ways to extract a list of the groups/states and the count of occurrences/observations for each? Here we use it again
count_list <- datain |> count(State) |> rename( found = n)
merge lists
Now that we have a list of all the states (count_list) and a list of all the filtered, ranked Hospital Names for each State (rankedout) we can merge them together
- Note: I’ll use count_list on the left because that’s the list of ALL states in the dataset
- One of the conditions was to display any state that has missing values as NA
- We don’t need all the columns from (rankedout) df so we’ll just select the two we want
almostlist <- merge(count_list, rankedout[,c("State", "Hospital.Name")], by= "State", all.x = TRUE)
drop column
count_list had a column (foun) that is no longer needed so we can drop it
almostlist <- subset(almostlist, select= -c(found))
ENTIRE FUNCTION
Rankall
I’ve removed all the comments since we covered everything in detail up above.
# select() is in tidyverse so load it library(tidyverse) library(dplyr) #_______________Let's copy the function from before, that reads our datain getfile <- function(file){ fileDir <- setwd("~/CompareCare/") wantedFile = file.path(fileDir,file,fsep="/") return(read.csv(wantedFile)) } datain <- getfile("outcome-of-care-measures.csv") #________________START OF THE FUNCTION_________________________________ rankall <- function(outcome, num = "best") { if (!((outcome == "heart attack") | (outcome == "heart failure") | (outcome == "pneumonia"))) {stop("invalid outcome")} if (outcome == "heart attack") {usethis = 11} if (outcome == "heart failure") {usethis = 17} if (outcome == "pneumonia") {usethis = 23} subdata <- select(datain, c("State", "Hospital.Name", all_of(usethis))) gooddata <- subset(subdata, subdata[3] != "Not Available") gooddata[3] <- as.numeric(unlist(gooddata[3])) groupdata <- gooddata |> group_by(gooddata$State) |> mutate(found = n()) rankedgroups <- groupdata |> arrange((groupdata[3]), groupdata[2], .by_group = TRUE) |> mutate(rank = row_number()) if (num == "best") {num = 1} if (num == "worst") {num = rankedgroups$found} rankedout <- subset(rankedgroups, rankedgroups$rank == num) count_list <- datain |> count(State) |> rename( found = n) almostlist <- merge(count_list, rankedout[,c("State", "Hospital.Name")], by= "State", all.x = TRUE) almostlist <- subset(almostlist, select= -c(found)) }
Sample Output
outlist <- rankall("pneumonia",20) head(outlist,10) State Hospital 1 AK <NA> 2 AL CHILTON MEDICAL CENTER 3 AR BAPTIST HEALTH MEDICAL CENTER HEBER SPINGS 4 AZ SCOTTSDALE HEALTHCARE-SHEA MEDICAL CENTER 5 CA FOUNTAIN VALLEY REGIONAL HOSPITAL & MEDICAL CENTER 6 CO VALLEY VIEW HOSPITAL ASSOCIATION 7 CT MIDSTATE MEDICAL CENTER 8 DC <NA> 9 DE <NA> 10 FL KENDALL REGIONAL MEDICAL CENTER
outlist <- rankall("pneumonia","worst") tail(outlist,3) State Hospital 52 WI MAYO CLINIC HEALTH SYSTEM - NORTHLAND, INC 53 WV PLATEAU MEDICAL CENTER 54 WY NORTH BIG HORN HOSPITAL DISTRICT
outlist <- rankall("heart failure") tail(outlist,10) State Hospital 45 TN WELLMONT HAWKINS COUNTY MEMORIAL HOSPITAL 46 TX FORT DUNCAN MEDICAL CENTER 47 UT VA SALT LAKE CITY HEALTHCARE - GEORGE E. WAHLEN VA MEDICAL CENTER 48 VA SENTARA POTOMAC HOSPITAL 49 VI GOV JUAN F LUIS HOSPITAL & MEDICAL CTR 50 VT SPRINGFIELD HOSPITAL 51 WA HARBORVIEW MEDICAL CENTER 52 WI AURORA ST LUKES MEDICAL CENTER 53 WV FAIRMONT GENERAL HOSPITAL 54 WY CHEYENNE VA MEDICAL CENTER
r <- rankall("heart attack", 4) as.character(subset(r, State == "HI")$Hospital) #[1] "CASTLE MEDICAL CENTER"
r2 <- rankall("pneumonia", "worst") as.character(subset(r2, State == "NJ")$Hospital) #[1] "BERGEN REGIONAL MEDICAL CENTER"
r3 <- rankall("heart failure", 10)
as.character(subset(r3, State == "NV")$Hospital)
[1] "RENOWN SOUTH MEADOWS MEDICAL CENTER"