Linear Regression Project

For this project we will be doing the Bike Sharing Demand Kaggle challenge!

Get the Data

You can download the data or just use the supplied csv in the repository. The data has the following features:

  1. datetime - hourly date + timestamp
  2. season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  3. holiday - whether the day is considered a holiday
  4. workingday - whether the day is neither a weekend nor holiday
  5. weather -
    1. Clear, Few clouds, Partly cloudy
    2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  6. temp - temperature in Celsius
  7. atemp - “feels like” temperature in Celsius
  8. humidity - relative humidity
  9. windspeed - wind speed
  10. casual - number of non-registered user rentals initiated
  11. registered - number of registered user rentals initiated
  12. count - number of total rentals

Read in bikeshare.csv file and set it to a dataframe called bike.

rm(list = ls()) 

cat("\014")  # ctrl+L
bike = read.csv("bikeshare.csv")

Check the head of the bike data frame.

head(bike)
##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count
## 1    0.0000      3         13    16
## 2    0.0000      8         32    40
## 3    0.0000      5         27    32
## 4    0.0000      3         10    13
## 5    0.0000      0          1     1
## 6    6.0032      0          1     1

Exploratory Data Analysis

Create a scatter plot of count vs temp.

library(ggplot2)
ggplot(bike, aes(x= temp, y=count, color = temp)) + geom_point(alpha = 1, size=2)

Plot count versus datetime as a scatterplot with a color gradient based on temperature. We need to convert the datetime column into POSIXct before plotting.

bike$datetime <- as.POSIXct(bike$datetime)

ggplot(bike, aes(x= datetime, y=count, color = temp)) + geom_point(alpha = 1, size=2)

We notice two things: the data are seasonal, with clear differences between winter and summer, and bike rental counts are increasing overall. Both may be a problem for a linear regression model, because the relationship with time is non-linear. Let’s have a quick overview of the pros and cons of linear regression:

Pros:

  • Simple to explain
  • Highly interpretable
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don’t need scaling
  • Can perform well with a small number of observations
  • Well-understood
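
The point about scaling can be checked directly. The sketch below uses R’s built-in cars dataset as a stand-in (not the bike data) and an arbitrary unit conversion, both chosen purely for illustration: rescaling a feature changes its coefficient but leaves the fitted values unchanged.

```r
# Illustration on the built-in `cars` data (speed in mph, dist in ft),
# used here only as a stand-in for the bike data.
fit_mph <- lm(dist ~ speed, data = cars)

# Rescale the feature (mph -> km/h) and refit.
cars_kmh <- transform(cars, speed = speed * 1.609344)
fit_kmh <- lm(dist ~ speed, data = cars_kmh)

# The slope shrinks by exactly the conversion factor ...
coef(fit_mph)["speed"] / coef(fit_kmh)["speed"]

# ... but the fitted values are the same.
all.equal(fitted(fit_mph), fitted(fit_kmh))
```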

Cons:

  • Assumes a linear relationship between the features and the response
  • Performance is (generally) not competitive with the best supervised learning methods due to high bias
  • Can’t automatically learn feature interactions

We’ll keep this in mind as we continue. Once we learn more algorithms we can come back to this problem with some new tools; for now we’ll stick to linear regression.

What is the correlation between temp and count?

cor(bike[,c('temp','count')])
##            temp     count
## temp  1.0000000 0.3944536
## count 0.3944536 1.0000000

Let’s explore the season data. Create a boxplot with count on the y axis and one box per season on the x axis.

ggplot(bike,aes(factor(season),count,color=factor(season) )) + geom_boxplot() + theme_bw()

Notice what this says:

  1. A line can’t capture a non-linear relationship.
  2. There are more rentals in winter than in spring.

This pattern reflects the overall growth in rental counts over time; it isn’t due to the season itself!
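
One way to check this reading is to split the averages by year as well as by season. The sketch below is only an illustration: the toy data frame and its values are made up, and the year column is a helper introduced here. With the real data, the same aggregate() call could be run on bike after the datetime conversion.

```r
# Made-up toy data, standing in for `bike`; values are illustrative only.
toy <- data.frame(
  datetime = as.POSIXct(c("2011-01-15 08:00:00", "2011-11-15 08:00:00",
                          "2012-01-15 08:00:00", "2012-11-15 08:00:00"),
                        tz = "UTC"),
  season = c(1, 4, 1, 4),
  count  = c(10, 40, 30, 90)
)

# Helper column (an assumption of this sketch): the calendar year.
toy$year <- format(toy$datetime, "%Y")

# Mean count for each year/season combination.
by_year_season <- aggregate(count ~ year + season, data = toy, FUN = mean)
by_year_season
```

If the gap between seasons persists within each year, the season itself matters; if most of the gap sits between years, growth is the more likely explanation.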

Feature Engineering

We often need domain knowledge and experience to engineer new features. Let’s engineer some new features from the datetime column. Create an “hour” column that holds the hour taken from the datetime column. We will probably need to apply a function to the entire datetime column and assign the result to the new column.

bike$hour <- sapply(bike$datetime,function(x){format(x,"%H")})
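
format() is already vectorised over POSIXct values, so the hour can also be extracted without sapply(). A minimal sketch on a small made-up vector (the tz = "UTC" argument is an assumption added here so that the result does not depend on the local time zone):

```r
# Two made-up timestamps, standing in for bike$datetime.
x <- as.POSIXct(c("2011-01-01 00:00:00", "2011-01-01 13:30:00"), tz = "UTC")

hour_chr <- format(x, "%H")       # character vector: "00" "13"
hour_num <- as.numeric(hour_chr)  # numeric vector: 0 13
```

Note that format() returns character values, so the hour column stays a character column until it is converted.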
Now create a scatterplot of count versus hour, with color scale based on temp. Only use bike data where workingday==1.

Optional Additions:

Use the additional layer scale_color_gradientn(colors=c('color1','color2',etc..)), where the colors argument is a vector of colors you choose for the gradient, not just a high and a low color. Use position=position_jitter(w=1, h=0) inside geom_point() and check out what it does.

library(dplyr)
# Working days only; jitter spreads the points horizontally within each hour
fig1 <- ggplot(filter(bike, workingday == 1), aes(x = hour, y = count, color = temp)) + geom_point(position = position_jitter(w = 1, h = 0))
fig1 <- fig1 + scale_color_gradientn(colors = c('blue', 'red', 'green', 'orange', 'yellow'))
fig1

Now create the same plot for non-working days:

fig1 <- ggplot(filter(bike, workingday == 0), aes(x = hour, y = count, color = temp)) + geom_point(position = position_jitter(w = 1, h = 0))
fig1 <- fig1 + scale_color_gradientn(colors = c('blue', 'red', 'green', 'orange', 'yellow'))
fig1

Building the Model

Use lm() to build a model that predicts count based solely on the temp feature, and name it temp.model.

#?lm
temp.model <- lm(count~temp,bike)
Get the summary of temp.model.
summary(temp.model)
## 
## Call:
## lm(formula = count ~ temp, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.32 -112.36  -33.36   78.98  741.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0462     4.4394   1.362    0.173    
## temp          9.1705     0.2048  44.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1555 
## F-statistic:  2006 on 1 and 10884 DF,  p-value: < 2.2e-16
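
With a single predictor, the Multiple R-squared equals the square of the correlation between the predictor and the response: here 0.3944536^2 is approximately 0.1556, which matches the summary above. The sketch below checks that identity on R’s built-in cars dataset, used only as a stand-in for the bike data.

```r
# Simple regression on the built-in `cars` data (a stand-in, not the bike data).
fit <- lm(dist ~ speed, data = cars)

# Pearson correlation between predictor and response.
r <- cor(cars$speed, cars$dist)

# For one predictor, r^2 and the model's R-squared agree.
all.equal(r^2, summary(fit)$r.squared)
```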
How many bike rentals would we predict if the temperature was 25 degrees Celsius? Calculate this two ways:
  1. Using the values we just got above
  2. Using the predict() function
# Method 1: by hand, using the intercept (beta0 = 6.0462) and the temp coefficient (beta1 = 9.1705)

 6.0462 + (9.1705*25)
## [1] 235.3087
# Method 2: using predict() on the fitted model

predict(temp.model, data.frame(temp=c(25)))
##        1 
## 235.3097
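
The two answers differ slightly (235.3087 versus 235.3097), most likely because method 1 uses the coefficients as rounded in the summary printout while predict() uses them at full precision. A sketch of the by-hand calculation with unrounded coefficients taken from coef(); the built-in cars dataset and the value 15 are stand-ins chosen for illustration.

```r
# Stand-in model on the built-in `cars` data (not the bike data).
fit <- lm(dist ~ speed, data = cars)

# Full-precision coefficients, named "(Intercept)" and "speed".
b <- coef(fit)

# By hand: beta0 + beta1 * x, for an arbitrary x = 15.
by_hand <- unname(b["(Intercept)"] + b["speed"] * 15)

# The same prediction through predict().
by_predict <- unname(predict(fit, data.frame(speed = 15)))

all.equal(by_hand, by_predict)
```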
Use sapply() and as.numeric() to change the hour column to a column of numeric values.
bike$hour <- sapply(bike$hour, as.numeric)
Finally, build a model that attempts to predict count from the following features. Figure out whether there’s a way to avoid writing all of these variables into the lm() call. Hint: StackOverflow or Google may be quicker than the documentation.
  • season
  • holiday
  • workingday
  • weather
  • temp
  • humidity
  • windspeed
  • hour (numeric, as converted above)
model_2  <- lm(count ~ . -casual - registered -datetime -atemp,bike )
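
In the formula above, . stands for every column of the data frame other than the response, and - drops a term. A minimal sketch on a few columns of the built-in mtcars data (a stand-in chosen for illustration), showing that the shorthand gives the same fit as the written-out formula:

```r
# Small stand-in data frame: response mpg plus three candidate features.
d <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# "." expands to wt + hp + qsec; "- qsec" then removes qsec.
fit_dot <- lm(mpg ~ . - qsec, data = d)

# The same model with the features written out.
fit_explicit <- lm(mpg ~ wt + hp, data = d)

all.equal(coef(fit_dot), coef(fit_explicit))
```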
Get the summary of the model
summary(model_2)
## 
## Call:
## lm(formula = count ~ . - casual - registered - datetime - atemp, 
##     data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -324.61  -96.88  -31.01   55.27  688.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.91369    8.45147   5.551 2.91e-08 ***
## season       21.70333    1.35409  16.028  < 2e-16 ***
## holiday     -10.29914    8.79069  -1.172    0.241    
## workingday   -0.71781    3.14463  -0.228    0.819    
## weather      -3.20909    2.49731  -1.285    0.199    
## temp          7.01953    0.19135  36.684  < 2e-16 ***
## humidity     -2.21174    0.09083 -24.349  < 2e-16 ***
## windspeed     0.20271    0.18639   1.088    0.277    
## hour          7.61283    0.21688  35.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3339 
## F-statistic:   683 on 8 and 10877 DF,  p-value: < 2.2e-16

A linear model fitted with OLS, like the one we chose, can’t account for the seasonality in our data. It also gets thrown off by the growth in the dataset, attributing it to the winter season instead of recognizing that overall demand is simply growing. We should have expected this sort of model to struggle with seasonal, time-series data. We need a model that can account for this type of trend; later on, we’ll see whether other models are a better fit. Read about Regression Forests for more info if you’re interested!