Taxi Demand Prediction II
After cleaning and preprocessing, we have data for January 2015 and for January, February, and March 2016.
We will first use some baseline models to predict the number of pickups in the next 10 minutes.
1. Simple moving average of ratios:
2. Simple moving average of previous known 2016 values:
The ratio is defined as Rt = Pt(2016) / Pt(2015), where Pt is the number of pickups in 10-minute bin t.
The first model is the moving average model, which uses the previous n values to predict the next value.
To predict the pickups in the next 10 minutes, we average the ratios of the previous n 10-minute bins, where n is a hyperparameter.
The ratio uses both the 2015 data and the 2016 data.
The second model is a moving average of the 2016 values only.
Here too, n is a hyperparameter. We observed that n = 1, i.e. using just the previous value, gives the minimum error.
The previous-known-values model uses only the 2016 data.
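The two simple moving average baselines can be sketched roughly as below. This is a minimal illustration, not the exact code behind the results in this post: the function names are made up, the fallback used when there is no history yet is an assumption, and turning the predicted ratio back into a pickup count by multiplying with the 2015 value is also an assumption.

```python
def moving_average_forecast(values, n):
    """Predict each bin as the mean of the previous (up to) n observed values."""
    preds = []
    for t in range(len(values)):
        window = values[max(0, t - n):t]
        # Assumption: with no history yet, fall back to the observed value itself.
        preds.append(sum(window) / len(window) if window else values[t])
    return preds


def ratio_moving_average_forecast(p_2015, p_2016, n):
    """Moving average on Rt = Pt(2016) / Pt(2015), converted back to pickups."""
    # Assumption: empty 2015 bins are treated as 1 to avoid dividing by zero.
    base = [max(p, 1) for p in p_2015]
    ratios = [b / a for a, b in zip(base, p_2016)]
    predicted_ratios = moving_average_forecast(ratios, n)
    # Assumption: predicted 2016 pickups = predicted ratio * 2015 pickups.
    return [r * a for r, a in zip(predicted_ratios, base)]


if __name__ == "__main__":
    print(moving_average_forecast([10, 20, 30, 40], n=2))
    print(ratio_moving_average_forecast([10, 10, 10], [20, 20, 20], n=2))
```

With n = 1 the first function simply repeats the previous bin's value, which is the setting that gave the minimum error above.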
3. Weighted moving average of ratio:
4. Weighted moving average of previous values:
5. Exponential weighted moving average of ratio and previous value:
Here Pt-1 is the previous value from jan_2015 and P't-1 is the previous value from jan_2016.
Here alpha is between 0 and 1.
Both the weighted moving average and the exponential weighted moving average give more weight to the most recent points.
Among these, the exponential moving average of previous values works well.
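A rough sketch of the weighted and exponential variants on previous values follows. The exact formulas are not written out in the text, so this assumes linear weights 1..n (newest bin gets the largest weight) for the weighted average and the common recursive form prediction(t) = alpha * actual(t-1) + (1 - alpha) * prediction(t-1) for the exponential one. Function names are illustrative only.

```python
def weighted_moving_average_forecast(values, n):
    """Weighted mean of the previous (up to) n values; newer bins weigh more."""
    preds = []
    for t in range(len(values)):
        window = values[max(0, t - n):t]
        if not window:
            # Assumption: with no history yet, fall back to the observed value.
            preds.append(values[t])
            continue
        # Assumption: linear weights, oldest = 1 ... newest = len(window).
        weights = range(1, len(window) + 1)
        total = sum(w * v for w, v in zip(weights, window))
        preds.append(total / sum(weights))
    return preds


def exponential_moving_average_forecast(values, alpha):
    """Recursive forecast; alpha in (0, 1) controls how fast old values fade."""
    preds = [values[0]]  # assumption: seed with the first observed value
    for t in range(1, len(values)):
        preds.append(alpha * values[t - 1] + (1 - alpha) * preds[-1])
    return preds


if __name__ == "__main__":
    print(weighted_moving_average_forecast([10, 20, 30], n=2))
    print(exponential_moving_average_forecast([10, 20, 30], alpha=0.5))
```

The same two functions could be applied to the list of ratios instead of raw pickups to get the ratio versions of these baselines.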
----------------------------------------------------------------
So far we have used baseline models; now we will use machine learning models.
For this we have to split the data into train and test sets. Our data is temporal, so we cannot split it randomly; we have to split it by time, with the first 70% of the data going to train and the remaining 30% to test.
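A time-based split only needs the rows to be in chronological order. The sketch below assumes that ordering and uses a hypothetical helper name; it does not shuffle, so every test row comes after every train row.

```python
def time_based_split(rows, train_fraction=0.7):
    """Split chronologically ordered rows into an earlier and a later part."""
    # round() guards against float results such as 6.999999 for 10 * 0.7.
    cut = int(round(len(rows) * train_fraction))
    return rows[:cut], rows[cut:]


if __name__ == "__main__":
    bins = list(range(10))  # stand-in for ten consecutive 10-minute bins
    train, test = time_based_split(bins)
    print(train, test)
```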
We do not have a separate cross-validation set, so we use k-fold cross-validation.
As features we take the last 5 values of the 10-minute pickup bins, along with latitude, longitude, and weekday.
We saw above that the exponential moving average works well, so we add it as a feature.
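One way the feature rows could be assembled for a single region is sketched below. The helper name, the column order, and the alpha value are assumptions for illustration; only the feature set (5 lag values, latitude, longitude, weekday, exponential moving average) comes from the description above.

```python
def build_features(pickups, lat, lon, weekdays, n_lags=5, alpha=0.3):
    """Build (features, target) rows from one region's time-ordered bins.

    Each row is [ft_1, ..., ft_5, lat, lon, weekday, ema], where ft_1 is
    the pickup count of the immediately preceding bin.
    """
    ema = pickups[0]  # assumption: seed the EMA with the first observed value
    rows, targets = [], []
    for t in range(1, len(pickups)):
        # EMA forecast for bin t, built only from bins before t (no leakage).
        ema = alpha * pickups[t - 1] + (1 - alpha) * ema
        if t >= n_lags:
            lags = pickups[t - n_lags:t][::-1]  # newest first: ft_1 ... ft_5
            rows.append(lags + [lat, lon, weekdays[t], ema])
            targets.append(pickups[t])
    return rows, targets


if __name__ == "__main__":
    X, y = build_features([1, 2, 3, 4, 5, 6, 7], 40.7, -73.9, [0] * 7, alpha=0.5)
    print(X[0], y[0])
```

The first n_lags bins of each region produce no row because they do not yet have a full set of lag values.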
We trained random forest, XGBoost, and linear regression models.
The following is the feature importance from the random forest.
The most important feature is the exponential moving average, followed by the t-1 value, i.e. ft_1.
We got a MAPE of around 12%.
Random forest overfits slightly, but not by much. XGBoost gives the lowest MAPE.
RMSE is hard to interpret on its own, whereas MAPE is interpretable because it is a percentage error.
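To make the difference between the two metrics concrete, here is a small sketch using the textbook definitions. The post does not state exactly how its MAPE was computed, so this form is an assumption; it also assumes no bin has zero actual pickups, since the per-bin percentage is undefined there.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the pickups."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    actual = [100, 200]
    predicted = [90, 220]
    print(mape(actual, predicted), rmse(actual, predicted))
```

In this toy example both predictions are off by 10%, so MAPE reads directly as "10% error", while the RMSE value only makes sense once you know the typical pickup count per bin.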