Taxi Demand Prediction II

After cleaning and preprocessing, we have data for January 2015 and for January, February, and March 2016.

We will first use some baseline models to predict the number of pickups in the next 10 minutes.

1. Simple moving average of ratio:
2. Moving average of previous known values of the 2016 data itself, used to predict the future values:

Rt = Pt(2016) / Pt(2015)

where Pt(2016) and Pt(2015) are the pickups in 10-minute bin t of 2016 and 2015 respectively.

The first model is the moving average model, which uses the previous n values to predict the next value.

To predict the pickups in the next 10 minutes, we average the ratios of the previous n 10-minute bins, where n is a hyperparameter.
The ratio uses both the 2015 data and the 2016 data.
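
The ratio model above can be sketched roughly as follows. This is only a minimal sketch, not the exact code used for this post: the function name and the toy numbers are made up, and it assumes that the averaged ratio is multiplied by the 2015 value of the same bin to get the 2016 prediction, and that bins with zero 2015 pickups are simply skipped when collecting ratios.

```python
def moving_average_ratio_predictions(pickups_2015, pickups_2016, n=3):
    """Predict 2016 pickups per bin from a simple moving average of past ratios."""
    predictions = []
    ratios = []  # observed Rt = Pt(2016) / Pt(2015) for the bins seen so far
    for p15, p16 in zip(pickups_2015, pickups_2016):
        window = ratios[-n:]  # the previous n ratios (fewer at the start)
        if window:
            predicted_ratio = sum(window) / len(window)
            predictions.append(predicted_ratio * p15)
        else:
            predictions.append(None)  # no history yet for the first bin
        # Assumption: a bin with zero 2015 pickups has no defined ratio, so skip it.
        if p15 > 0:
            ratios.append(p16 / p15)
    return predictions


# Toy data, not taken from the real dataset.
preds = moving_average_ratio_predictions([10, 20, 10, 40], [20, 40, 30, 60], n=2)
print(preds)
```

The window size n would be tuned by trying a few values and keeping the one with the lowest error.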
Next is the moving average of only the 2016 values.


Here also n is a hyperparameter. We observed that n = 1, i.e. using just the previous value, gives the minimum error.
For the previous known values, we use only the 2016 data.
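
A minimal sketch of this second baseline, assuming the prediction for a bin is the plain average of the previous n 2016 values (the function name and the toy numbers are hypothetical):

```python
def moving_average_predictions(pickups, n=1):
    """Predict each bin as the mean of the previous n observed values."""
    predictions = []
    for t in range(len(pickups)):
        window = pickups[max(0, t - n):t]  # up to n values before bin t
        predictions.append(sum(window) / len(window) if window else None)
    return predictions


# Toy data, not taken from the real dataset.
print(moving_average_predictions([10, 20, 30, 40], n=1))
print(moving_average_predictions([10, 20, 30, 40], n=2))
```

With n = 1 this reduces to "the next value equals the last value", which is the setting that gave the minimum error here.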

3. Weighted moving average of ratio:


Here N, the number of previous values used, is the hyperparameter.
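
The formula is not shown here, so the following is a sketch under an assumption: the most recent value gets weight N, the one before it N-1, and so on down to 1, and the weighted sum is divided by the sum of the weights. The function name is made up. The same function can be fed either the ratio series or the raw 2016 pickups.

```python
def weighted_moving_average(history, N=3):
    """Weighted average of the last N values, with linearly decreasing weights."""
    window = history[-N:]
    if not window:
        return None
    # Newest value first, paired with weights N, N-1, ...
    weights = range(N, N - len(window), -1)
    weighted_sum = sum(w * v for w, v in zip(weights, reversed(window)))
    return weighted_sum / sum(weights)


# Toy data: (3*18 + 2*12 + 1*6) / (3 + 2 + 1)
print(weighted_moving_average([6, 12, 18], N=3))
```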

4. Weighted moving average of previous values:



5. Exponential weighted moving average of ratio and previous value:





Here Pt-1 is the previous value of jan_2015 and P't-1 is the previous value of jan_2016.

Here alpha is between 0 and 1.

In the weighted moving average and the exponential weighted moving average, we give more weight to the latest points.

Here the exponential moving average of previous values works well.
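
As a rough sketch, the exponential moving average can be written as a one-line recursive update. The exact formula is not reproduced here, so this assumes the common form new_estimate = alpha * old_estimate + (1 - alpha) * latest_actual, seeded with the first observed value. The function and variable names are hypothetical.

```python
def exponential_moving_average_predictions(values, alpha=0.3):
    """Predict each bin from an exponentially weighted average of earlier bins."""
    predictions = []
    smoothed = None  # running estimate built from the values seen so far
    for actual in values:
        predictions.append(smoothed)  # prediction made before seeing `actual`
        if smoothed is None:
            smoothed = float(actual)  # assumption: seed with the first value
        else:
            smoothed = alpha * smoothed + (1 - alpha) * actual
    return predictions


# Toy data, not taken from the real dataset.
print(exponential_moving_average_predictions([10, 20, 40, 0], alpha=0.5))
```

Under this convention a smaller alpha puts more weight on the latest point, and alpha = 0 falls back to using only the previous value.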


----------------------------------------------------------------
 
So far we have used baseline models; now we'll use machine learning models.
For this we have to split the data into train and test sets. Our data is temporal, so we can't split it randomly; we have to split it on the basis of time. The first 70% of the data goes into train and the remaining 30% into test.
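
A time-based split just cuts the series at one point instead of shuffling. A minimal sketch, assuming the rows are already sorted by time (the function name is made up):

```python
def temporal_train_test_split(rows, train_fraction=0.7):
    """Split time-ordered rows into an earlier train part and a later test part."""
    split_index = round(len(rows) * train_fraction)
    return rows[:split_index], rows[split_index:]


# Toy example: ten time-ordered bins.
train, test = temporal_train_test_split(list(range(10)))
print(train, test)
```

Every test row is later in time than every train row, so the model is never evaluated on bins that come before its training data.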

We don't have a separate cross-validation set, so we'll use k-fold cross-validation on the train data.
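
To make the idea concrete, here is a small sketch of how k-fold index sets can be generated. It is not the code used in this post (a library routine would normally do this); the function name is hypothetical and the folds are kept contiguous rather than shuffled.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, validation_indices) pairs over contiguous folds."""
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, val_idx))
        start += size
    return folds


for train_idx, val_idx in k_fold_indices(6, k=3):
    print(train_idx, val_idx)
```

Each row is used for validation exactly once, and the hyperparameters with the best average validation error are kept.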



As features we take the last 5 values of the 10-minute pickup bins, the latitude, the longitude, and the weekday.
We saw above that the exponential moving average works well, so we also add it as a feature.
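
The feature rows can be put together roughly like this for one region. This is a sketch under assumptions: the function name, the argument layout, and the lag order (oldest lag first, with ft_1 last) are made up for illustration, and the exponential average uses the common recursive form with a hypothetical alpha.

```python
def build_features(pickups, lat, lon, weekdays, alpha=0.3, n_lags=5):
    """Build one feature row per bin: n_lags lags, lat, lon, weekday, exp. average."""
    rows, targets = [], []
    smoothed = float(pickups[0])  # exponential average of the bins before t
    for t in range(1, len(pickups)):
        if t >= n_lags:
            # Lags ordered oldest to newest, so the last one is the t-1 value (ft_1).
            lags = [pickups[t - k] for k in range(n_lags, 0, -1)]
            rows.append(lags + [lat, lon, weekdays[t], smoothed])
            targets.append(pickups[t])
        # Update only after the row is built, so the feature never sees pickups[t].
        smoothed = alpha * smoothed + (1 - alpha) * pickups[t]
    return rows, targets


# Toy data for a single region, not taken from the real dataset.
X, y = build_features([1, 2, 3, 4, 5, 6, 7], 40.7, -74.0, [0] * 7, alpha=0.5)
print(X[0], y[0])
```

The first n_lags bins are dropped because they do not have a full set of previous values.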

Here we used random forest, XGBoost, and linear regression.

The following is the feature importance of the random forest.


The most important feature is the exponential average, followed by the t-1 value, i.e. ft_1.

We got around 12% MAPE.
Random forest is overfitting slightly, but not by much. XGBoost gives the lowest MAPE.
RMSE is not easy to interpret, but MAPE is, because it is a percentage error.
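
For reference, both metrics can be computed in a few lines. This sketch assumes the usual per-point definition of MAPE (the mean of |actual - predicted| / |actual|, in percent) and skips bins whose actual value is zero; other variants exist, for example dividing the total absolute error by the total of the actual values, so treat the exact definition as an assumption.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; bins with actual == 0 are skipped."""
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(terms) / len(terms)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the pickups."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


# Toy data: both predictions are off by 10% of the actual value.
print(mape([100, 200], [90, 220]))
print(rmse([100, 200], [90, 220]))
```

In the toy example the MAPE reads directly as "off by about 10% on average", while the RMSE only makes sense once you know the typical number of pickups per bin.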


