Taxi demand prediction
* Problem statement: We want to predict the demand for taxis in each cluster over the next 10 minutes.
* Solution: The data is huge, so we used the dask library (dask divides large data into smaller chunks, processes them, and later combines the results).
['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
'passenger_count', 'trip_distance', 'pickup_longitude',
'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
'improvement_surcharge', 'total_amount']
These are the columns available in the data.
* Metrics: Mean absolute percentage error (MAPE) and mean squared error (MSE).
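The two metrics can be sketched in plain Python as follows. This is a minimal illustration using the textbook definitions, not necessarily the exact variant used in the project; the function names and sample numbers are made up. The textbook MAPE divides by each actual value, so it assumes no actual value is zero.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, returned as a fraction.

    Assumes every actual value is non-zero.
    """
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical pickup counts for three bins.
actual = [100, 200, 50]
predicted = [110, 180, 50]

mse = mean_squared_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print(f"MSE = {mse:.2f}, MAPE = {mape:.2%}")
```

If some bins have zero pickups, one common workaround is to divide the mean absolute error by the mean of the actual values instead.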
* Data cleaning:
1. We removed the records whose coordinates fall outside New York City and kept only the data inside the city.
2. To remove the outliers of each individual feature, we drew a box plot and then looked at the feature's percentile values. Where the value jumps sharply from one percentile to the next, we remove the records above the jump.
Example:
99.0 percentile value is 18.17
99.1 percentile value is 18.37
99.2 percentile value is 18.6
99.3 percentile value is 18.83
99.4 percentile value is 19.13
99.5 percentile value is 19.5
99.6 percentile value is 19.96
99.7 percentile value is 20.5
99.8 percentile value is 21.22
99.9 percentile value is 22.57
100 percentile value is 258.9
Here the values rise gradually up to the 99.9th percentile (22.57) and then jump to 258.9 at the 100th, so valid values should be less than 23. We remove values greater than 23.
We removed outliers from trip_times, trip_distance, Speed, and total_amount in this way.
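The percentile check described above can be sketched like this. It is a minimal illustration on made-up numbers; the helper names `percentile_value` and `remove_above` are hypothetical, and the percentile is read by simple index lookup without interpolation, which is an assumption.

```python
def percentile_value(sorted_values, pct):
    """Value at the given percentile using simple index lookup (no interpolation)."""
    idx = min(int(len(sorted_values) * pct / 100), len(sorted_values) - 1)
    return sorted_values[idx]


def remove_above(values, threshold):
    """Keep only values strictly below the chosen threshold."""
    return [v for v in values if v < threshold]


# Hypothetical sample: 999 ordinary values plus one extreme outlier.
values = [i * 0.02 for i in range(1, 1000)] + [258.9]
ordered = sorted(values)

# Print the 99.0 to 100 percentiles in steps of 0.1 to spot the jump.
for tenth in range(0, 11):
    pct = 99 + tenth / 10
    print(f"{pct:.1f} percentile value is {percentile_value(ordered, pct):.2f}")

# After inspecting the printout, cut at a threshold just above the last sane value.
cleaned = remove_above(values, 23)
```

The threshold is chosen by eye from the printed percentiles, so it has to be picked separately for each feature.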
* Next we chose the number of clusters (i.e. how many clusters to divide the whole of New York City into):
To decide on the number of clusters, for each candidate value we calculated the average number of clusters at an inter-cluster distance of less than 2 miles, the average number at more than 2 miles, and the minimum inter-cluster distance.
Example: on choosing 50 clusters
Avg. number of clusters within the vicinity (i.e. inter-cluster distance < 2): 12.0
Avg. number of clusters outside the vicinity (i.e. inter-cluster distance > 2): 38.0
Min inter-cluster distance = 0.36495419250817024
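A rough sketch of how such vicinity statistics could be computed from a list of cluster centers is shown below. The centers, the haversine distance, the Earth radius constant, and the function names are all assumptions for illustration; the clustering step that produces the centers is not shown, and each center is excluded from its own counts, which may differ from the original calculation.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, assumed for the distance formula


def haversine_miles(a, b):
    """Great-circle distance in miles between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def vicinity_stats(centers, radius_miles=2.0):
    """Return (avg. centers within radius, avg. centers outside, min distance)."""
    inside = outside = 0
    min_dist = float("inf")
    for i, a in enumerate(centers):
        for j, b in enumerate(centers):
            if i == j:
                continue  # a center is not its own neighbour
            d = haversine_miles(a, b)
            min_dist = min(min_dist, d)
            if d < radius_miles:
                inside += 1
            else:
                outside += 1
    n = len(centers)
    return inside / n, outside / n, min_dist


# Hypothetical cluster centers: two close together, one far away.
centers = [(40.75, -73.99), (40.76, -73.98), (40.65, -73.78)]
avg_inside, avg_outside, min_dist = vicinity_stats(centers)
print(avg_inside, avg_outside, round(min_dist, 3))
```

Running this for several candidate cluster counts gives the kind of numbers quoted in the example, which can then be compared.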
* Time binning:
We converted each pickup time to a Unix timestamp so that it is easy to bin. We decided to split every month into 10-minute bins.
Example: 1420070400 : 2015-01-01 00:00:00
Pickups from 2015-01-01 00:00:00 up to (but not including) 2015-01-01 00:10:00 fall in the 1st bin.
Pickups from 2015-01-01 00:10:00 up to (but not including) 2015-01-01 00:20:00 fall in the 2nd bin, and so on.
If a month has 30 days, it has (30*24*60)/10 = 4320 bins, each 10 minutes long.
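The binning arithmetic can be sketched with the standard library alone. The helper names `to_unix` and `pickup_bin` are hypothetical, timestamps are assumed to be in UTC (which matches the 1420070400 example), and bins are numbered from 0, so the "1st bin" in the text is index 0.

```python
from datetime import datetime, timezone

BIN_SECONDS = 10 * 60  # one bin covers 10 minutes


def to_unix(dt_string):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return the Unix timestamp."""
    dt = datetime.strptime(dt_string, "%Y-%m-%d %H:%M:%S")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


def pickup_bin(unix_time, month_start_unix):
    """Zero-based index of the 10-minute bin, counted from the start of the month."""
    return (unix_time - month_start_unix) // BIN_SECONDS


month_start = to_unix("2015-01-01 00:00:00")

for stamp in ("2015-01-01 00:05:00", "2015-01-01 00:10:00", "2015-01-01 00:19:59"):
    print(stamp, "-> bin", pickup_bin(to_unix(stamp), month_start))

bins_in_30_days = (30 * 24 * 60) // 10
```

Integer division by 600 seconds puts a pickup at exactly 00:10:00 into the second bin, so the bins never overlap.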
If we group the data by cluster and pickup bin and count the pickups, we get the number of pickups in each cluster for each bin.
Some clusters have no pickups in some bins. (We calculated how many by subtracting the number of bins present in the data from the number of possible bins.)
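The group-by and the missing-bin count can be illustrated on a toy example. The data and variable names below are made up, and `collections.Counter` stands in for the dataframe group-by.

```python
from collections import Counter

# Hypothetical (cluster_id, bin_index) pairs, one per pickup.
pickups = [(0, 0), (0, 0), (0, 2), (1, 1), (1, 1), (1, 1)]
n_bins = 4  # number of possible bins per cluster in this toy example

# Equivalent of a group-by on (cluster, bin) followed by a count.
pickup_counts = Counter(pickups)

# Bins that actually appear in the data, per cluster.
bins_present = {}
for cluster, bin_index in pickup_counts:
    bins_present.setdefault(cluster, set()).add(bin_index)

# Missing bins = possible bins - bins present in the data.
missing_bins = {c: n_bins - len(b) for c, b in bins_present.items()}
print(missing_bins)
```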
The result is the number of missing bins in each cluster. We have to handle these missing bins either by filling in an average or by filling in zero.
We applied smoothing to the jan_2015 data and filled zeros for the Jan, Feb, and March 2016 data.
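The two ways of handling missing bins might look like the sketch below. The zero fill is straightforward; the smoothing shown here is only one plausible scheme (sharing the counts of the neighbouring bins evenly across each run of missing bins), and it is an assumption rather than the exact procedure used in the project. Function names are hypothetical.

```python
def fill_zero(counts_by_bin, n_bins):
    """Dense list of counts where bins absent from the data become 0."""
    return [counts_by_bin.get(b, 0) for b in range(n_bins)]


def smooth_missing(counts_by_bin, n_bins):
    """Share neighbouring counts evenly across each run of missing bins.

    The total number of pickups is preserved.
    """
    values = [counts_by_bin.get(b) for b in range(n_bins)]  # None marks a missing bin
    b = 0
    while b < n_bins:
        if values[b] is not None:
            b += 1
            continue
        start = b
        while b < n_bins and values[b] is None:
            b += 1
        # values[start:b] is a run of missing bins; include one known
        # neighbour on each side when it exists.
        lo = start - 1 if start > 0 else start
        hi = b if b < n_bins else b - 1
        total = sum(v for v in values[lo:hi + 1] if v is not None)
        share = total / (hi - lo + 1)
        for k in range(lo, hi + 1):
            values[k] = share
    return values


# Hypothetical counts for one cluster: bins 1, 2 and 4 are missing.
counts = {0: 10, 3: 20}
zero_filled = fill_zero(counts, 5)
smoothed = smooth_missing(counts, 5)
print(zero_filled)
print(smoothed)
```

Zero fill keeps the observed counts untouched, while smoothing trades exact counts for a series without abrupt gaps.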
* For feature engineering, we also computed the fast Fourier transform (FFT).
The Fourier transform converts the time series into a frequency series.
Example:
The plot shows the pickups over a month, with the bins on the x-axis and the number of pickups on the y-axis.
If we zoom in, we can see that the distance between the start peak and the end peak is around 144 bins (144 bins means 24 hours: (144*10)/60 = 24 hours), which means pickups drop at night.
So the FFT has peaks at 0, 1/144, and 1/288.
A wave that is centered at zero has no peak at frequency zero, but our wave is not centered at zero, so it does have a peak at zero.
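The peaks at 0 and 1/144 can be reproduced on a synthetic signal. The sketch below uses a naive discrete Fourier transform written with the standard library so that it is self-contained; `numpy.fft.fft` computes the same coefficients much faster. The signal (a daily sine wave on top of a mean of 100) is invented for illustration and only contains the daily cycle, so it does not show any other peaks.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier transform coefficient (naive O(n^2))."""
    n = len(signal)
    magnitudes = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        magnitudes.append(abs(coeff))
    return magnitudes


BINS_PER_DAY = 144       # 24 hours of 10-minute bins
n = 2 * BINS_PER_DAY     # two synthetic days

# Hypothetical pickup counts: a daily cycle riding on a positive mean of 100.
pickups = [100 + 40 * math.sin(2 * math.pi * t / BINS_PER_DAY) for t in range(n)]

magnitudes = dft_magnitudes(pickups)
freqs = [k / n for k in range(n)]  # frequency of coefficient k, in cycles per bin

# Only the first half is inspected; for real input the second half mirrors it.
half = n // 2
top_two = sorted(range(half), key=lambda k: magnitudes[k], reverse=True)[:2]
print([freqs[k] for k in top_two])

# Centering the wave at zero removes the peak at frequency 0.
mean = sum(pickups) / n
centered_magnitudes = dft_magnitudes([x - mean for x in pickups])
print(centered_magnitudes[0], centered_magnitudes[2])
```

The largest coefficient sits at frequency 0 because the signal has a non-zero mean, and the next one sits at 1/144 cycles per bin, which is the daily cycle.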