Posts

Stack Overflow Tag Prediction

Our data is huge (a 6 GB CSV with around 6,000,000 rows), so we stored the data from the dataframe in a SQLite database and read and queried it from the database itself. The problem statement is to predict all possible tags for Stack Overflow questions. The training data also contains the Tags, while the test data has only ID, Title and Body (the Tags are what we have to predict). This is a multi-label classification problem because the number of tags differs from question to question: one question could have 3 tags and another could have 5.

* Data Cleaning: We removed the duplicate records (records with the same Title, Body and Tags). Around 30% of the records were deleted as duplicates.
* EDA and Analysis: 1. Count the tags for each question. 2. Plot the frequency of tag occurrence for the top 100 tags. Then we calculated the number of questions per tag count.

Preprocessing: First Approach: In this dataset we have title, q...
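The SQLite storage and duplicate-removal steps from this excerpt can be sketched as follows. This is only a minimal illustration, not the code used in the post: the table names `data` and `no_dup`, the column `cnt_dup`, the sample rows, and the assumption that tags are space-separated are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows (Id, Title, Body, Tags); the real data has ~6,000,000 rows.
rows = [
    (1, "How to sort a list?", "I have a list ...", "python list sorting"),
    (2, "How to sort a list?", "I have a list ...", "python list sorting"),  # duplicate of 1
    (3, "Segfault in C loop", "My loop crashes ...", "c pointers"),
    (4, "Join two tables", "How do I join ...", "sql join mysql sqlite database"),
]

conn = sqlite3.connect(":memory:")  # a file path would be used for the real data
conn.execute("CREATE TABLE data (Id INTEGER, Title TEXT, Body TEXT, Tags TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# Keep one row per (Title, Body, Tags) combination and record how often it occurred.
conn.execute(
    """
    CREATE TABLE no_dup AS
    SELECT Title, Body, Tags, COUNT(*) AS cnt_dup
    FROM data
    GROUP BY Title, Body, Tags
    """
)

total = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
unique = conn.execute("SELECT COUNT(*) FROM no_dup").fetchone()[0]
duplicate_share = 1 - unique / total  # fraction of rows removed as duplicates

# Number of tags per question, assuming tags are separated by spaces.
tag_counts = [
    len(tags.split())
    for (tags,) in conn.execute("SELECT Tags FROM no_dup ORDER BY Title")
]

print(total, unique, duplicate_share, tag_counts)
```

Doing the grouping inside the database keeps the full table out of memory, which is the reason given above for using SQLite in the first place.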

Taxi Demand Prediction II

Now, after cleaning and preprocessing, we have data for January 2015 and for January, February and March 2016. We will use some baseline models to predict the pickups in the next 10 minutes. The ratio-based models use Rt = Pt(2016) / Pt(2015).

1. Simple moving average of ratios: The first model is the moving average model, which uses the previous n values to predict the next value. To predict the pickups in the next 10 minutes, we average the ratios of the previous n 10-minute bins, where n is a hyperparameter. The ratio uses both the 2015 data and the 2016 data.
2. Using the previous known values of the 2016 data itself to predict the future values: This is the moving average of only the 2016 values. Here too, n is a hyperparameter. We observed that n=1, i.e. using just the previous value, gives the minimum error.
3. Weighted moving average of ratios: Here N is the hyperparameter.
4. Weighted moving average of previous values: 5. ...
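A rough sketch of these baselines in plain Python is given below. The function names and the sample pickup counts are hypothetical, and the linearly decreasing weights (newest value weighted most) are an assumption, since the excerpt does not state which weights were used.

```python
def ratios(p2016, p2015):
    """Rt = Pt(2016) / Pt(2015) for each 10-minute bin."""
    return [a / b for a, b in zip(p2016, p2015)]


def simple_moving_average(values, n):
    """Predict the next value as the mean of the last n values (n >= 1)."""
    window = values[-n:]
    return sum(window) / len(window)


def weighted_moving_average(values, n):
    """Predict the next value from the last n values (n >= 1).

    Assumed weighting: the oldest value in the window gets weight 1
    and the newest gets weight len(window).
    """
    window = values[-n:]
    weights = range(1, len(window) + 1)
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


# Hypothetical pickups per 10-minute bin for the same bins in both years.
p2015 = [100, 120, 80, 100]
p2016 = [110, 132, 96, 120]
next_p2015 = 110  # the 2015 pickups for the bin we want to predict in 2016

r = ratios(p2016, p2015)

# Ratio-based baselines: predict the next ratio, then scale the 2015 value.
sma_ratio_pred = simple_moving_average(r, n=2) * next_p2015
wma_ratio_pred = weighted_moving_average(r, n=3) * next_p2015

# Previous-value baselines: use only the 2016 series.
sma_prev_pred = simple_moving_average(p2016, n=1)
wma_prev_pred = weighted_moving_average(p2016, n=3)

print(sma_ratio_pred, wma_ratio_pred, sma_prev_pred, wma_prev_pred)
```

With n=1 the previous-value model simply repeats the last observed bin, which is the setting reported above as giving the minimum error.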

Taxi demand prediction

* Problem statement: We want to predict the demand for taxis in the next 10 minutes in each cluster.
* Solution: The data is huge, so we used the dask library (dask divides large data into smaller chunks and later combines them). The data has these columns: ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount'].
* Metrics: Mean absolute percentage error and mean squared error.
* Data cleaning: 1. We removed the records with coordinates outside New York City and kept only the data inside New York City. 2. t...
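The two metrics named above can be sketched in a few lines. This uses the textbook definitions and assumes every actual value is non-zero; the post may compute MAPE slightly differently, and the sample numbers are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, returned as a fraction.

    Assumes no actual value is zero; bins with zero pickups would need
    separate handling.
    """
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical pickups per 10-minute bin and the corresponding predictions.
actual = [100, 200, 50]
predicted = [110, 180, 50]

mse = mean_squared_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)

print(mse, mape)
```

MAPE is scale-free, so it lets us compare clusters with very different demand levels, while MSE penalises large misses more heavily.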