Stack Overflow Tag Prediction
The data is large: a CSV of about 6 GB with around 6,000,000 rows. We therefore stored the data from the dataframe in a SQLite database and read and queried it directly from the database.

The problem statement is to predict all applicable Tags for StackOverflow questions. The training data includes the Tags for each question, while the test data has only ID, Title and Body (the Tags are what we have to predict).

* Example Data Point.

This is a multi-label classification problem, because the number of Tags differs from question to question. For example, one question could have 3 tags and another could have 5.

* Data Cleaning:
1. We removed duplicate records (records with the same Title, Body and Tags). Around 30% of the records were duplicates and were deleted.

* EDA and Analysis:
1. Count the Tags for each question.
2. Plot the frequency of Tag occurrence for the top 100 Tags.

* Then we calculated the number of questions per Tag count.

Preprocessing:
First Approach: In this dataset we have title, q...
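The storage, de-duplication and tag-count steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It uses only the Python standard library (`csv` and `sqlite3`) rather than the dataframe route mentioned above. The table names `questions` and `questions_dedup`, the function names, and the tiny in-memory sample are all hypothetical, and the sketch assumes that the `Tags` column holds space-separated tags.

```python
import csv
import io
import sqlite3
from collections import Counter


def load_csv_into_sqlite(csv_file, conn, batch_size=10_000):
    """Stream rows from a CSV file object into SQLite in batches,
    so the whole file never has to fit in memory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions "
        "(Id INTEGER, Title TEXT, Body TEXT, Tags TEXT)"
    )
    batch = []
    for row in csv.DictReader(csv_file):
        batch.append((row["Id"], row["Title"], row["Body"], row["Tags"]))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO questions VALUES (?, ?, ?, ?)", batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO questions VALUES (?, ?, ?, ?)", batch)
    conn.commit()


def deduplicate(conn):
    """Build a table with one row per distinct (Title, Body, Tags)."""
    conn.execute("DROP TABLE IF EXISTS questions_dedup")
    conn.execute(
        "CREATE TABLE questions_dedup AS "
        "SELECT Title, Body, Tags, COUNT(*) AS dup_count "
        "FROM questions GROUP BY Title, Body, Tags"
    )
    conn.commit()


def tag_statistics(conn):
    """Return (list of tag counts per question, Counter of questions per tag).

    Assumes tags are separated by whitespace inside the Tags column.
    """
    tags_per_question = []
    tag_frequency = Counter()
    for (tags,) in conn.execute("SELECT Tags FROM questions_dedup"):
        tag_list = (tags or "").split()
        tags_per_question.append(len(tag_list))
        tag_frequency.update(tag_list)
    return tags_per_question, tag_frequency


# Hypothetical three-row sample; rows 1 and 2 are duplicates of each other.
sample_csv = io.StringIO(
    "Id,Title,Body,Tags\n"
    "1,How to sort a list?,Body A,python list sorting\n"
    "2,How to sort a list?,Body A,python list sorting\n"
    "3,What is a pointer?,Body B,c pointers\n"
)
conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(sample_csv, conn)
deduplicate(conn)

tags_per_question, tag_frequency = tag_statistics(conn)
top_tags = tag_frequency.most_common(100)            # top 100 tags by frequency
questions_per_tag_count = Counter(tags_per_question)  # questions having k tags
```

For the real file, `csv_file` would be an open handle to the CSV on disk and `conn` a connection to a database file instead of `:memory:`. Doing the de-duplication inside SQLite with `GROUP BY` keeps the heavy work out of Python memory, which is the main reason to prefer a database for data of this size.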