Stack Overflow Tag Prediction

Our data is large (a 6 GB CSV with around 6,000,000 rows), so we stored the dataframe in a SQLite database and read and queried the data from the database itself.
The problem statement is to predict all possible tags for Stack Overflow questions.
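
A minimal sketch of how a CSV of this size could be pushed into SQLite without loading it all into memory, assuming pandas is used for the chunked read. The function name, the table name `data` and the chunk size are illustrative choices, not taken from the original notebook.

```python
import sqlite3

import pandas as pd


def csv_to_sqlite(csv_path, db_path, table="data", chunksize=180_000):
    """Append a large CSV to a SQLite table one chunk at a time.

    `table` and `chunksize` are hypothetical defaults chosen for illustration.
    Returns the number of rows written.
    """
    rows_written = 0
    conn = sqlite3.connect(db_path)
    try:
        # read_csv with chunksize yields DataFrames instead of one big frame.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table, conn, if_exists="append", index=False)
            rows_written += len(chunk)
        conn.commit()
    finally:
        conn.close()
    return rows_written
```

Once the rows are in the database, later steps can run SQL queries instead of holding the whole dataframe in RAM.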


The train data contains ID, Title, Body and Tags,
and the test data has only ID, Title and Body (the Tags are what we have to predict).

* Example Data Point.



This is a multi-label classification problem, because the number of tags differs from question to question.
For example, one question could have 3 tags and another could have 5.

* Data Cleaning:
1. We removed the duplicate records (rows with the same Title, Body and Tags). Around 30% of the records were duplicates and were deleted.
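
Since the data lives in SQLite, the duplicate removal can be sketched as a single GROUP BY query. This is only an illustration: the table names `data` and `no_dup_train` and the column `cnt_dup` are assumptions, and the toy rows are made up.

```python
import sqlite3


def deduplicate(conn, src="data", dst="no_dup_train"):
    """Create `dst` with one row per distinct (Title, Body, Tags) found in `src`.

    Table names are hypothetical. Returns (rows_before, rows_after).
    """
    conn.execute(f"DROP TABLE IF EXISTS {dst}")
    conn.execute(
        f"CREATE TABLE {dst} AS "
        f"SELECT Title, Body, Tags, COUNT(*) AS cnt_dup "
        f"FROM {src} GROUP BY Title, Body, Tags"
    )
    conn.commit()
    total = conn.execute(f"SELECT COUNT(*) FROM {src}").fetchone()[0]
    kept = conn.execute(f"SELECT COUNT(*) FROM {dst}").fetchone()[0]
    return total, kept


# Toy in-memory example with one duplicated question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Id INTEGER, Title TEXT, Body TEXT, Tags TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (1, "sort a list", "how do I sort", "python list"),
        (2, "sort a list", "how do I sort", "python list"),
        (3, "null pointer", "why is it null", "java"),
        (4, "linq query", "group by in linq", "c# linq"),
    ],
)
total, kept = deduplicate(conn)
print(f"{total - kept} duplicate rows removed out of {total}")
```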

* EDA and Analysis:
1. Count the number of tags for each question.
2. Then plot the frequency of occurrence of each tag.
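
The two counts above can be sketched with the standard library alone. This assumes each question's tags are stored as one space-separated string, which is an assumption about the data format; the sample tags are made up, and plotting is left out.

```python
from collections import Counter


def tag_statistics(tag_strings):
    """Return (tags per question, overall tag frequency).

    Assumes tags are space-separated within each string.
    """
    tags_per_question = [len(t.split()) for t in tag_strings]
    tag_freq = Counter(tag for t in tag_strings for tag in t.split())
    return tags_per_question, tag_freq


sample = ["c# .net", "java", "python pandas numpy", "java spring"]
per_question, freq = tag_statistics(sample)
print(per_question)          # number of tags on each question
print(freq.most_common(3))   # most frequent tags first
```

Sorting `freq` by count gives the series that a tag-frequency plot, such as the top 100 tags, would be drawn from.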



Top 100 Tags frequency.


* Then we counted how many questions have each number of tags.


Preprocessing:


First Approach: In this dataset we have the title, the question and the tags, and code is included only in the question. So here we separate the code from the question,
then combine title + question, say word = title + question.

Then we remove HTML tags and anything other than letters from word, tokenize it (by spaces), and remove stop words (except C).
Finally we stem the words (a stemmer converts each word into its root form).
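
A rough sketch of this cleaning pipeline using only regular expressions. Several things here are assumptions made for illustration: code is taken to live inside `<code>...</code>` blocks, the stop-word set is a tiny hand-written subset rather than a full list, and the stemmer is passed in as a callable (for example `SnowballStemmer("english").stem` from NLTK) with a do-nothing default.

```python
import re

# Tiny illustrative stop-word subset (assumption); a full list would be used in practice.
STOP_WORDS = {"a", "an", "am", "the", "is", "are", "in", "to", "of", "and", "how", "i", "it"}

CODE_RE = re.compile(r"<code>.*?</code>", flags=re.DOTALL)
HTML_TAG_RE = re.compile(r"<[^>]+>")


def split_code(body):
    """Return (body without code blocks, the code blocks joined together)."""
    code = " ".join(CODE_RE.findall(body))
    text = CODE_RE.sub(" ", body)
    return text, code


def preprocess(title, body, stem=lambda w: w):
    """Clean title + question text; `stem` is any word -> root callable."""
    text, _code = split_code(body)
    combined = f"{title} {text}"
    # Strip HTML tags first so tag names such as "p" do not survive as words.
    combined = HTML_TAG_RE.sub(" ", combined)
    combined = re.sub(r"[^A-Za-z]+", " ", combined).lower()
    # Drop stop words, but always keep "c" because it is a language name.
    tokens = [w for w in combined.split() if w == "c" or w not in STOP_WORDS]
    return " ".join(stem(w) for w in tokens)


cleaned = preprocess(
    "How to sort a list in C?",
    "<p>I am sorting the list.</p><pre><code>qsort(a, n);</code></pre>",
)
print(cleaned)
```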


Machine Learning Models: 

Since this is multi-label classification, we have to convert our tags into multi-label targets so that we can pass them as labels to the machine learning model.

To convert to multi-label form we use a binary CountVectorizer. The problem is that we have 18000 tags in total, and many of them are likely to appear on very few questions.

With 5500 tags we cover 99.04% of the questions,
so we select 5500 tags.
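
A sketch of the label conversion and the coverage check, assuming scikit-learn and space-separated tag strings. Coverage is read here as the fraction of questions that keep at least one tag after restricting to the k most frequent tags; that reading, the function names and the toy tags are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def tags_to_multilabel(tag_strings):
    """Turn space-separated tag strings into a sparse 0/1 indicator matrix."""
    vec = CountVectorizer(
        tokenizer=lambda s: s.split(),  # one token per tag
        binary=True,
        lowercase=False,
        token_pattern=None,  # unused when a tokenizer is given
    )
    y = vec.fit_transform(tag_strings)
    # Tag names ordered by their column index.
    names = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
    return y, names


def top_k_coverage(y, k):
    """Keep the k most frequent tags; return (y_k, column indices, coverage)."""
    freq = np.asarray(y.sum(axis=0)).ravel()
    top = np.argsort(freq)[::-1][:k]
    y_k = y[:, top]
    has_tag = np.asarray(y_k.sum(axis=1)).ravel() > 0
    return y_k, top, float(has_tag.mean())


tags = ["java spring", "python", "java", "haskell", "python java"]
y, names = tags_to_multilabel(tags)
y_k, top, coverage = top_k_coverage(y, k=2)
print(names[top], coverage)
```

Running `top_k_coverage` for increasing values of k shows where adding more tags stops improving coverage, which is how a cut-off like 5500 could be chosen.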

1. Split the data 80:20.
2. Convert the words (question + title) into vectors. We converted the words into TF-IDF features.
3. skmultilearn has its own classifiers for multi-label classification, but when we tried to use it we got a memory error; basically, it tries to convert the sparse data into dense.
Hence we used OneVsRestClassifier with SGDClassifier.

With this approach we got low accuracy, macro F1 and micro F1 scores.
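
The split, vectorization, training and scoring steps can be sketched end to end as follows, keeping both features and labels sparse throughout. Everything is left at scikit-learn defaults because the post does not state the hyperparameters used; the function name and the seed are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier


def train_and_score(texts, tag_strings, seed=0):
    """80:20 split, TF-IDF features, one SGD classifier per tag."""
    # Binary CountVectorizer turns tag strings into a sparse indicator matrix.
    y = CountVectorizer(
        tokenizer=str.split, binary=True, lowercase=False, token_pattern=None
    ).fit_transform(tag_strings)
    x_train, x_test, y_train, y_test = train_test_split(
        texts, y, test_size=0.2, random_state=seed
    )
    tfidf = TfidfVectorizer()
    x_train_vec = tfidf.fit_transform(x_train)  # fit on training text only
    x_test_vec = tfidf.transform(x_test)
    clf = OneVsRestClassifier(SGDClassifier(random_state=seed))
    clf.fit(x_train_vec, y_train)
    y_pred = clf.predict(x_test_vec)
    return {
        # Subset accuracy: every tag of a question must match exactly.
        "accuracy": accuracy_score(y_test, y_pred),
        "micro_f1": f1_score(y_test, y_pred, average="micro"),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }
```

Accuracy here is exact-match accuracy over the whole tag set, so it is a strict measure for multi-label problems. Micro F1 pools all tag decisions together, while macro F1 averages the per-tag scores and is therefore pulled down by rare tags.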


Second Approach: 

1. Separate the code from the questions.
2. Give the title three times the weight of the question.
3. word = title + title + title + question
4. Remove everything from word except A-Za-z0-9#+.\-.
5. Remove stop words from word (except c) and apply the stemmer.
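
The steps above can be sketched as follows. The title is repeated three times so its words count more in the TF-IDF features, and the character filter keeps `#`, `+`, `.` and `-` so that tokens like `c#` and `c++` survive. The `<code>` handling, the HTML stripping, the tiny stop-word set and the pluggable `stem` callable are assumptions for illustration.

```python
import re

# Tiny illustrative stop-word subset (assumption), not a full list.
STOP_WORDS = {"a", "an", "the", "is", "which", "to", "in"}

CODE_RE = re.compile(r"<code>.*?</code>", flags=re.DOTALL)
HTML_TAG_RE = re.compile(r"<[^>]+>")
KEEP_RE = re.compile(r"[^A-Za-z0-9#+.\-]+")  # everything outside the kept set


def preprocess_weighted(title, body, title_weight=3, stem=lambda w: w):
    """Build word = title * title_weight + question, then clean it."""
    text = CODE_RE.sub(" ", body)  # drop code blocks from the question
    combined = " ".join([title] * title_weight + [text])
    combined = HTML_TAG_RE.sub(" ", combined)
    combined = KEEP_RE.sub(" ", combined).lower()
    # Drop stop words, but always keep "c".
    tokens = [w for w in combined.split() if w == "c" or w not in STOP_WORDS]
    return " ".join(stem(w) for w in tokens)


cleaned = preprocess_weighted("C# vs. C++", "<p>Which is faster?</p><code>x++</code>")
print(cleaned)
```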

 

This time we selected 500 tags.
6. Used OneVsRestClassifier with SGDClassifier and with logistic regression.
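
Swapping the base estimator inside OneVsRestClassifier is a small change, sketched below. The estimator settings are scikit-learn defaults apart from `max_iter`, not the values used in the original experiments, and `compare_estimators` is a hypothetical helper name.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier


def compare_estimators(x_train, y_train, x_test, y_test, seed=0):
    """Fit one-vs-rest models with two base estimators; return micro F1 for each."""
    candidates = {
        "sgd": SGDClassifier(random_state=seed),
        "logreg": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, estimator in candidates.items():
        clf = OneVsRestClassifier(estimator).fit(x_train, y_train)
        scores[name] = f1_score(y_test, clf.predict(x_test), average="micro")
    return scores
```

The features can be the sparse TF-IDF matrix and the labels the sparse tag indicator matrix; both estimators accept sparse input.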

SGD Classifier:


Logistic Regression:


