Search Engine Based on StackOverflow Questions

1.1 Description

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.

Stack Overflow is something that every programmer uses in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML.

Problem Statement

Build a search engine based on StackOverflow questions whose search results capture the semantic meaning of the query.

1.2 Source / useful links

All of the data is in one zip file: stackoverflow.com-Posts.7z
Inside the zip file there is a single XML file: Posts.xml

Note: we are not using the whole dataset for this case study, due to limited resources and time constraints.

-> The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).

Data Field Explanation

The dataset contains 112,357 rows. The columns in the table are:

Id="1"
PostTypeId="1"
AcceptedAnswerId="3"
CreationDate="2016-08-02T15:39:14.947"
Score="8"
ViewCount="436"
Body="<p>What does “backprop” mean? Is the “backprop” term basically the same as “backpropagation” or does it have a different meaning?</p>"
OwnerUserId="8"
LastEditorUserId="2444"
LastEditDate="2019-11-16T17:56:22.093"
LastActivityDate="2019-11-16T17:56:22.093"
Title="What is “backprop”?"
Tags="<neural-networks><backpropagation><terminology><definitions>"
AnswerCount="3"
CommentCount="0"
FavoriteCount="1"

Sample extracted record (Id: "2214") with its Title and Body fields.

Cosine similarity: cosine similarity measures the similarity between two vectors as the cosine of the angle between them. It is calculated as cos(θ) = (A · B) / (||A|| × ||B||).
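As a quick sketch, here is how that formula looks in Python with NumPy (the two example vectors are made up):

```python
import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_sim(np.array([1, 2, 3]), np.array([2, 4, 6]))   # 1.0 -> same direction
```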

I am using the 'cliget' extension to get the data from archive.org through the command line in my Google Colab notebook.

After that, I extract the archive using a Python library called patoolib.
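A minimal sketch of that extraction step (the archive name is the one from the dump above; the output directory is an assumption):

```python
import patoolib

# extract stackoverflow.com-Posts.7z into the current directory
patoolib.extract_archive("stackoverflow.com-Posts.7z", outdir=".")
```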

Now that we have extracted the archive, there is a single XML file inside it: 'Posts.xml'. We will parse that file and extract the desired data from it.

To do that, we will iterate through each row in the XML file; a sample row is shown in section 2.1.2.1 above.

We will check the 'PostTypeId' attribute: 1 means a question post and 2 means an answer post. We want only questions, so we filter on it and extract the Title and Body from each matching row.

After this, we copy the data into a DataFrame under the columns title and body, and save it as a .csv file for further use.
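Here is a rough sketch of that parsing step, assuming Posts.xml sits in the working directory and streaming it with ElementTree so the whole file never has to fit in memory:

```python
import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for event, elem in ET.iterparse("Posts.xml", events=("end",)):
    if elem.tag == "row":
        # PostTypeId == 1 -> question post, 2 -> answer post; keep only questions
        if elem.attrib.get("PostTypeId") == "1":
            rows.append({"title": elem.attrib.get("Title", ""),
                         "body": elem.attrib.get("Body", "")})
        elem.clear()   # free memory for rows we have already processed

df = pd.DataFrame(rows, columns=["title", "body"])
df.to_csv("questions.csv", index=False)
```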

For this case study we will be using 112,357 data points.

Now, if there are any duplicate rows, we will remove them using the code below.
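A one-line sketch of that step:

```python
# drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep="first")
```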

This drops duplicates and keeps the first occurrence of each duplicate row.

Here df is the name of the DataFrame that holds our data.

Minimum length of the question : 20
Maximum length of the question : 34183

We will draw a word cloud of the most common words in our dataset. To do that, we first remove the stopwords (words that occur too often, e.g. is, an, the, a) from the corpus and then plot the word cloud.
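A small sketch of how such a word cloud can be drawn with the wordcloud library (here `questions` is assumed to be the list of question strings, and the built-in STOPWORDS set handles the stop-word removal):

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

text = " ".join(questions)
wc = WordCloud(width=800, height=400, stopwords=STOPWORDS).generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```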

Word Cloud for Question

Preprocessing is basically cleaning the input corpus. The main aim of preprocessing is to remove redundant data and make the data suitable for machine learning models. The main steps are:

1. Separate out code-snippets from Body:

In our input sentences the text also contains code, which is not needed for now. The code sections sit between <code> … </code> tags, and we will remove them using the regular expression library in Python.
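A minimal sketch of that step (the helper name remove_code is my own):

```python
import re

def remove_code(text):
    # strip everything between <code> ... </code> tags, across newlines
    return re.sub(r"<code>.*?</code>", " ", text, flags=re.DOTALL)
```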

2. Remove Special characters from Question title and description (not in code):

We will remove all the special characters and numbers and only keep words.

3. Remove stop words (Except ‘C’):

We will remove all the stopwords (stop words are words which occur very often, like: is, am, the, there, etc.). We will keep the word 'c' because it is a programming language.
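A sketch of that step with NLTK's English stop-word list (run nltk.download('stopwords') once; the helper name is my own):

```python
from nltk.corpus import stopwords

# drop 'c' from the stop-word set (if present) so the language name survives
stop_words = set(stopwords.words("english")) - {"c"}

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in stop_words)
```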

4. Remove HTML Tags

In this section we will remove all the HTML components from the data.
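Here is a hedged sketch of such a function, following the description below (parse with lxml and html5lib, then clean up whatever survives); the exact function used in the project may differ:

```python
from bs4 import BeautifulSoup
import re

def strip_html(text):
    text = BeautifulSoup(text, "lxml").get_text()       # lxml catches XML-style tags
    text = BeautifulSoup(text, "html5lib").get_text()   # html5lib catches HTML ones
    text = re.sub(r"<.*?>", " ", text)                  # anything still inside <> arrows
    text = text.replace("\n", " ")                      # new line characters
    text = text.replace("div", " ")                     # stray 'div' words left over
    return text
```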

In the above function we are using the BeautifulSoup library to parse the data with two parsers (lxml and html5lib): lxml will catch XML-style tags and html5lib will catch HTML ones.

After that we substitute away anything left inside the <> arrows, remove newline characters and remove stray 'div' words.

5. Convert all the characters into small letters:

Here we convert our text into lower case to remove redundancy, for example: New York → new york, I → i, etc.

6. Use SnowballStemmer to stem the words:

Stemming is used to get the root word by removing suffixes from the word. For example, "likes", "liked", "likely" and "liking" will all be converted to "like". SnowballStemmer is available in the 'nltk' library, a library which is popular for natural language processing tasks.
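For example:

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
[stemmer.stem(w) for w in ["likes", "liked", "likely", "liking"]]
# -> ['like', 'like', 'like', 'like']
```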

Now we will combine all the steps in a single function:
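A sketch of such a function, built from the helpers above (the exact function used in the project may differ):

```python
def preprocess(title, body):
    body = remove_code(body)                       # 1. separate out code snippets
    text = strip_html(title + " " + body)          # 4. remove HTML tags (title and body merged)
    text = re.sub(r"[^a-zA-Z]", " ", text)         # 2. keep only words, drop special chars/numbers
    text = text.lower()                            # 5. convert to lower case
    text = remove_stopwords(text)                  # 3. remove stop words (keeping 'c')
    return " ".join(stemmer.stem(w) for w in text.split())   # 6. Snowball stemming
```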

So in the above code snippet we are basically implementing the six steps described above.

First we remove the code, then the HTML tags; after that we merge the title and body columns, lower-case the text and finally perform stemming on it.

INPUT:
<p>I have an absolutely positioned <code>div</code> containing several children, one of which is a relatively positioned <code>div</code>. When I use a <code>percentage-based width</code> on the child <code>div</code>, it collapses to <code>0 width</code> on IE7, but not on Firefox or Safari.</p>

<p>If I use <code>pixel width</code>, it works. If the parent is relatively positioned, the percentage width on the child works.</p>

<ol>
<li>Is there something I’m missing here?</li>
<li>Is there an easy fix for this besides the <code>pixel-based width</code> on the
child?</li>
<li>Is there an area of the CSS specification that covers this?</li>
</ol>

CLEAN OUTPUT:

"percentag width child element absolut posit parent internet explor absolut posit contain sever children one relat posit use child collaps ie firefox safari use work parent relat posit percentag width child work someth miss easi fix besid child area css specif cover"

Now that we have the preprocessed data, we will store it in a CSV file so we can use it directly later.

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing we must do is come up with a strategy to convert strings to numbers (to "vectorize" the text) before feeding it to the model.

The task our models solve is vector-to-vector similarity: each model produces vectors for the corpus, we encode the query sentence as a vector (of a different size/dimension for each model), and we compute the cosine similarity between the query vector and the corpus vectors to find the most related/similar results.

Word vectors are vectors/arrays that tell the model about a particular word and how it relates to other words in the sentence; they are vector representations of a word/sentence. There are many ways of encoding words into vectors, such as bag-of-words (assign each word a unique number; any document can then be encoded as a fixed-length vector with the length of the vocabulary of known words, where the value at each position is the count or frequency of that word in the document) and tf-idf (term frequency-inverse document frequency). But these cannot capture the semantic meaning of a sentence, because they are simple and do not take the order of words into account.

In our approach we will use word/sentence embeddings, which capture semantic meaning. Semantic meaning means the model can learn relationships: for example, queen-king and man-woman will be close together, and verb tenses like walking-walked and swimming-swam will also be near each other because they are semantically very similar. These vectors are of fixed size; the higher the dimensionality, the more information-rich the word vectors/embeddings will be.

Note:

For all our models we will judge performance on the single query sentence "what is superclass in object orient programming?".

We will split our data into train and test sets for our machine learning models.
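A minimal sketch with scikit-learn (the 80/20 split and random_state here are assumptions):

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```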

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by Devlin et al. at Google AI Language in 2018.

This model has been pre-trained for English on the Wikipedia and Books Corpus using the code published on GitHub. Inputs have been “uncased”, meaning that the text has been lower-cased before tokenization into word pieces, and any accent markers have been stripped. For training, random input masking has been applied independently to word pieces (as in the original BERT paper).

For this assignment we will be using BERT from TensorFlow:

BERT is basically a part of the Transformer model. The Transformer is an encoder-decoder model; in BERT we use only the encoder part of the Transformer.

For this assignment we are using BERT base, which has 12 identical encoders stacked on top of each other. Let's see what is inside each encoder unit.

single encoder unit

As we can see, there are two parts inside this encoder unit.

First, the feed-forward neural network, which is fully connected with 512 units. It consists of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, xW1 + b1)W2 + b2

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d = 512, and the inner-layer has dimensionality d’ = 2048.

Second is the self-attention layer, which plays an important role in attending to the words that matter and helps the model understand the context better. For example, take the sentence:

'The animal didn't cross the street because it was too tired'

Here, what does "it" refer to? The animal or the street?

It is simple for us humans to associate "it" with the animal, but for the model to understand this it has to implement self-attention, which allows it to associate "it" with "animal".

So let's dive into the self-attention part now:

attention step 1

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the pre-trained embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by these three matrices that we trained during the training process.

These vectors have a fixed size of 64, as in the original paper.

Looking at the image above, we have the input "Thinking Machines".

Let's consider the first word, "thinking". First, this input is converted into an embedding of fixed length. Then we generate the three vectors mentioned above. Next we calculate a score by multiplying the query vector with each of the key vectors (the key vectors of all the words in the sentence, before or after). The result of each multiplication is divided by the square root of the dimension of the key vectors, which in our case is 8 because the key dimension is 64. We then pass these numbers through a softmax function to squash them into scores between 0 and 1. Then we multiply these values (the softmax results) by the corresponding value vectors and finally sum them up to produce the z1 vector for the word.

Similarly, we calculate this z vector for every word in the sentence.
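To make the computation concrete, here is a small NumPy sketch of scaled dot-product attention (the toy Q, K, V matrices are random placeholders, not real BERT weights):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of query / key / value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query . key for every pair, scaled by sqrt(d_k)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                        # weighted sum of value vectors -> one z per word

# toy example: 2 tokens ("thinking", "machines"), d_k = 4
np.random.seed(0)
Q, K, V = (np.random.rand(2, 4) for _ in range(3))
Z = scaled_dot_product_attention(Q, K, V)     # Z[0] is z1 for "thinking", Z[1] is z2 for "machines"
```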

The paper further refined the model by adding multiple attention heads to expand its representational power. In our case we are using 12 attention heads, so we will have 12 key, value and query matrices for each word.

The feed-forward layer is not expecting twelve matrices; it's expecting a single matrix (a vector for each word). So we need a way to condense these 12 down into a single matrix.

So we concatenate them and multiply by another weight matrix to get a single matrix as output.

So the summary looks like this.

attention summary

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottom of the encoder.

The input to the model is therefore not just the embedding but the sum of two matrices: the word embedding and the positional embedding. To give the model a sense of the order of the words, we add positional encoding vectors, the values of which follow a specific pattern.

We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension 512. The figure below shows the full working of a single encoder block.

If you want to learn more about the Transformer and BERT, refer to these excellent blogs (the explanation I have given above is inspired by them):

Now that we have learned all about BERT, let's implement it.
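Here is a minimal sketch of the model preparation, assuming the pre-trained bert_en_uncased_L-12_H-768_A-12 module from TensorFlow Hub and a maximum sequence length of 512 (both assumptions):

```python
import tensorflow as tf
import tensorflow_hub as hub

max_len = 512  # assumed maximum sequence length

input_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask     = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
segment_ids    = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
    trainable=False)

pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# pooled_output: one 768-dimensional vector per input sentence
model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids],
                       outputs=pooled_output)
```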

In the above piece of code we have prepared the model, which takes 3 inputs: the token ids, the mask and the segment ids.

This BERT model is pre-trained, so we don't need to train it; we directly predict the vector representation of the sentences by passing the sentence tokens, sentence mask and sentence segment to the model using model.predict.

After we get the 768-dimensional vector representation of every question, we compare it with the query vector and return the most similar sentences as follows.
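A hedged sketch of that search routine; tok is assumed to be the tokenization helper that returns the token ids, mask and segment arrays for a sentence, and corpus_vectors / questions are assumed to hold the pre-computed 768-dimensional BERT vectors and the corresponding question strings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search(query, top_k=5):
    tokens  = np.zeros((1, max_len))
    mask    = np.zeros((1, max_len))
    segment = np.zeros((1, max_len))
    tokens[0], mask[0], segment[0] = tok(query)             # fill the three arrays for the query

    query_vector = model.predict([tokens, mask, segment])    # shape (1, 768)
    scores = cosine_similarity(query_vector, corpus_vectors)[0]
    top_indices = scores.argsort()[::-1][:top_k]              # highest similarity first
    return [questions[i] for i in top_indices]
```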

So what's happening in the above code: first we initialize the three arrays with zeros, then we fill in the values by calling the 'tok' helper to get the tokens, mask and segment. After that we pass these to the BERT model to predict the vector of the query sentence. Then we compute the cosine similarity between the predicted vector and all the vectors in our corpus, and return the top 5 or 10 results as a list of strings.

prediction1: Custom view transition in OpenGL ES I’m trying to create a custom transition, to serve as a replacement for a default transition you would get here, for example:I have prepared an OpenGL-based view that performs an effect on some static texture mapped to a plane.

prediction2 : Dynamic contact information data/design pattern: Is this in any way feasible?I’m currently working on a web business application that has many entities (people,organizations) with lots of contact information ie. multiple postal addresses, email addresses, phone numbers etc. At the moment the database schema is such that persons table has |

prediction3: Databinding with Silverlight If I want to bind a collection to a some form of listing control in Silverlight. Is the only way to do it so make the underlying objects in the collection implement INotifyPropertyChanged and for the collection to be an Observablecollection? If I was using some sort of |

prediction5: passing void to a generic class I’m trying to create a form that will animate something while processing a particular task (passed as a delegate to the constructor). It’s working fine, but the problem I’m having is that I can’t instantiate a copy of my generic class if the particular method |

For this case study we will be using gensim's Doc2Vec model. This model is very similar to the Word2Vec model, which we will discuss in the next section; in fact, Doc2Vec is an extension of the Word2Vec model.

doc2vec architecture

This framework is similar to the CBOW architecture shown in word2vec section , the only change is the additional paragraph token that is mapped to a vector via matrix D. In this model, the concatenation or average of this vector with a context of three words is used to predict the fourth word. The paragraph vector represents the missing information from the current context and can act as a memory of the topic of the paragraph.

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context — or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM). The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for “powerful” is the same for all paragraphs.

PV-DM is the same as the CBOW algorithm, which is covered in detail in the word2vec section.

Matrix D has the embeddings for "seen" paragraphs (i.e. arbitrary-length documents), the same way Word2Vec learns embeddings for words. For unseen paragraphs, the model is again run through gradient descent (5 or so iterations) to infer a document vector.

The first step for this model is to make the data suitable for gensim's Doc2Vec implementation: the model takes a list of string tokens and a unique tag as input.

So the tokens here are the words of a single document, and the tag is simply the index of that document in the corpus.
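A sketch of that preparation step, adapted from the gensim Doc2Vec tutorial (train_questions and test_questions are assumed to be lists of preprocessed question strings):

```python
import gensim

def read_corpus(documents, tokens_only=False):
    for i, doc in enumerate(documents):
        tokens = gensim.utils.simple_preprocess(doc)
        if tokens_only:
            yield tokens                                             # test data: tokens only
        else:
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])  # train data: tokens + tag (line number)

train_corpus = list(read_corpus(train_questions))
test_corpus  = list(read_corpus(test_questions, tokens_only=True))
```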

The above function prepares the data for us as follows:

Otherwise, we tag each document. To do that we use the gensim.models.doc2vec.TaggedDocument() function, which takes the tokens and a unique tag as input and returns a tagged document. For us the tags are nothing but line numbers, so we pass them along with the tokens to get tagged documents.

For the training data we want tags, and for the test data we just want the list of tokens as input for prediction; that's why we use the two conditions.

Now that we have the data prepared, we will initialize the model as follows.
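Something like this (the vector_size and min_count values here are assumptions; only the 50 epochs are stated below):

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=50, min_count=2, epochs=50)
```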

See, it's just one line of code to define a model in gensim.

Now, let's understand each of the parameters.

Now that we have initialized our model, we will build the vocabulary of all the words in the training corpus:
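A one-liner in gensim:

```python
model.build_vocab(train_corpus)
```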

Essentially, the vocabulary is a dictionary (accessible via model.wv.vocab) of all of the unique words extracted from the training corpus along with the count (e.g., model.wv.vocab['penalty'].count for counts for the word penalty).

Now we will train our model.

We will train the model for 50 epochs.
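In gensim that is:

```python
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```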

Now we will find the most similar documents by providing a query sentence.
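A sketch of that query step (the query is preprocessed the same way as the corpus; note that in gensim 4.x model.docvecs becomes model.dv):

```python
query = "what is superclass in object orient programming"
query_tokens = gensim.utils.simple_preprocess(query)

query_vector = model.infer_vector(query_tokens)

# (document index, cosine similarity) tuples, most similar first
similar_docs = model.docvecs.most_similar([query_vector], topn=5)
```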

Now that we have our sentence in vector format, we will find the most similar vectors from the training corpus. To calculate the distance/similarity between vectors, gensim by default uses cosine similarity as the distance metric.

This returns a list of tuples, where the first value is the index of the document and the second is the cosine similarity between the query and that document.

Here are the top 5 results:

I know the results are disappointing :'-(

So basically this model failed to capture the crux of a document in a single vector.

First, let's understand what TF-IDF is.

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, so it tends to give more importance to the less frequent words in the corpus.

Typically, the tf-idf weight is composed of two terms: the term frequency (TF), the number of times a word appears in a document divided by the total number of words in that document, and the inverse document frequency (IDF), the logarithm of the number of documents in the corpus divided by the number of documents that contain the word.

See below for a simple example.

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf ) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf ) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.

Now let's come back to word2vec.

The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge datasets with billions of words, and with millions of words in the vocabulary.

The main focus was to develop word embeddings with multiple degrees of similarity, meaning that, beyond similar words ending up close to each other, the model is also able to capture more complex relationships between words. Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique, where simple algebraic operations are performed on the word vectors,

it was shown, for example, that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen.

We can see from the above example that the embeddings are learned so well that performing simple addition and subtraction on the vectors results in a vector closely related to the expected result of those operations.

In the paper, Mikolov proposed two models for word2vec:

word2vec models

As we can see from the figure above, the two models are mirror images of one another. Both models are trained using stochastic gradient descent and backpropagation.

Let's talk about CBOW first.

CBOW's main objective is to predict the focus word given the context words. For example, if we have a sentence like 'Raman is a good boy', all the words will first be one-hot encoded, where the size of the one-hot vector is the size of the vocabulary (the vocabulary is the collection/dictionary of all the words present in the training corpus). So our training dataset will first be constructed like this:

cbow dataset

But instead of the words, there will be the one-hot encodings of these words, as shown in the word2vec models figure.

As we can see, data replication is performed, so the effective corpus size is much larger and less data is needed to train word2vec.

As we can see above, this problem can be seen as a multiclass classification problem. So first we encode our words as one-hot vectors, which converts all our words into V-dimensional binary vectors, where V is the size of the vocabulary, i.e. the total number of words in our corpus. In each vector the index corresponding to that particular word is 1 and all the others are zeros.

Then we pass these V-dimensional vectors to the hidden layer of CBOW. This hidden layer is composed of N linear activation units, which pass the input through unchanged to the next layer; every input is connected to all N activation units of this layer. The formula is as follows: h = (1/C) * W^T (x1 + x2 + … + xC), where C is the number of context words.

Word2vec is not a deep learning algorithm precisely because of this linear activation function; in deep learning models the activation functions are non-linear so they can model complex relationships between vectors.

Here W is the weight matrix, initialized with random numbers drawn from a Gaussian distribution; these weights are learned/updated through backpropagation.

If you want to learn about forward and backward propagation, refer to this blog.

Fully connected means that every vector of one layer is connected to every unit of the next layer. I have drawn a single arrow just to keep the diagram readable; in reality each arrow connects to all the units of the next layer. For example, take our fully connected layer 1: its size is V×1, where V is the number of rows, and the next layer has N units, so all V inputs are connected to all N units of the hidden layer, each with a corresponding weight. These weights, as discussed, are updated through back-propagation. I have also shown a weight matrix in the diagram, which is of shape (N×V), where

w2v(w_i) = vec_j, where vec_j is an N-dimensional vector.

Then this hidden layer is connected to a softmax layer; the softmax gives a probability for each of the V outputs, and the probabilities sum up to 1.

The formula for the softmax function is:

softmax(z_j) = exp(z_j) / Σ_k exp(z_k)

Here j is the index of the output word, and the sum in the denominator runs over all V outputs, so the probabilities over the vocabulary sum to 1.

The output of this function gives us the probability of each predicted word; since they are probabilities, they all sum to one. The word with the highest probability is assigned 1 and the others 0. We do this using the argmax function, which gives us the index of the largest value; we set that index to 1 and the rest to zero, so the output is also V-dimensional, i.e. one-hot encoded.

Then we compare this output with the one-hot encoding of the correct word from our vocabulary and calculate the loss.

After that we backpropagate and update the weights accordingly.

This process is repeated until there is no major change in the weights; then we stop training the network and take the result, which is the weight matrix of fully connected layer 5, of shape (N×V). So for every word in the vocabulary V, we have a fixed N-dimensional vector.

Now let's see skip-gram.

It is exactly the opposite of the CBOW model: we try to predict the context words given the focus word. So at the output it has K softmax layers, which means it takes more time and is computationally more expensive.

We use skip-gram when we have a smaller dataset, because it works well in that setting and is also useful for predicting infrequent words, i.e. words which do not occur very often in the training corpus. In our case study we are using CBOW only, so I will not bore you with the skip-gram model details.

First of all, we will convert our questions column into a list of tokenized sentences to make it suitable for our word2vec model.

As you know, we are using tf-idf weighted word2vec; for encoding our words into tf-idf vectors we are using sklearn's TfidfVectorizer.
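A minimal sketch of that vectorization step (here `questions` is the list of preprocessed question strings; newer scikit-learn versions rename get_feature_names() to get_feature_names_out()):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(questions)   # shape: (number of rows, number of unique tokens)

# map every token / feature name to its idf value for later use
dictionary = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
```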

In the above piece of code we are using the tf-idf vectorizer; the variable tfidf contains a matrix of shape (number of rows, number of tokens/unique words).

For each word, it contains the tf-idf value of that word if it exists in the document, else zero, so the matrix is sparse.

In the dictionary variable we store all the tokens/feature names as keys and their corresponding idf values as values for later use.

Now we define our word2vec model. You can see it's just one line of code; the parameters are as follows:
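A sketch of that one line (sentences is the list of token lists built above; min_count and workers here are assumptions; sg=0 selects CBOW; in gensim 4.x the size parameter is called vector_size):

```python
from gensim.models import Word2Vec

w2v_model = Word2Vec(sentences, size=50, min_count=5, workers=4, sg=0)
```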

We are training our own word2vec model rather than using the one pre-trained on Google News, because it gave better results than the pre-trained model.

We chose a vector size of 50 because of the time constraint we have; with this size we were able to get results within 0.2 seconds, and the results were very similar to word2vec with vector sizes of 200 or 300.

Now that we have our tf-idf features and the word2vec model's output, we will combine them to generate tf-idf weighted word2vec vectors as follows.
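A sketch of that combination step: for every question we average its word2vec vectors, weighting each word by its tf-idf score (the 50-dimensional size matches the word2vec model above):

```python
import numpy as np

tfidf_w2v_vectors = []
for sentence in sentences:                      # each sentence is a list of tokens
    vector = np.zeros(50)                       # same size as the word2vec vectors
    weight_sum = 0
    for word in sentence:
        if word in w2v_model.wv and word in dictionary:
            vec = w2v_model.wv[word]
            # tf-idf weight = idf of the word * term frequency within this sentence
            tfidf = dictionary[word] * (sentence.count(word) / len(sentence))
            vector += vec * tfidf
            weight_sum += tfidf
    if weight_sum != 0:
        vector /= weight_sum
    tfidf_w2v_vectors.append(vector)
```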

The above code is self-explanatory; the explanation is written in the comments.

Here are the top 5 results:

“Query : what is superclass in object orient programming”

Here at the end, the results are good enough.

2. I would also like to try fastText.

3. I would like to try all of this with a larger data corpus. As of now I am considering only 1.2 lakh data points; I would like to try it with 5-6 lakh data points in the future and see the results.

I have deployed my search engine on AWS; have a look and search for results on your own:

StackOverflow Search Engine link:

THANK YOU FOR YOUR TIME :)
