Sweet Summation Vector

A popular way to represent a text document in vector space is as the summation vector (resultant vector) of all the meaningful keywords that make up the document.

There are two ways to identify the keywords in the document text:

  1. Use all the words in the document text and, depending on their frequency, either promote them to keywords or drop them as noise (words with very high frequency are generally noise words).
  2. Maintain a keyword collection, as scientists generally do, and look up the document's words against it to identify the set of keywords that generated the document.
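The two selection approaches above can be sketched roughly as follows. This is only an illustration: the function names, the `max_share` threshold, and the simple regex tokenizer are assumptions made for the example, not part of any particular library or of the method described in this post.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_keywords(text, max_share=0.2):
    """Approach 1: treat words that make up more than max_share of all tokens as noise."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {word for word, count in counts.items() if count / total <= max_share}


def lookup_keywords(text, keyword_collection):
    """Approach 2: keep only the words found in a maintained keyword collection."""
    return {word for word in tokenize(text) if word in keyword_collection}


sample = "the cat and the dog and the bird"
by_frequency = frequency_keywords(sample)          # drops the frequent "the" and "and"
by_lookup = lookup_keywords(sample, {"cat", "fish"})  # keeps only known keywords
```

A share-based threshold is just one way to decide what counts as "high frequency"; a fixed stop-word list or a document-frequency measure over a corpus would serve the same purpose.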

Both methods have upsides and downsides. The trick is to have a method that selects only the contextually meaningful keywords from the document text.

Here is one method for enriching the document vector, assuming that the document content is homogeneous (it covers only a few, similar semantic contexts):

  1. Generate the summation vector (resultant vector) from all the chosen keywords.
  2. Correlate each chosen keyword individually against the resultant vector, looking out for keywords that show negative or very low correlation.
  3. Set the correlation cutoff to 0.2 (when cosine similarity is 0.2, the angle between the word vector and the resultant vector is about 78 degrees!).
  4. Remove the words that fall below the cutoff (threshold) from the selected set of keywords.
  5. Generate the resultant vector again from the keywords that remain after this filtering.
  6. Repeat steps 2 to 5, appending each iteration's rejected words to a master rejection set and keeping the rest as accepted keywords.
  7. Stop iterating when no more keywords are rejected.
  8. The final summation vector is the enriched resultant vector, which models the document more closely than the one we started with.
  9. Optionally, correlate the accepted and rejected keywords against the enriched resultant vector again to see the effect: the scores of accepted items rise and the scores of rejected items fall (positive correlations become more positive and negative correlations more negative over the course of the iterations).
  10. The final set of accepted items can be taken as the actual set of keywords that generate the document.
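The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names (`resultant`, `cosine`, `enrich`) and the two-dimensional toy vectors are made up for the example, and cosine similarity is used as the correlation score, as the cutoff step suggests. Real word vectors would come from whatever embedding model you use.

```python
import math


def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def enrich(word_vectors, cutoff=0.2):
    """Repeatedly drop keywords whose similarity to the resultant is below cutoff."""
    accepted = dict(word_vectors)
    rejected = {}  # master rejection set, filled across iterations
    while accepted:
        summed = resultant(list(accepted.values()))
        dropped = [w for w, v in accepted.items() if cosine(v, summed) < cutoff]
        if not dropped:  # no more rejections: stop iterating
            break
        for word in dropped:
            rejected[word] = accepted.pop(word)
    final = resultant(list(accepted.values())) if accepted else []
    return final, accepted, rejected


# Hypothetical toy vectors: three related words and one outlier.
toy = {
    "vector": [1.0, 0.1],
    "matrix": [0.9, 0.2],
    "tensor": [0.8, 0.0],
    "banana": [-0.5, 1.0],
}
enriched, accepted, rejected = enrich(toy)
before = cosine(toy["banana"], resultant(list(toy.values())))
after = cosine(toy["banana"], enriched)
```

With these toy vectors, "banana" falls below the cutoff in the first pass and is rejected, and its similarity to the enriched resultant (`after`) is lower than its similarity to the initial one (`before`), which is the effect described in step 9. The angle quoted in step 3 can be checked with `math.degrees(math.acos(0.2))`.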

Happy Vectorization…
