Archive for the ‘Text Mining’ Category

Symmetric Sigmoid

May 14th, 2011 No comments
When unknown signed values need to be normalized to the range -1 to +1, a symmetric sigmoid function can prove very useful.  A symmetric sigmoid function is a modified version of the sigmoid function, obtained by stretching it along the Y axis.  As per Wikipedia, the sigmoid function is a special case of the logistic function whose plot looks like an ‘S’, with a Y-axis intercept of 0.5 and lower/upper bounds of 0 and 1 respectively.

The function definition for a sigmoid is:

P(t) = 1.0 / (1.0 + exp(-t))

In programming languages, the implementation should branch on the sign of t, in order to avoid numerical overflow and underflow issues:

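if ( t < 0 )
    return ::exp( t ) / ( 1.0 + ::exp( t ) );   // exp( t ) cannot overflow for negative t
else
    return 1.0 / ( 1.0 + ::exp( -t ) );         // exp( -t ) cannot overflow for non-negative t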

In the plot, if the curve is stretched along the Y axis so that the Y-intercept moves to 0.0, the lower bound gets stretched to -1.0 while the upper bound stays at +1.0.
The function definition for a symmetric sigmoid is:
P symmetric (t) = 2.0 * P(t) – 1.0

The function can also be defined in terms of the hyperbolic tangent as:
P symmetric (t) = tanh(t / 2.0)

With the symmetric sigmoid function, it becomes possible to normalize any value in (-∞, +∞) onto (-1.0, +1.0).
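The definitions above can be sketched in Python as follows. This is only a minimal illustration, assuming the standard logistic form for P(t); the function names `sigmoid` and `symmetric_sigmoid` are chosen here and are not from the original post.

```python
import math


def sigmoid(t):
    """Logistic sigmoid P(t), branching on the sign of t for stability."""
    if t < 0:
        # For negative t, exp(t) is at most 1, so it cannot overflow.
        e = math.exp(t)
        return e / (1.0 + e)
    # For non-negative t, exp(-t) is at most 1, so it cannot overflow.
    return 1.0 / (1.0 + math.exp(-t))


def symmetric_sigmoid(t):
    """Stretch the sigmoid from (0, 1) onto (-1, +1)."""
    return 2.0 * sigmoid(t) - 1.0


if __name__ == "__main__":
    for t in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
        print(t, symmetric_sigmoid(t), math.tanh(t / 2.0))
```

For moderate inputs the printed values should agree with tanh(t / 2) up to floating point rounding, and very large positive or negative inputs saturate at +1.0 and -1.0 instead of raising an overflow error.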

Know the Social Network that’s spun around you

December 20th, 2010 No comments
Social networks have become a buzzword in the contemporary world.  Most of us use them without understanding what they are and what they could do.  Basically, social networks are platforms that enable friends and family to be in touch.  A good deal, eh?  Yes, it started just like that.  It started as a proxy for the pubs, playgrounds and parties where friends hang around.  Instead of requiring you to go and meet a person, these social networks enabled friends to be in touch over instant messaging, blogging, scribbling, and sharing pictures and videos by making use of Internet technology.  This is definitely good, so far.  Do you know what else these social networks could do to you?  Let’s see some possibilities open to an arbitrary ‘X’ whom you do not know at all:-
  1. Find your friends.  If you are a little dumb, one can easily find your crush, the ones you hate, and what not.
  2. Find your likes and dislikes based on your messages, scribblings
  3. Know your places of interest based on what you have written about places, pointers on meeting spots that you chose to meet your friends, etc.
  4. Potentially know your neighborhood, by going through your blogs or scribblings about your neighbors or happenings in your vicinity.
  5. Find out what food you eat, and what you are allergic to.
  6. Find out what type of network connection you have, based on your connectivity logs.
  7. Find out paths from friends-of-friends for connecting an arbitrary person to you through your contacts.
  8. Find out which communities you belong to, and hence your social and professional networks
  9. Find out where you work, what your hobbies are.
  10. Find out where you studied, what you studied, what you aspire to and where you are heading, based on your professional community memberships, activity in forums, answers to queries, etc.
  11. Find who you would meet, when and where based on your conversations
  12. Find out your mood swings, based on the vocabulary of your conversations
  13. Find out your pictures, and the pictures of the related ones in family and profession.
Remember, whatever is meant to be private is never private on social networks, or on any other electronic third-party service for that matter.  I have listed only a few of the possibilities.  Please use these platforms with the fullest caution.

On one side, Internet researchers are working on “Internet Anonymization”, an idea to protect one’s identity; on the other side, Internet giants are pushing “Social Networks”, where they build intelligence on the so-called privacy of people.  What a weird world !?!

Powered by ScribeFire.

Tamil Steganography

December 3rd, 2010 2 comments

A nice discussion on Tamil steganography that is worth sharing.. 🙂

Udayasankar said:

Came via shakty’s blog, interesting note on tamil and cryptography.  would like to know of any specific instances where a cryptic text and associated decipher key is available in Tamil. Are there known historical incidents in Tamil History where a use of cryptographic keys and texts are available, say similar to English Queen Mary’s use of cipher text which indirectly led to her execution. Or the Caesar cipher for instance. Any incidents you can point out.

Sudarshan, as in your comments on the blog, I am aware that the siddhas and mystics used indirect language and metaphors to express themselves to their select cliques, but then that does not amount to more than children making up their own dialects to bypass their elders.  could you point out any place where a real mathematical technique is used? for instance are there kalvettukkal or temple inscriptions which are in code?

I was wondering if the Tamil language did indeed have a crypto concept that has mathematical backing, as sakthy seemed to imply on his blog.  From what he wrote I took it to mean that this was a possibility rather than an established practice.  However, the need to hide data is a universal impulse, and I wondered if Indian science and math did have some background in this.  As to why specifically Tamil, I will explain in a later mail.  I am already aware that the mystics, both Tamil and others (much like the scientists of the eighteenth and earlier centuries: Leonardo, Newton, Leibniz), coded their discoveries in cryptic text.

My response is:

I am not of the opinion that Tamils did mathematical modeling of things that they practiced as naturally as breathing.  The concepts of mathematical modeling and theoretical proof are ideas of the West.  If I am correct, our forefathers were men of practice, primarily relying on observational sciences.  People observed their neighborhood carefully and identified interesting patterns, and these patterns were then connected to suitable inferences derived from more observation and tuning.  If I have to equate that to current technology, the term is “statistical machine learning” without the approximations and model fitting.

I strongly believe that our men had insights which were passed on to their students as experiences, instead of being explained mathematically as the West did.  Our men had methods to conceive and transmit ideas without words (ex: Bhagavath Geetha, 800 stanzas transmitted from the Krishna “character” to the Arjunan “character”).  Our men understood the non-linear time that exists, unlike the linear time definition of the West.  Our men believed in and practiced thought processes instead of scripts.

Of course encryption was used heavily in the past, but in terms of metaphors and symbols.  Our guys did not work at the character-by-character level, but at the level of context and semantics.  Anything semantic is difficult to model because the Tamil language is heavily polysemous.

Text Mining Question Bank

July 19th, 2010 No comments

1. Natural Language Processing

  1. Give 5 examples for Holonyms, Hyponyms, Hypernyms, Metonyms, Meronyms, Homonyms, Synonyms, Polysemes.
  2. Draw the Venn diagram of Spellings-Meanings-Pronunciations.
  3. Why are Context Free Grammars Context free ?
  4. What is the difference between RTN and ATN ?
  5. Give examples of Prepositional Phrases.
  6. Compare CFG and ATN.
  7. Give 5 examples for Anaphora, Cataphora, Endophora, Exophora.
  8. Give 5 examples of NP ellipsis, VP ellipsis.
  9. Write a CFG, ATN for the following:
    1. “Tech Companies queue up for Open Source Professionals”.
    2. I love my language.
    3. Patriotism is not about watching cricket matches together.
    4. AMD’s microcode is more richer than Intel.
    5. Ron Weasley should marry Hermione Granger.
    6. Krishna is a metonym for uncertainty.
    7. PMPO is 8 times that of RMS power measured for a 1KHz signal with an amplitude of 1V.
  10. What are the Named Entities in
    1. “Open Source helps Life Spring Hospitals” ?
    2. I want to work for Burning Glass Technologies Inc.
    3. The university life at SRM is very informal.
    4. AMD Phenom 5500 Black Edition can be unleashed to 4 cores.
    5. Hail Hitler!
    6. Anushka is taller than Surya.
  11. Do NP chunking on
    1. Tips and Tools for measuring the world and beating the odds
    2. The crazy frog is an awesome song
    3. Time flies like arrow.
    4. Thevaram was written by Appar.
    5. Text mining is awfully interesting.
    6. I need to get placed in a good company.
  12. Write a Regular Expression for replacing the beginning and end of all the lines in a text file with the strings “” and “” respectively.
  13. Write a regular expression for capturing Indian mobile numbers, land line numbers and Indian pin codes with maximum possible inherent validation.
  14. Write a regular expression for capturing the vehicle numbers, PAN numbers, Passport numbers in a newspaper article.
  15. Identify rules for capturing dates and discriminating between job dates, education dates and date of birth.
  16. Give examples for Noun stemming in English & {Tamil or Telugu or Hindi} languages.  Transliterate the Indian language.
  17. Give examples for Verb stemming in English & {Tamil or Telugu or Hindi} languages.  Transliterate the Indian language.
  18. How does a spell checker work ?
  19. Take some arbitrary texts and summarize them into a line or two.  Justify the choice of words and sentences in your summary.
  20. Show some examples for word-by-word, sentence-by-sentence, context-by-context machine translation.

2. Information Extraction & Statistical NLP

  1. If Prob(A) is 0.4 and Prob(B) is 0.6, what is Prob(A,B), Prob(A|B), Prob(A u B), Prob(A – B), Prob(A n B) ?  If some data is missing, assume a reasonable value for it.
  2. Let A be a random variable with instances a1, a2, a3, a4, a5.  If P(a1) = 1.8e-4, P(a2) = 5.2e-8, P(a3) = 0.042, P(a4) = 0.00052, P(a5)=0.2, compute Sigma P(A), PI P(A) without mathematical underflow.
  3. Give real life examples for 1st order markov processes.
  4. Give real life examples of Expectation-Maximization.
  5. If p = [[0.1 0.3 0.2 0.4], [0.3 0.4 0.2 0.1], [0.3 0.3 0.1 0.3], [0.2 0.4 0.1 0.3]] is the state transition probability matrix of 4 states {A,B,C,D} in a HMM, calculate P(A->B->C->D).
  6. Based on (5), check whether the probability of state sequence is commutative (ex: P(A->B->C) = P(C->B->A) ?)
  7. If the observation probability is [[.2 .4 .1 .3], [.6 .1 .0 .3], [.0 .0 .0 1.0], [.1 .1 .1 .7], [.4 .4 .1 .1]] for observations {i, j, k, l, m} in states as per(5). Compute the P(O={k,l}).
  8. Annotate the items in (9) of Section 1 and build the state transition, observation, initial probability matrices.
  9. Show that the usage of forward probabilities reduces the time-complexity of the evaluation problem.
  10. Show that the usage of forward-backward probabilities reduces the time-complexity of the decoding problem.
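As a starting point for questions 2, 5 and 6 of this section, here is a hedged Python sketch rather than a model answer. It assumes that each row of the transition matrix is the "from" state and each column the "to" state, and that the initial state probability is left out of the path probability; the helper names `log_product` and `path_probability` are made up for illustration.

```python
import math


def log_product(probs):
    """Return log(p1 * p2 * ... * pn), computed as a sum of logs.

    Multiplying many tiny probabilities directly can underflow to 0.0;
    adding their logarithms keeps the result representable.
    """
    return sum(math.log(p) for p in probs)


def path_probability(transition, states, path):
    """Multiply the transition probabilities along a state path.

    Assumes transition[i][j] is P(next = states[j] | current = states[i]).
    The initial state probability is not included.
    """
    index = {s: i for i, s in enumerate(states)}
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= transition[index[current]][index[nxt]]
    return prob


P_A = [1.8e-4, 5.2e-8, 0.042, 0.00052, 0.2]
TRANSITION = [
    [0.1, 0.3, 0.2, 0.4],
    [0.3, 0.4, 0.2, 0.1],
    [0.3, 0.3, 0.1, 0.3],
    [0.2, 0.4, 0.1, 0.3],
]

if __name__ == "__main__":
    print("sum:", sum(P_A))
    print("log of product:", log_product(P_A))
    print("P(A->B->C->D):", path_probability(TRANSITION, "ABCD", "ABCD"))
    print("P(A->B->C):", path_probability(TRANSITION, "ABCD", "ABC"))
    print("P(C->B->A):", path_probability(TRANSITION, "ABCD", "CBA"))
```

Under these assumptions the forward and reversed paths give different values, since the transition matrix is not symmetric.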

DevCamp 2010 by ThoughtWorks Inc., Chennai.

July 11th, 2010 No comments

Developer Camp 2010
10th July 2010, Chennai

It was my first attempt to take part in a BarCamp / unconference, which excited me very much after reading about them in Wikipedia.  Through some contacts, I was invited to attend the Developer Camp hosted by ThoughtWorks Inc, at Thiru Vi. Ka. Industrial Estate, Ekkattuthangal, Chennai on 10th July 2010.  I had originally offered to give a couple of talks on Text mining and Design patterns.  Though I had some anxiety about whether topics like Text Mining would sell amongst hard core developers, I was comforted by Balaji Damodaran (organizer) that there should be a lot of people interested in exploring AI.

I reached the ThoughtWorks office at 9:15 AM and was surprised to find at least a couple of dozen developers already there.  I had assumed that Saturday morning for hard core developers starts only after 11 AM, but I was happy to be wrong 🙂 I registered myself as one of the developers and opted to talk about “Text Mining Applications”, “Plagiarism Detection”, “Text Classification using Naive Bayes” and “Design Patterns” for the 9:30 AM slot.  The unconference started at around 9:45 with the introduction by Balaji Damodaran.  At that time, at least 70 developers were in the hall (cafeteria).  Then I was asked to start the talk by 10 AM.  When I went to the hall, it had only 5 people as audience, which kind of killed me, as I am used to having a big crowd as my audience (what an EGO I have!?).

I asked a couple of boys from the audience to go hunting for more audience for the talk.  See, I had to advertise and promote my talk, which in fact is critical for everything in the world we live in.  One of the volunteers advised me to use a microphone and start the talk.  When I started the talk, I was surprised to see people walk in and fill up the hall.  The talk went on and on with a lot of interesting examples, which made everyone introspect about the way we see and assess our neighbourhood.  I am sure my audience now understands that everything we see around us and solve could be mathematically modeled and solved using computers.  Hurray, we made it!!

Following that talk, I was asked to talk about Design Patterns, as a lot of developers had voted for that topic.  Ok, I wanted a coffee break!  I went to the cafeteria and made some light South Indian coffee.  I added some pulverized sugar to my coffee and came back to the hall while talking with another developer from LatentView technologies.  To my surprise, the coffee tasted as if it were made with sea water.  Then I realized that I had added salt instead of sugar.  I would like to greet the “brahaspathi” who kept the salt bowl near the coffee vending machine. 🙂

The talk on Design Patterns started in a small room, as the number of votes was ~10 (which is still a large number for unconferences).  When we started the talk, one of the volunteers said he wanted to record it, which was a good idea.  The talk started, and a lot of people began coming into the room, so we had to move to a bigger hall as the audience grew to over 40, which is like “wow”.  The talk went on for a while and we discussed Singleton vs Multiton, Strategy, and Factory vs Bridge patterns with lots of examples.  Overall, it was a wonderful discussion forum where we gained a lot of insight into software design using design patterns.

If I were to use one word to describe the audience, I would say “intriguing”.  It was an awesome experience for me to talk about some of my experiences to the wonderful audience that you had brought in.  It is very rare to find a combination of patient, smart, involved, intelligent and experienced audience members who crave knowledge.  Our talks helped us introspect on the technology that we have been practicing.  The ambiance was very motivating, with a lot of natural light and spaciousness.  Overall, I enjoyed every bit of it.  I am a little depressed that I could not enjoy the food, as I was rushing back to office.  Also, I wanted to take part in the fish bowl about Industry-Academic Co-op, but couldn’t.  I am sure there are a lot of people who benefited from this program; in fact I heard as much from many in the audience after the lecture/talk.

Thanks to Shiv Deepak for introducing DevCamp.
Thanks to Balaji Damodaran for inviting me to the DevCamp.
Thanks to Shaswat Nimesh for the photographs.

EFYTimes news article is here.

Endometriosis, The Turning Point of my Life

April 17th, 2009 No comments

Endometriosis (from endo, “inside”, and metra, “womb”) is a medical condition in women in which endometrial cells are deposited in areas outside the uterine cavity. The uterine cavity is lined by endometrial cells, which are under the influence of female hormones. Endometrial cells deposited in areas outside the uterus (endometriosis) continue to be influenced by these hormonal changes and respond similarly as do those cells found inside the uterus. Symptoms often exacerbate in time with the menstrual cycle. Endometriosis is typically seen during the reproductive years; it has been estimated that it occurs in roughly 5% to 10% of women. Symptoms depend on the site of implantation. Its main but not universal symptom is pelvic pain in various manifestations. Endometriosis is a common finding in women with infertility.

Excerpt from Wikipedia (Original Link)

Professionally, the word “endometriosis” served as a turning point, forcing me to think beyond and invent better technology for text processing systems.  As you know, I am a text mining scientist.  My full-time job is to make computers understand English text.  Lately, I had built a system that identifies the context of supplied text (I process resumes and jobs) based on co-occurrence patterns.  When I queried my system for “secretary”, it came back and said that “endometriosis” is the closest keyword in the same context.  I was perplexed and decided to track down the source.  Interestingly, I found that the documents containing the keyword “endometriosis” were resumes of secretaries who had worked for doctors treating endometriosis.  Here the technique is not wrong, but the data set used for building the system was skewed.  To overcome this defect of the unsupervised system, I was forced to devise an advanced semi-supervised system in which noisy and completely erroneous predictions like the one above could be taken care of.  This “endometriosis” case has helped me think wider and develop better technology that made my life easy and my inventions more accurate.
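To make the skew concrete, here is a toy Python sketch of co-occurrence counting. It is not the system described above; the four tiny "resumes", the function names `cooccurrence` and `closest_keyword`, and the use of raw pair counts as the closeness score are all assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how often each unordered keyword pair shares a document."""
    counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts


def closest_keyword(keyword, counts):
    """Return the keyword that co-occurs most often with `keyword`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == keyword:
            scores[b] += n
        elif b == keyword:
            scores[a] += n
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical, deliberately skewed corpus: most secretaries in it
# happened to work for doctors treating endometriosis.
RESUMES = [
    ["secretary", "endometriosis", "clinic"],
    ["secretary", "endometriosis", "scheduling"],
    ["secretary", "endometriosis"],
    ["secretary", "typing"],
]

if __name__ == "__main__":
    print(closest_keyword("secretary", cooccurrence(RESUMES)))
```

With a corpus this lopsided, the counting itself behaves correctly and still ranks “endometriosis” first, which is the kind of data-driven error the paragraph above describes.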

Randomness

July 17th, 2008 2 comments

Lately, I was wondering how to generate random numbers without
seeding. Oh man, I could not think of randomness without a random seed.
Now, I am realizing that random numbers cannot be generated without the
impregnation of a random entity. Generally we use the current time
(in seconds) for the seeding purpose.
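A small Python sketch of this point, using the standard `random` module; the helper name `sample` is made up here. The same seed always reproduces the same "random" sequence, which is why a changing value such as the current time is commonly used as the seed.

```python
import random
import time


def sample(seed, n=5):
    """Draw n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


if __name__ == "__main__":
    # A fixed seed gives the same sequence on every call.
    print(sample(42))
    print(sample(42))
    # Seeding with the current time (in seconds) changes the sequence
    # from one second to the next.
    print(sample(int(time.time())))
```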

Ok, does the above mean
something else in real life ? Does it mean that there is nothing called
randomness ? Does it mean that anything and everything can be described
by a Generative model (A Generative model tries to establish the entire
distribution based only on the parameters that generate the
distribution; for example words generate documents and topics generate
words ) ? I remember hearing something about chaos theory which
dictates that behind any chaos, there exists a pattern. So, is our life
like a flowchart with the next steps hidden ? Are we calling this
flowchart the fate of life ?

Are we calling people who have
realized these patterns of life as Gods ? What are the ways to learn
these patterns by oneself ? Can one learn and accept his own pattern of
life ? Are we calling this learning process as the salvation ? If there
exists a generator function behind every life, who decides on the
generator function ? Can one decide his own generator ? Do we call this
as winning one’s fate ?

If every life is like a flowchart, what
is the beginning and what is the end ? Will we have recursions and
loops in the flowchart ? Are we calling every loop in the flowchart as
one life ? If so, does one exist across multiple births and deaths ? If
so, who is that one ? If we agree on the loops, does the generator
function maximize or minimize our stay in this world ? What is the loop
termination condition? Is that a likelihood? How are we deciding the
likelihood of life? Are we comparing the actual living and ideal living
? What is ideal living ? Is Ideal living a process of learning and
accepting the generator function ? Will one stay quiet if the generator
function is learnt and accepted ? Can the generator function be also
called the destiny ? Does it mean if one knows the destiny, he becomes
quiet ? Maybe..

Ok, If living in the world is not the ideal
living, should we call this world as a punishment bench ?
Optimistically, shall we call it a training centre ? What is the real
purpose of this generator function ? From the above I can understand
that I am responsible for my own re-birth. Can there be no re-births at
all ? If the reason for re-birth is me, what is the reason for my first
birth ? Moreover, why were there many first births (meaning many souls
co-existing in parallel) ? And when all the loops of all the souls are ended,
what is next ? Is that the real doom’s day ? Am I going beyond the
scope of what I am ? Is my generator function designed to make me ask
these questions? Maybe..

Sweet Summation Vector

July 17th, 2008 No comments

The popular way to represent a Text Document in vector space is by the summation vector (resultant vector) of all the (meaningful) keywords that formed the text document.

There are two ways to identify the keywords from the Document Text:

  1. Use all the words in the document text and depending upon their frequency promote them as keywords or drop them as noise ( words with higher frequency are generally noise words )
  2. In general, scientists maintain a keyword collection with which they do the lookup to identify the set of keywords that generated the document.

Both the methods have upsides and downsides. The trick here is to have a method by which we select only the contextually meaningful keywords from the document text.

Here is one method to enrich the document vector, assuming that the document content is homogeneous ( few similar semantic contexts )

  1. Generate the summation vector ( resultant vector ) using all the chosen keywords
  2. Correlate all the chosen keywords individually against the resultant vector (look out for keywords that show negative correlation or very low correlation)
  3. Set the correlation score cutoff to 0.2 (when cosine similarity is 0.2, the angle between the word vector and the resultant vector is around 78 degrees!)
  4. Remove the words that do not fit the cutoff (threshold) from our selection set of keywords
  5. Generate the Resultant vector again based on the chosen keywords ( after the above filtering )
  6. Iterate steps 2,3,4,5 to get the rejected words ( iteration rejections are to be appended to the master rejection set ) and accepted keywords.
  7. Stop iteration when there are no more keywords to be rejected.
  8. The final summation vector is the enriched resultant vector, which would model the document much more closely than the first one we started with.
  9. We may correlate again the accepted and rejected keywords against the enriched resultant vector to witness the boost in maxima of correlation score for the accepted items and minima of correlation score for rejected items. ( the positive correlations become more positive and negative correlations become more negative in the due course of iterations ).
  10. The final set of the accepted items could be assumed as the actual set of keywords that generate the document.
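The steps above can be sketched in Python as below. This is a minimal illustration under stated assumptions: each keyword is already available as a numeric vector, "correlation" is taken to be cosine similarity, and the names `cosine`, `enrich` and the toy two-dimensional vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def summation(vectors):
    """Component-wise sum (resultant) of a collection of vectors."""
    return [sum(components) for components in zip(*vectors)]


def enrich(keyword_vectors, cutoff=0.2):
    """Iteratively drop keywords that correlate poorly with the resultant.

    Returns (resultant, accepted, rejected); the input dict is not modified.
    """
    accepted = dict(keyword_vectors)
    rejected = {}
    while accepted:
        resultant = summation(accepted.values())
        dropped = [w for w, v in accepted.items()
                   if cosine(v, resultant) < cutoff]
        if not dropped:
            break  # nothing more to reject, so stop iterating
        for word in dropped:
            rejected[word] = accepted.pop(word)
    resultant = summation(accepted.values()) if accepted else []
    return resultant, accepted, rejected


if __name__ == "__main__":
    # Hypothetical 2-dimensional keyword vectors for a programming document.
    vectors = {
        "java": [1.0, 0.1],
        "python": [0.9, 0.2],
        "code": [1.0, 0.0],
        "cricket": [-0.5, 1.0],
    }
    resultant, accepted, rejected = enrich(vectors)
    print("accepted:", sorted(accepted))
    print("rejected:", sorted(rejected))
    print("enriched resultant:", resultant)
```

In this toy data the off-topic keyword falls below the 0.2 cutoff in the first pass, and the second pass rejects nothing, so the loop stops with the remaining keywords as the accepted set.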

Happy Vectorization…