Posts Tagged ‘Text Mining’

Know the Social Network that’s spun around you

December 20th, 2010
Social networks have become a buzzword in the contemporary world.  Most of us use them without understanding what they are and what they could do.  Basically, social networks are platforms that enable friends and family to stay in touch.  A good deal, eh? Yes, it started just like that.  It started as a proxy for the pubs, playgrounds, and parties where friends hang out.  Instead of going out to meet a person, these social networks enabled friends to stay in touch over instant messaging, blogging, scribbling, and sharing pictures and videos by making use of Internet technology.  This is definitely good, so far.  Do you know what else these social networks could do to you?  Let’s see some possibilities for an arbitrary ‘X’ whom you may never even know:
  1. Find your friends.  If you are a little careless, one can easily find your crush, the ones you hate, and what not.
  2. Find your likes and dislikes based on your messages and scribblings.
  3. Know your places of interest based on what you have written about places, the meeting spots you chose for your friends, etc.
  4. Potentially know your neighborhood, by going through your blogs or scribblings about your neighbors or happenings in your vicinity.
  5. Find out what food you eat, and what you are allergic to.
  6. Find out what type of network connection you have, based on your connectivity logs.
  7. Find out paths from friends-of-friends for connecting an arbitrary person to you through your contacts.
  8. Find out which communities you belong to, and hence your social and professional networks
  9. Find out where you work, what your hobbies are.
  10. Find out where you studied, what you studied, what you aspire to, and where you are heading, based on your professional community memberships, activity in forums, answered queries, etc.
  11. Find out whom you would meet, and when and where, based on your conversations.
  12. Find out your mood swings, based on the vocabulary of your conversations.
  13. Find your pictures, and the pictures of those related to you in family and profession.
Remember: whatever is meant to be private is never private on social networks, or on any other third-party electronic service for that matter.  I have listed only a brief set of the potential possibilities.  Please use these platforms with the fullest caution.

On one side, Internet researchers are working on “Internet Anonymization”, an idea to protect one’s identity; on the other, Internet giants are pushing “Social Networks”, where they build intelligence on the so-called private data of people.  What a weird world!?

Powered by ScribeFire.

Text Mining Question Bank

July 19th, 2010

1. Natural Language Processing

  1. Give 5 examples each for Holonyms, Hyponyms, Hypernyms, Metonyms, Meronyms, Homonyms, Synonyms, and Polysemes.
  2. Draw the Venn diagram of Spellings-Meanings-Pronunciations.
  3. Why are Context-Free Grammars context-free?
  4. What is the difference between RTN and ATN?
  5. Give examples of Prepositional Phrases.
  6. Compare CFG and ATN.
  7. Give 5 examples for Anaphora, Cataphora, Endophora, Exophora.
  8. Give 5 examples of NP ellipsis, VP ellipsis.
  9. Write a CFG, ATN for the following:
    1. “Tech Companies queue up for Open Source Professionals”.
    2. I love my language.
    3. Patriotism is not about watching cricket matches together.
    4. AMD’s microcode is richer than Intel’s.
    5. Ron Weasley should marry Hermione Granger.
    6. Krishna is a metonym for uncertainty.
    7. PMPO is 8 times the RMS power measured for a 1 kHz signal with an amplitude of 1 V.
  10. What are the Named Entities in
    1. “Open Source helps Life Spring Hospitals” ?
    2. I want to work for Burning Glass Technologies Inc.
    3. The university life at SRM is very informal.
    4. AMD Phenom 5500 Black Edition can be unleashed to 4 cores.
    5. Hail Hitler!
    6. Anushka is taller than Surya.
  11. Do NP chunking on
    1. Tips and Tools for measuring the world and beating the odds
    2. The crazy frog is an awesome song
    3. Time flies like an arrow.
    4. Thevaram was written by Appar.
    5. Text mining is awfully interesting.
    6. I need to get placed in a good company.
  12. Write a Regular Expression for replacing the beginning and end of all the lines in a text file with the strings “” and “” respectively.
  13. Write a regular expression for capturing Indian mobile numbers, land line numbers and Indian pin codes with maximum possible inherent validation.
  14. Write a regular expression for capturing the vehicle numbers, PAN numbers, and passport numbers in a newspaper article.
  15. Identify rules for capturing dates and for discriminating among job dates, education dates, and dates of birth.
  16. Give examples of noun stemming in English & {Tamil or Telugu or Hindi}.  Transliterate the Indian-language examples.
  17. Give examples of verb stemming in English & {Tamil or Telugu or Hindi}.  Transliterate the Indian-language examples.
  18. How does a spell checker work?
  19. Take some arbitrary texts and summarize them into a line or two.  Justify the choice of words and sentences in your summary.
  20. Show some examples for word-by-word, sentence-by-sentence, context-by-context machine translation.
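Question 13 above lends itself to a quick sketch.  Below is a rough attempt in Python; the validation rules (mobiles are 10 digits starting 6–9 with an optional +91/0 prefix, PIN codes are 6 digits not starting with 0, landlines are an STD code beginning with 0 followed by 6–8 digits) are my assumptions, not official numbering-plan specifications:

```python
import re

# Rough patterns for question 13 -- the validation rules here are
# assumptions, not official numbering-plan specifications.
MOBILE = re.compile(r"(?<!\d)(?:\+91[\-\s]?|0)?([6-9]\d{9})\b")  # 10 digits, starts 6-9
PINCODE = re.compile(r"\b([1-9]\d{5})\b")                        # 6 digits, no leading 0
LANDLINE = re.compile(r"\b(0\d{1,4})[\-\s]?(\d{6,8})\b")         # STD code + local number

text = "Call +91 9840012345 or 044-24561234; PIN 600032."
print(MOBILE.findall(text))    # ['9840012345']
print(PINCODE.findall(text))   # ['600032']
print(LANDLINE.findall(text))  # [('044', '24561234')]
```

The negative lookbehind stops the mobile pattern from firing inside a longer digit run, which is about as much “inherent validation” as a single regex can carry.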

2. Information Extraction & Statistical NLP

  1. If Prob(A) is 0.4 and Prob(B) is 0.6, what are Prob(A,B), Prob(A|B), Prob(A ∪ B), Prob(A − B), and Prob(A ∩ B)?  If some data is missing, assume a reasonable value for it.
  2. Let A be a random variable with instances a1, a2, a3, a4, a5.  If P(a1) = 1.8e-4, P(a2) = 5.2e-8, P(a3) = 0.042, P(a4) = 0.00052, and P(a5) = 0.2, compute Σ P(A) and Π P(A) without mathematical underflow.
  3. Give real life examples for 1st order markov processes.
  4. Give real life examples of Expectation-Maximization.
  5. If P = [[0.1 0.3 0.2 0.4], [0.3 0.4 0.2 0.1], [0.3 0.3 0.1 0.3], [0.2 0.4 0.1 0.3]] is the state transition probability matrix over the 4 states {A,B,C,D} in an HMM, calculate P(A→B→C→D).
  6. Based on (5), check whether the probability of state sequence is commutative (ex: P(A->B->C) = P(C->B->A) ?)
  7. If the observation probability matrix is [[.2 .4 .1 .3], [.6 .1 .0 .3], [.0 .0 .0 1.0], [.1 .1 .1 .7], [.4 .4 .1 .1]] for observations {i, j, k, l, m} in the states as per (5), compute P(O = {k, l}).
  8. Annotate the items in (9) of Section 1 and build the state transition, observation, initial probability matrices.
  9. Show that the use of forward probabilities reduces the time complexity of the evaluation problem.
  10. Show that the use of forward-backward probabilities reduces the time complexity of the decoding problem.


DevCamp 2010 by ThoughtWorks Inc., Chennai.

July 11th, 2010

Developer Camp 2010
10th July 2010, Chennai

It was my first attempt at taking part in a BarCamp / unconference, something that had excited me very much ever since I read about them on Wikipedia.  Through some contacts, I was invited to attend the Developer Camp hosted by ThoughtWorks Inc. at Thiru Vi. Ka. Industrial Estate, Ekkattuthangal, Chennai on 10th July 2010.  I had originally offered to give a couple of talks on Text Mining and Design Patterns.  Though I had some anxiety about whether topics like Text Mining would sell amongst hard-core developers, I was comforted by Balaji Damodaran (the organizer) that there would be a lot of people interested in exploring AI.

    I reached the ThoughtWorks office at 9:15 AM and was surprised to find at least a couple of dozen developers already in.  Saturday mornings for hard-core developers start only after 11 AM, or so I thought; I was happy to be proved wrong 🙂 I registered myself as one of the developers and opted to talk about “Text Mining Applications”, “Plagiarism Detection”, “Text Classification using Naive Bayes”, and “Design Patterns” for the 9:30 AM slot.  The unconference started at around 9:45 with an introduction by Balaji Damodaran.  At that time, at least 70 developers were there in the hall (the cafeteria).  Then I was asked to start my talk by 10 AM.  When I went to the hall, it had only 5 people in the audience, which kind of killed me, as I am used to having a big crowd for an audience (what an EGO I have!?).

   I asked a couple of boys from the audience to go hunting for more audience for the talk.  See, I had to advertise and promote my talk, which in fact is critical for everything in the world we live in.  One of the volunteers advised me to use a microphone and start the talk.  When I started, I was surprised to see people walk in and fill up the hall.  The talk went on and on, with a lot of interesting examples that made everyone introspect about the way we see and assess our neighbourhood.   I am sure my audience now understands that everything we see around us could be mathematically modeled and solved using computers.  Hurray, we made it!!

    Following that talk, I was asked to speak about Design Patterns, as a lot of developers had voted for that topic.  But first, I wanted a coffee break! I went to the cafeteria and made some light south Indian coffee.  I added some pulverized sugar to my coffee and came back to the hall while talking with another developer from LatentView Technologies.  To my surprise, the coffee tasted as if it were made with sea water. Then I realized that I had added salt instead of sugar.  I would like to greet the “brahaspathi” who kept the salt bowl near the coffee vending machine. 🙂

    The talk on Design Patterns started in a small room, as the number of votes was ~10 (which is still a large number in unconferences). When we started, one of the volunteers said he would like to record the talk, which was a good idea. As the talk progressed, a lot of people started to come into the room, and we had to move to a bigger hall as the audience grew to over 40, which was like “wow”. The talk went on for a while, and we discussed Singleton vs. Multiton, Strategy, and Factory vs. Bridge patterns with lots of examples. Overall, it was a wonderful discussion forum where we gained a lot of insight into software design using design patterns.

    If I were to use one word to describe the audience, I would say “intriguing”.  It was an awesome experience for me to share some of my experiences with the wonderful audience the organizers had brought in.  It is very rare to find such a combination of patient, smart, involved, intelligent, experienced listeners who crave knowledge.  Our talks helped us introspect on the technology we have been practicing. The ambiance was very motivating, with a lot of natural light and spaciousness.  Overall, I enjoyed every bit of it.  I am a little depressed that I could not enjoy the food, as I was rushing back to office.  Also, I wanted to take part in the fish bowl on Industry-Academic Co-op, but couldn’t.  I am sure there are a lot of people who benefited from this program; in fact, I heard that from a lot of the audience after the talks.

Thanks to Shiv Deepak for introducing DevCamp.
Thanks to Balaji Damodaran for inviting me to the DevCamp.
Thanks to Shaswat Nimesh for the photographs.

EFYTimes news article is here.


Endometriosis, The Turning Point of my Life

April 17th, 2009

Endometriosis (from endo, “inside”, and metra, “womb”) is a medical condition in women in which endometrial cells are deposited in areas outside the uterine cavity. The uterine cavity is lined by endometrial cells, which are under the influence of female hormones. Endometrial cells deposited in areas outside the uterus (endometriosis) continue to be influenced by these hormonal changes and respond similarly as do those cells found inside the uterus. Symptoms often exacerbate in time with the menstrual cycle. Endometriosis is typically seen during the reproductive years; it has been estimated that it occurs in roughly 5% to 10% of women. Symptoms depend on the site of implantation. Its main but not universal symptom is pelvic pain in various manifestations. Endometriosis is a common finding in women with infertility.

Excerpts from Wikipedia (original link)

Professionally, the word “endometriosis” served as a turning point, forcing me to think beyond and invent better technology for text processing systems. As you know, I am a text mining scientist. My full-time job is to make computers understand English text.  Lately, I had built a system that would identify the context of supplied text (I process resumes and jobs) based on co-occurrence patterns.  When I queried for “secretary” in my system, it came back and said “endometriosis” was the closest keyword in the same context.  I was perplexed and decided to trace the source.  Interestingly, I found that the documents containing the keyword “endometriosis” were resumes of secretaries who had worked for doctors treating endometriosis.  The technique here is not wrong, but the data set used for building the system was skewed.  To overcome this defect of an unsupervised system, I was forced to devise an advanced semi-supervised system where noisy and completely erroneous predictions like the above could be taken care of.  This “endometriosis” case has helped me think wider and develop better technology that made my life easier and my inventions more accurate.
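The system itself is not described in detail here, but the failure mode can be illustrated with a toy co-occurrence count.  The corpus below is invented purely for illustration, skewed the way the story describes:

```python
from collections import Counter
from itertools import combinations

# An invented toy corpus: every resume that mentions "secretary" also
# mentions "endometriosis", because those secretaries worked for doctors
# who treat it.
docs = [
    "secretary scheduling endometriosis clinic appointments",
    "secretary filing endometriosis patient records",
    "software engineer java development",
]

cooc = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc.split())), 2):
        cooc[pair] += 1

# A purely unsupervised co-occurrence model now sees "secretary" and
# "endometriosis" together in 2 of 3 documents and calls them close.
print(cooc[("endometriosis", "secretary")])  # 2
```

The counting is correct; it is the skewed sample that manufactures the spurious “context”, which is exactly why a semi-supervised correction layer helps.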


Sweet Summation Vector

July 17th, 2008

The popular way to represent a Text Document in vector space is by the summation vector (resultant vector) of all the (meaningful) keywords that formed the text document.

There are two ways to identify the keywords from the Document Text:

  1. Use all the words in the document text and, depending upon their frequency, promote them as keywords or drop them as noise (words with higher frequency are generally noise words).
  2. In general, scientists maintain a keyword collection with which they do the lookup to identify the set of keywords that generated the document.

Both the methods have upsides and downsides. The trick here is to have a method by which we select only the contextually meaningful keywords from the document text.

Here is one method to enrich the document vector, assuming that the document content is homogeneous (a few similar semantic contexts):

  1. Generate the summation vector ( resultant vector ) using all the chosen keywords
  2. Correlate all the chosen keywords individually against the resultant vector (look out for keywords that show negative correlation or very low correlation)
  3. Place a cutoff on the correlation score at 0.2 (when the cosine similarity is 0.2, the angle between the word and the resultant vector is about 78 degrees!)
  4. Remove the words that do not fit the cutoff (threshold) from our selection set of keywords
  5. Generate the Resultant vector again based on the chosen keywords ( after the above filtering )
  6. Iterate steps 2–5 to collect the rejected words (each iteration’s rejections are appended to the master rejection set) and the accepted keywords.
  7. Stop iterating when there are no more keywords to be rejected.
  8. The final summation vector is the enriched resultant vector, which models the document much more closely than the one we started with.
  9. We may correlate the accepted and rejected keywords against the enriched resultant vector again to witness the boost in the scores of the accepted items and the drop in those of the rejected items (the positive correlations become more positive and the negative correlations become more negative over the iterations).
  10. The final set of accepted items can be taken as the actual set of keywords that generated the document.
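The steps above can be sketched in plain Python.  The toy word vectors below are invented for illustration (real systems would use high-dimensional term vectors), and the cutoff defaults to the 0.2 from step 3:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def enrich(word_vectors, cutoff=0.2):
    # Steps 1-7: build the summation vector, drop keywords whose cosine
    # against it falls below the cutoff, rebuild, and repeat until no
    # further keyword is rejected.
    accepted = dict(word_vectors)
    rejected = {}
    while True:
        resultant = [sum(col) for col in zip(*accepted.values())]
        drops = {w: v for w, v in accepted.items()
                 if cosine(v, resultant) < cutoff}
        if not drops or len(drops) == len(accepted):
            return resultant, accepted, rejected
        rejected.update(drops)
        for w in drops:
            del accepted[w]

# Invented toy vectors: three coherent keywords and one off-topic outlier.
vecs = {"text": [1, 0.8, 0], "mining": [1, 1, 0],
        "vector": [0.9, 1, 0.1], "salt": [-1, 0, 1]}
resultant, accepted, rejected = enrich(vecs)
print(sorted(accepted))  # ['mining', 'text', 'vector']
print(sorted(rejected))  # ['salt']
```

The guard against dropping every keyword at once keeps the summation well-defined even when a document is all noise.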

Happy Vectorization…