Posts Tagged ‘Text Mining’

Text Mining Question Bank

July 19th, 2010

Natural Language Processing

  1. Give 5 examples each of Holonyms, Hyponyms, Hypernyms, Metonyms, Meronyms, Homonyms, Synonyms, Polysemes.
  2. Draw the Venn diagram of Spellings-Meanings-Pronunciations.
  3. Why are Context Free Grammars Context free ?
  4. What is the difference between RTN and ATN ?
  5. Give examples of Prepositional Phrases.
  6. Compare CFG and ATN.
  7. Give 5 examples for Anaphora, Cataphora, Endophora, Exophora.
  8. Give 5 examples of NP ellipsis, VP ellipsis.
  9. Write a CFG, ATN for the following:
    1. “Tech Companies queue up for Open Source Professionals”.
    2. I love my language.
    3. Patriotism is not about watching cricket matches together.
    4. AMD’s microcode is richer than Intel’s.
    5. Ron Weasley should marry Hermione Granger.
    6. Krishna is a metonym for uncertainty.
    7. PMPO is 8 times that of RMS power measured for a 1KHz signal with an amplitude of 1V.
  10. What are the Named Entities in
    1. “Open Source helps Life Spring Hospitals” ?
    2. I want to work for Burning Glass Technologies Inc.
    3. The university life at SRM is very informal.
    4. AMD Phenom 5500 Black Edition can be unleashed to 4 cores.
    5. Hail Hitler!
    6. Anushka is taller than Surya.
  11. Do NP chunking on
    1. Tips and Tools for measuring the world and beating the odds
    2. The crazy frog is an awesome song
    3. Time flies like an arrow.
    4. Thevaram was written by Appar.
    5. Text mining is awfully interesting.
    6. I need to get placed in a good company.
  12. Write a Regular Expression for replacing the beginning and end of all the lines in a text file with the strings “<BOL>” and “<EOL>” respectively.
  13. Write a regular expression for capturing Indian mobile numbers, land line numbers and Indian pin codes with maximum possible inherent validation.
  14. Write a regular expression for capturing the vehicle numbers, PAN numbers and passport numbers in a newspaper article.
  15. Identify rules for capturing dates and discriminating between job dates, education dates and dates of birth.
  16. Give examples for Noun stemming in English & {Tamil or Telugu or Hindi} languages.  Transliterate the Indian language.
  17. Give examples for Verb stemming in English & {Tamil or Telugu or Hindi} languages.  Transliterate the Indian language.
  18. How does a spell checker work ?
  19. Take some arbitrary texts and summarize each of them into a line or two.  Justify your choice of words and sentences in the summary.
  20. Show some examples for word-by-word, sentence-by-sentence, context-by-context machine translation.
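
As a starting point for question 12, here is one possible answer sketched in Python. It assumes the standard `re` module with the `re.MULTILINE` flag, so that `^` and `$` anchor at every line rather than only at the ends of the text. The function name `mark_lines` is made up for illustration.

```python
import re

def mark_lines(text: str) -> str:
    """Wrap every line of `text` in <BOL> ... <EOL> markers."""
    # re.MULTILINE makes ^ match at the start of each line and $ match
    # just before each newline. The dot does not match a newline, so
    # (.*) captures exactly one line's content.
    return re.sub(r"^(.*)$", r"<BOL>\1<EOL>", text, flags=re.MULTILINE)

print(mark_lines("first line\nsecond line"))
```

Text that ends with a newline may pick up an extra empty marked line at the end, so stripping the trailing newline first is worth considering.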

Information Extraction & Statistical NLP

  1. If Prob(A) is 0.4 and Prob(B) is 0.6, what is Prob(A,B), Prob(A|B), Prob(A ∪ B), Prob(A − B), Prob(A ∩ B) ?  If some data is missing, assume a reasonable value for it.
  2. Let A be a random variable with instances a1, a2, a3, a4, a5.  If P(a1) = 1.8e-4, P(a2) = 5.2e-8, P(a3) = 0.042, P(a4) = 0.00052, P(a5)=0.2, compute ∑P(A), ∏P(A) without mathematical underflow.
  3. Give real life examples of first-order Markov processes.
  4. Give real life examples of Expectation-Maximization.
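
For question 2, the usual trick is to work with logarithms: a product becomes a sum of logs, and a sum can be computed with the log-sum-exp identity. The sketch below uses only Python's `math` module; the function names `log_product` and `log_sum` are invented for this example, and it is one possible approach rather than the only valid answer.

```python
import math

def log_product(probs):
    """Return log(p1 * p2 * ...) by adding logs, never multiplying."""
    return sum(math.log(p) for p in probs)

def log_sum(probs):
    """Return log(p1 + p2 + ...) using the log-sum-exp identity."""
    logs = [math.log(p) for p in probs]
    largest = max(logs)
    # Factoring out the largest term keeps every exponent <= 0,
    # so exp() cannot overflow and the dominant term is exact.
    return largest + math.log(sum(math.exp(x - largest) for x in logs))

# The probabilities given in question 2.
probs = [1.8e-4, 5.2e-8, 0.042, 0.00052, 0.2]
print(log_product(probs), log_sum(probs))
```

With only five values the plain product would still fit in a double; the log form matters once hundreds of such probabilities are multiplied together.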

    Powered by ScribeFire.

DevCamp 2010 by ThoughtWorks Inc., Chennai.

July 11th, 2010

Developer Camp 2010
10th July 2010, Chennai

It was my first attempt to take part in a BarCamp / unconference, an idea that had excited me very much after reading about it on Wikipedia.  Through some contacts, I was invited to attend the Developer Camp hosted by ThoughtWorks Inc. at Thiru Vi. Ka. Industrial Estate, Ekkattuthangal, Chennai on 10th July 2010.  I had originally offered to give a couple of talks on Text mining and Design patterns.  Though I had some anxiety about whether topics like Text Mining would sell amongst hard core developers, Balaji Damodaran (the organizer) reassured me that there would be a lot of people interested in exploring AI.

    I reached the ThoughtWorks office at 9:15AM and was surprised to find at least a couple of dozen developers already there.  I had assumed that Saturday morning for hard core developers starts only after 11AM, but I was happy to be wrong :) I registered myself as one of the developers and offered to talk about “Text Mining Applications”, “Plagiarism Detection”, “Text Classification using Naive Bayes” and “Design Patterns” for the 9:30AM slot.  The unconference started at around 9:45 with an introduction by Balaji Damodaran.  By that time, at least 70 developers were in the hall (cafeteria).  Then I was asked to start the talk at 10AM.  When I went to the hall, it had only 5 people as audience, which kind of killed me as I am used to having a big crowd as my audience (what an EGO I have!?).

   I asked a couple of boys in the audience to go hunting for more people for the talk.  See, I had to advertise and promote my own talk, which in fact is critical for everything in the world we live in.  One of the volunteers advised me to use a microphone and start the talk.  When I started, I was surprised to see people walk in and fill up the hall.  The talk went on and on with a lot of interesting examples, which made everyone introspect about the way we see and assess our neighbourhood.   I am sure my audience now understands that everything we see around us and try to solve could be mathematically modeled and solved using computers.  Hurray, we made it!!

    Following that talk, I was asked to talk about Design patterns, as a lot of developers had voted for that topic.  Ok, I wanted a coffee break! I went to the cafeteria and made some light south Indian coffee.  I added some pulverized sugar to my coffee and came back to the hall while talking with another developer from LatentView technologies.  To my surprise, the coffee tasted as if it had been made with sea water. Then I realized that I had added salt instead of sugar.  I would like to greet the “brahaspathi” who kept the salt bowl near the coffee vending machine. :)

    The talk on Design patterns started in a small room as the number of votes was ~10 (which is still a large number in unconferences). When we started that talk, one of the volunteers said he wanted to record it, which was a good idea. The talk started, and a lot of people began coming into the room, so we had to move to a bigger hall as the audience grew to over 40, which is like “wow”. The talk went on for a while and we discussed Singleton vs Multiton, Strategy, and Factory vs Bridge patterns with lots of examples. Overall, it was a wonderful discussion forum where we gained a lot of insight into software design using design patterns.

    If I were to use one word to describe the audience, I would say “intriguing”.  It was an awesome experience for me to talk about some of my experiences to the wonderful audience that you had brought in.  It is very rare to find a patient, smart, involved, intelligent and experienced audience that craves knowledge.  Our talks helped us to introspect on the technology that we have been practicing. The ambiance was very motivating, with a lot of natural light and spaciousness.  Overall, I enjoyed every bit of it.  I am a little depressed that I could not enjoy the food as I was rushing back to the office.  Also, I wanted to take part in the fish bowl about Industry-Academic Co-op, but couldn’t.  I am sure there are a lot of people who benefited from this program; in fact, I heard as much from many in the audience after the talks.

Thanks to Shiv Deepak for introducing DevCamp.
Thanks to Balaji Damodaran for inviting me to the DevCamp.
Thanks to Shaswat Nimesh for the photographs.

EFYTimes news article is here.


Endometriosis, The Turning Point of my Life

April 17th, 2009

Endometriosis (from endo, “inside”, and metra, “womb”) is a medical condition in women in which endometrial cells are deposited in areas outside the uterine cavity. The uterine cavity is lined by endometrial cells, which are under the influence of female hormones. Endometrial cells deposited in areas outside the uterus (endometriosis) continue to be influenced by these hormonal changes and respond similarly as do those cells found inside the uterus. Symptoms often exacerbate in time with the menstrual cycle. Endometriosis is typically seen during the reproductive years; it has been estimated that it occurs in roughly 5% to 10% of women. Symptoms depend on the site of implantation. Its main but not universal symptom is pelvic pain in various manifestations. Endometriosis is a common finding in women with infertility.

Excerpt from Wikipedia.

Professionally, the word “Endometriosis” served as a turning point, forcing me to think beyond and invent better technology for text processing systems. As you know, I am a text mining scientist. My full time job is to make computers understand English text.  Recently, I had built a system that identifies the context of supplied text (I process resumes and jobs) based on co-occurrence patterns.  When I queried my system for “secretary”, it came back and said that “endometriosis” was the closest keyword in the same context.  I was perplexed and decided to track down the source.  Interestingly, I found that the documents containing the keyword “endometriosis” were resumes of secretaries who had worked for doctors treating endometriosis.  Here the technique is not wrong, but the data set used for building the system was skewed.  To overcome this defect of the unsupervised system, I was forced to devise an advanced semi-supervised system in which noisy and completely erroneous predictions like the one above could be taken care of.  This “endometriosis” case has helped me think wider and develop better technology that made my life easier and my inventions more accurate.
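
To see how a skewed data set can produce such a result, here is a toy sketch in Python. It is not the system described above: the function `closest_keyword`, the raw document-level co-occurrence count and the four miniature “resumes” are all made up for illustration.

```python
from collections import Counter

def closest_keyword(documents, query):
    """Return the word that shares the most documents with `query`."""
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        if query in words:
            # Count every other word once per document containing the query.
            counts.update(words - {query})
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical, deliberately skewed corpus: most secretaries in it
# happened to work for doctors treating endometriosis.
resumes = [
    "secretary endometriosis clinic",
    "secretary endometriosis scheduling",
    "secretary endometriosis filing",
    "secretary typing",
]
print(closest_keyword(resumes, "secretary"))
```

The counting itself is doing exactly what it was asked to do; the surprising neighbour comes entirely from what the corpus happens to contain.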


Sweet Summation Vector

July 17th, 2008

A popular way to represent a Text Document in vector space is the summation vector (resultant vector) of all the (meaningful) keywords that form the text document.

There are two ways to identify the keywords from the Document Text:

  1. Use all the words in the document text and, depending upon their frequency, promote them as keywords or drop them as noise ( words with higher frequency are generally noise words )
  2. Maintain a keyword collection and look up the document's words in it to identify the set of keywords that generated the document ( this is what scientists generally do )
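
Both approaches can be sketched in a few lines of Python. The function names, the whitespace tokenization and the `max_share` threshold below are assumptions made for this example; they are not prescribed by either method.

```python
from collections import Counter

def keywords_by_frequency(text, max_share=0.2):
    """Way 1: drop words whose share of all tokens exceeds max_share."""
    words = text.lower().split()
    counts = Counter(words)
    return {w for w, c in counts.items() if c / len(words) <= max_share}

def keywords_by_lookup(text, vocabulary):
    """Way 2: keep only the words found in a maintained keyword collection."""
    return {w for w in text.lower().split() if w in vocabulary}

sample = "the cat and the dog and the bird"
print(keywords_by_frequency(sample))
print(keywords_by_lookup(sample, {"cat", "fish"}))
```

The first needs no curated list but depends on a threshold; the second is only as good as the keyword collection behind it.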

Both methods have upsides and downsides. The trick here is to have a method by which we select only the contextually meaningful keywords from the document text.

Here is one method to enrich the document vector, assuming that the document content is homogeneous ( a few similar semantic contexts ):

  1. Generate the summation vector ( resultant vector ) using all the chosen keywords
  2. Correlate all the chosen keywords individually against the resultant vector (look out for keywords that show negative correlation or very low correlation)
  3. Place a cutoff on the correlation score at 0.2 ( when cosine similarity is 0.2, the angle between the word and the resultant vector is around 78 degrees! )
  4. Remove the words that do not fit the cutoff (threshold) from our selection set of keywords
  5. Generate the Resultant vector again based on the chosen keywords ( after the above filtering )
  6. Iterate steps 2,3,4,5 to get the rejected words ( iteration rejections are to be appended to the master rejection set ) and accepted keywords.
  7. Stop iteration when there are no more keywords to be rejected.
  8. The final summation vector is the enriched resultant vector, which would model the document much more closely than the first one we started with.
  9. We may correlate the accepted and rejected keywords again against the enriched resultant vector to witness the boost in maxima of correlation score for the accepted items and minima of correlation score for rejected items. ( the positive correlations become more positive and negative correlations become more negative over the course of the iterations ).
  10. The final set of the accepted items could be assumed as the actual set of keywords that generate the document.
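
The steps above can be sketched in plain Python as follows. This is a minimal illustration: it assumes each keyword already has a numeric vector, and the function names (`cosine`, `resultant`, `enrich`) and the three-dimensional toy vectors are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def resultant(vectors):
    """Component-wise sum of the vectors (the summation vector)."""
    return [sum(components) for components in zip(*vectors)]

def enrich(word_vectors, cutoff=0.2):
    """Drop keywords below the cutoff until no more are rejected."""
    accepted = dict(word_vectors)
    rejected = set()  # master rejection set across iterations
    while accepted:
        total = resultant(list(accepted.values()))
        dropped = {w for w, v in accepted.items() if cosine(v, total) < cutoff}
        if not dropped:
            break  # step 7: nothing left to reject
        rejected |= dropped
        for w in dropped:
            del accepted[w]
    final = resultant(list(accepted.values())) if accepted else []
    return accepted, rejected, final

# Hypothetical keyword vectors: three related words and one outlier.
toy_vectors = {
    "python": [1.0, 0.1, 0.0],
    "java": [0.9, 0.2, 0.0],
    "compiler": [0.8, 0.0, 0.1],
    "recipe": [-0.5, 0.0, 1.0],
}
accepted, rejected, final = enrich(toy_vectors)
print(sorted(accepted), sorted(rejected))
```

In this toy data the outlier starts with a correlation near zero against the first resultant vector and ends up negatively correlated with the enriched one, which is the effect described in step 9.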

Happy Vectorization…