Text Mining Question Bank

July 19th, 2010

Natural Language Processing

  1. Give 5 examples each for Holonyms, Hyponyms, Hypernyms, Metonyms, Meronyms, Homonyms, Synonyms, Polysemes.
  2. Draw the Venn diagram of Spellings-Meanings-Pronunciations.
  3. Why are Context Free Grammars context-free?
  4. What is the difference between RTN and ATN?
  5. Give examples of Prepositional Phrases.
  6. Compare CFG and ATN.
  7. Give 5 examples for Anaphora, Cataphora, Endophora, Exophora.
  8. Give 5 examples of NP ellipsis, VP ellipsis.
  9. Write a CFG, ATN for the following:
    1. “Tech Companies queue up for Open Source Professionals”.
    2. I love my language.
    3. Patriotism is not about watching cricket matches together.
    4. AMD’s microcode is richer than Intel’s.
    5. Ron Weasley should marry Hermione Granger.
    6. Krishna is a metonym for uncertainty.
    7. PMPO is 8 times that of RMS power measured for a 1 kHz signal with an amplitude of 1 V.
  10. What are the Named Entities in
    1. “Open Source helps Life Spring Hospitals” ?
    2. I want to work for Burning Glass Technologies Inc.
    3. The university life at SRM is very informal.
    4. AMD Phenom 5500 Black Edition can be unleashed to 4 cores.
    5. Hail Hitler!
    6. Anushka is taller than Surya.
  11. Do NP chunking on
    1. Tips and Tools for measuring the world and beating the odds
    2. The crazy frog is an awesome song
    3. Time flies like an arrow.
    4. Thevaram was written by Appar.
    5. Text mining is awfully interesting.
    6. I need to get placed in a good company.
  12. Write a Regular Expression for replacing the beginning and end of all the lines in a text file with the strings “<BOL>” and “<EOL>” respectively.
  13. Write a regular expression for capturing Indian mobile numbers, land line numbers and Indian pin codes with maximum possible inherent validation.
  14. Write a regular expression for capturing the vehicle numbers, PAN numbers and passport numbers in a newspaper article.
  15. Identify rules for capturing dates and for discriminating between job dates, education dates and date of birth.
  16. Give examples for Noun stemming in English & {Tamil or Telugu or Hindi} languages.  Transliterate the Indian language.
  17. Give examples for Verb stemming in English & {Tamil or Telugu or Hindi} languages.  Transliterate the Indian language.
  18. How does a spell checker work?
  19. Take some arbitrary texts and summarize them into a line or two.  Justify the choice of words and sentences in your summary.
  20. Show some examples for word-by-word, sentence-by-sentence, context-by-context machine translation.

Information Extraction & Statistical NLP

  1. If Prob(A) is 0.4 and Prob(B) is 0.6, what is Prob(A,B), Prob(A|B), Prob(A ∪ B), Prob(A – B), Prob(A ∩ B)?  If some data is missing, assume a reasonable value for it.
  2. Let A be a random variable with instances a1, a2, a3, a4, a5.  If P(a1) = 1.8e-4, P(a2) = 5.2e-8, P(a3) = 0.042, P(a4) = 0.00052, P(a5) = 0.2, compute ∑P(A), ∏P(A) without mathematical underflow.
  3. Give real life examples for 1st order Markov processes.
  4. Give real life examples of Expectation-Maximization.

    Powered by ScribeFire.

DevCamp 2010 by ThoughtWorks Inc., Chennai.

July 11th, 2010

Developer Camp 2010
10th July 2010, Chennai

It was my first attempt to take part in a BarCamp / unconference, an idea that had excited me ever since I read about it on Wikipedia.  Through some contacts, I was invited to attend the Developer Camp hosted by ThoughtWorks Inc. at Thiru Vi. Ka. Industrial Estate, Ekkattuthangal, Chennai on 10th July 2010.  I had originally offered to give a couple of talks, on Text Mining and on Design Patterns.  Though I was anxious about whether a topic like Text Mining would sell among hard-core developers, Balaji Damodaran (the organizer) reassured me that a lot of people would be interested in exploring AI.

I reached the ThoughtWorks office at 9:15 AM and was surprised to find at least a couple of dozen developers already there.  I had assumed that Saturday mornings for hard-core developers start only after 11 AM, but I was happy to be wrong :) I registered as one of the developers and offered to talk about “Text Mining Applications”, “Plagiarism Detection”, “Text Classification using Naive Bayes” and “Design Patterns” in the 9:30 AM slot.  The unconference started at around 9:45 with an introduction by Balaji Damodaran.  By then, at least 70 developers were in the hall (the cafeteria).  I was then asked to start my talk at 10 AM.  When I went to the hall, there were only 5 people in the audience, which kind of killed me, as I am used to having a big crowd as my audience (what an EGO I have!?).

I asked a couple of the boys in the audience to go hunting for more listeners.  See, I had to advertise and promote my own talk, which in fact is critical for everything in the world we live in.  One of the volunteers advised me to use a microphone and just start.  When I started, I was surprised to see people walk in and fill up the hall.  The talk went on and on, with a lot of interesting examples that made everyone introspect about the way we see and assess our neighbourhood.  I am sure my audience now understands that everything we see around us and try to solve can be modelled mathematically and solved using computers.  Hurray, we made it!!

Following that talk, I was asked to talk about Design Patterns, as a lot of developers had voted for that topic.  OK, but I wanted a coffee break first!  I went to the cafeteria and made some light South Indian coffee.  I added some powdered sugar to my coffee and walked back to the hall, chatting with another developer from LatentView technologies.  To my surprise, the coffee tasted as if it had been made with sea water.  Then I realized that I had added salt instead of sugar.  I would like to greet the “brahaspathi” who kept the salt bowl near the coffee vending machine. :)

The talk on Design Patterns started in a small room, as the number of votes was ~10 (which is still a large number for an unconference).  As we were starting, one of the volunteers said he wanted to record the talk, which was a good idea.  Once the talk began, a lot of people started coming into the room, and we had to move to a bigger hall as the audience grew to over 40, which is like “wow”.  The talk went on for a while and we discussed Singleton vs Multiton, Strategy, and Factory vs Bridge patterns with lots of examples.  Overall, it was a wonderful discussion forum where we gained a lot of insight into software design using design patterns.

If I were to use one word to describe the audience, I would say “intriguing”.  It was an awesome experience for me to share some of my experiences with the wonderful audience that you had brought in.  It is very rare to find an audience that is patient, smart, involved, intelligent and experienced all at once, and that craves knowledge.  Our talks helped us introspect on the technology that we have been practicing.  The ambiance was very motivating, with a lot of natural light and space.  Overall, I enjoyed every bit of it.  I am a little sad that I could not enjoy the food, as I was rushing back to the office.  I also wanted to take part in the fish bowl on Industry-Academic Co-op, but couldn’t.  I am sure a lot of people benefited from this program; in fact, I heard as much from many in the audience after the talk.

Thanks to Shiv Deepak for introducing DevCamp.
Thanks to Balaji Damodaran for inviting me to the DevCamp.
Thanks to Shaswat Nimesh for the photographs.

EFYTimes covered the event in a news article.


Thunderbird Battery Charger

June 30th, 2010


Wedding Wishes for Gopinath

June 17th, 2010

O good ones,
may you, every day,
with good thoughts,
earn good wealth,
bear good children,
and give each other good companionship.

ACE 5.6.7 does not compile with STLport in Win32 environment

June 14th, 2010

ACE 5.6.7 does not compile with STLport in a Windows environment (I used vc9 on Windows Server 2008).  The cause is the following code in the ACE header ACE_wrappers/ace/checked_iterator.h, which wrongly assumes that the iterator header always provides stdext::checked_array_iterator.  That class is a Microsoft extension found in Visual C++’s own <iterator>, not in STLport’s.  A PRF has already been submitted to the ACE mailing list (http://www.archivum.info/comp.soft-sys.ace/2008-07/00026/%5Bace-users%5D-Checked_iterator.h-problem-with-STLport..html)

# if defined (_MSC_VER) && (_MSC_FULL_VER >= 140050000)
// Checked iterators are currently only supported in MSVC++ 8 or better.
#  include <iterator>
# endif  /* _MSC_VER >= 1400 */

# if defined (_MSC_VER) && (_MSC_FULL_VER >= 140050000)
template <typename PTR>
stdext::checked_array_iterator<PTR>
ACE_make_checked_array_iterator (PTR buf, size_t len)
{
  return stdext::checked_array_iterator<PTR> (buf, len);
}
# else
template <typename PTR>
PTR
ACE_make_checked_array_iterator (PTR buf, size_t /* len */)
{
// Checked iterators are unsupported.  Just return the pointer to
// the buffer itself.
return buf;
}
# endif  /* _MSC_VER >= 1400 */

#endif  /* ACE_CHECKED_ITERATOR_H */

I needed a solution for this.  The easiest way is to test a macro that the STLport header sets explicitly and that no other STL implementation defines.  I chose the _STLP_ITERATOR macro, which is the include guard of the “stl/stlport/iterator” header shown below.

#ifndef _STLP_ITERATOR
#define _STLP_ITERATOR

# ifndef _STLP_OUTERMOST_HEADER_ID
#  define _STLP_OUTERMOST_HEADER_ID 0x38
#  include <stl/_prolog.h>
# endif

# ifdef _STLP_PRAGMA_ONCE
#  pragma once
# endif

#if defined (_STLP_IMPORT_VENDOR_STD)
# include _STLP_NATIVE_HEADER(iterator)
#endif /* IMPORT */

# ifndef _STLP_INTERNAL_ITERATOR_H
#  include <stl/_iterator.h>
# endif

# ifndef _STLP_INTERNAL_STREAM_ITERATOR_H
#  include <stl/_stream_iterator.h>
# endif

# if (_STLP_OUTERMOST_HEADER_ID == 0x38)
#  include <stl/_epilog.h>
#  undef _STLP_OUTERMOST_HEADER_ID
# endif

#endif /* _STLP_ITERATOR */

The solution is the following, where I have added the !defined(_STLP_ITERATOR) condition along with the check for the Visual Studio compiler version.  The same condition has to guard the function template as well, since that is where stdext::checked_array_iterator is actually used.

# if !defined(_STLP_ITERATOR) && defined (_MSC_VER) && (_MSC_FULL_VER >= 140050000)
// Checked iterators are currently only supported in MSVC++ 8 or better.
#  include <iterator>
# endif  /* _MSC_VER >= 1400 */

// With STLport, <iterator> has defined _STLP_ITERATOR by this point, so the
// unchecked fallback below is selected.
# if !defined(_STLP_ITERATOR) && defined (_MSC_VER) && (_MSC_FULL_VER >= 140050000)
template <typename PTR>
stdext::checked_array_iterator<PTR>
ACE_make_checked_array_iterator (PTR buf, size_t len)
{
  return stdext::checked_array_iterator<PTR> (buf, len);
}
# else
template <typename PTR>
PTR
ACE_make_checked_array_iterator (PTR buf, size_t /* len */)
{
  // Checked iterators are unsupported.  Just return the pointer to
  // the buffer itself.
  return buf;
}
# endif  /* _MSC_VER >= 1400 */

#endif  /* ACE_CHECKED_ITERATOR_H */

Goodu’s Pan Shop

May 30th, 2010


What you see is a paan shop that stood on டெய்டல்ஸ் Road some time ago.  The shop is no longer there; it has been converted into some restaurant.  Mr Goodu is the owner of this shop.  The way he advertised that he supplies on order impressed me a lot.  He has tried to write the English exactly as it is pronounced.  My compliments on his effort.  The English itself, though, is in a rather sorry state.  Hmm.. that is the Englishman’s problem, why should we care? :)

What he meant to say was:

PAN Shop
Order and Party Supply
Mr Goodu
Phone: 9977192747..

Update (19/07/2010): Goodu’s shop has reopened!


OMG OWS

May 23rd, 2010

OWS Spark plugs: These tiny fireworks have iridium tips that keep them firing in great health for a long, long time.  We call it the life-time plug.  Fit it & forget it; of course the mechanic will clean it when you give your car for servicing! :)

Recently my car, a Getz GVS 1.1 petrol, was treated with Bardahl engine flush and supplements, Bardahl transmission concentrate and Bardahl engine oil.  After this treatment, the engine was sluggish because of the very high viscosity of the Bardahl treatment.  When I consulted the service people, they promised that the performance would become much better once the Bardahl became a little lighter.  I drove the car for a thousand kilometers and could see some improvement in the way the engine responded, but I was not satisfied at all.

Then I fitted the Green Cotton replacement filter, purchased from www.petes.in, to my car.  The engine response started to get better, but was still not the best.  I took a ride of about 900 km, which included about 80 km of hill driving.  The ride proved that the free flow cotton filter is indeed working well.

Following that, I got the OWS Iridium plugs (4 nos) from www.Speedworks.in.  Oh My God, the car has never been as responsive as it is now.  I can feel the pickup boost whenever I put my foot on the accelerator.  The 3rd gear response of the engine has become pretty awesome.  I am really enjoying this now.

Details:
Green Cotton Filter: Rs 4200 [got online from Petes.in, Cochin]
OWS Spark plugs: Rs 755 x 4 = Rs 3050 [Speedworks, Next to Eldorado building, Nungambakkam high Rd, Chennai]

Freeflow Exhaust Silencer

April 28th, 2010

The silencer on my Thunderbird failed.  To get it replaced, I went to Raju, a mechanic at the Soundararajan mechanic shop on Taylor’s Road, Chennai.  He inspected the silencer and advised that it simply had to be replaced.  I then asked him, what if I fit one of these Freeflow silencers?  Here is the answer he gave:

What happens if a man screams at the top of his voice until his throat tears?  The throat is damaged permanently; in the same way, he said, fitting a Freeflow silencer will impair the engine’s performance and shorten its life!

Yes, it is true that a freeflow silencer increases power (BHP).  Equally, the mileage drops.  This is because more fuel enters the engine, burns and exits quickly.  Along with it, fuel that has not burnt properly also exits.  With a freeflow exhaust, complete combustion cannot be ensured.  So fuel is wasted and the mileage drops.  Moreover, the exhaust that comes out of a freeflow silencer (it need not be visible to the eye) does not comply with emission standards.

Only Coffee

April 11th, 2010

When travelling on the NH45 national highway, do not forget to try the coffee at “Only Coffee”, a roadside tea shop in Madurantakam.  The shop is exactly opposite “Highway Inn”.  While having coffee there yesterday, I wondered why they had chosen this spot.  Now I understand: apparently so that it is easy to point out.

Only now did I learn that Kumbakonam degree coffee could be this good.  A coffee costs Rs 10, but that seems a fair price considering the taste.

BSNL Broadband Connectivity Issue on Noisy Phone Lines

April 10th, 2010

If you use your BSNL line exclusively for broadband, you might not have a telephone attached to it.  I had connected my Netgear modem to the DSL/phone line splitter and left the other port unconnected.  Recently, when I noticed that the modem could not connect to the BSNL servers, I first thought the telephone line was dead.  To my surprise the line was fine, though it sounded a little noisy.  I lodged a complaint on the BSNL portal and, as usual, nothing much happened.  Then, by accident, I had to connect my telephone to the splitter to make a local call.  To my surprise, the Netgear modem managed to connect to the server this time.  So, the hypothesis is:

When the telephone line is noisy, attach the telephone to the splitter along with the modem to get connected to the BSNL servers.  Most likely, the reactive load that the telephone places on the line ends up conditioning the phase-modulated signals enough for the Netgear modem to connect to the servers.
