MemeFinder: No Dictionaries Needed
When I was working for the related-content provider Sphere, I was presented with an interesting problem. We were using Apache Lucene back before using Lucene was cool, and overall it worked very well. However, there were some contextual gotchas causing us trouble. Homonyms, in particular, were causing unrelated results to appear. For a related-content provider, this was really beyond the pale. Thus I was tasked with solving the problem.
Before moving on, there’s something you should know: I was an English/Sociology major in college. I grew up around technology and have written code most of my life, but I left college with a desire to write fiction. Over the next decade I wrote several novels, published some short stories, and signed with some agents, but at the end of the day I wasn’t willing to write genre fiction, the fiction that pays. So when the internet heated up in 1994-95, I decided to leverage those years of technology and give up being a writer. It was a difficult decision, and I still daydream fancifully of returning to that life when I retire, but I digress.
Despite the digression, my background in linguistics is important here.
When I was tasked with solving this contextual problem, I decided to approach it from the perspective of a linguist rather than a computer scientist. I felt this would give the application a different approach from the ones I had studied and ultimately found flawed, at least insofar as my purposes were concerned.
What I came up with is a rather nightmarish hybridization of statistics and linguistics, but it has proven over the years to be quite effective. Rather than relying on semantics, MemeFinder relies on language rules and statistical anomalies to generate a genotype for a text corpus.
I wanted MemeFinder to be fast, and I wanted it to be able to “learn”. I wanted it to be able to distinguish the common lexicon from statistically interesting fragments. To accomplish this, I described the ‘rules’ of the language to MemeFinder at the most rudimentary level.
From a statistical point of view, the English language isn’t really that difficult. There are roughly 100,000 words in total, and of those, only about 10% are used on a daily basis.
And in a 1,000-word document, MemeFinder is generally interested in less than 1% of the words.
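The post doesn’t reveal MemeFinder’s internals, but the core idea of separating the common lexicon from statistically interesting fragments can be sketched as a frequency-anomaly test: flag words whose in-document frequency is far above their background rate in the general lexicon. The baseline table, floor value, and threshold below are all hypothetical placeholders, not the actual MemeFinder rules:

```python
from collections import Counter

# Hypothetical baseline: relative frequencies of a few common words in the
# general lexicon (a real system would cover the ~10,000 daily-use words).
BACKGROUND = {
    "the": 0.07, "of": 0.03, "and": 0.028, "a": 0.023, "in": 0.02,
    "language": 0.0004, "problem": 0.0003,
}
DEFAULT_BG = 1e-6  # assumed frequency floor for words absent from the baseline


def interesting_terms(text, ratio_threshold=50.0):
    """Return terms whose in-document frequency is anomalously high
    relative to the background lexicon, sorted by anomaly ratio."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w)
    total = sum(counts.values())
    scored = {}
    for word, n in counts.items():
        ratio = (n / total) / BACKGROUND.get(word, DEFAULT_BG)
        if ratio >= ratio_threshold:
            scored[word] = ratio
    return sorted(scored, key=scored.get, reverse=True)
```

With a filter like this, frequent function words such as "the" fall below the threshold while rare, repeated terms float to the top, which matches the claim that under 1% of a document is actually interesting.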
It took me about nine months to write the first version of MemeFinder, but it was implemented in our system and solved the contextual problem quite nicely, simply by generating a more relevant Lucene query.
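How MemeFinder actually built its queries isn’t described, but turning a handful of weighted terms into a sharper Lucene query can be as simple as emitting boosted clauses with Lucene’s standard `field:term^boost` syntax. The function name, field name, and weights here are illustrative assumptions:

```python
def build_lucene_query(weighted_terms, field="content"):
    """Turn (term, weight) pairs into a Lucene query string.

    Each term becomes a clause on the given field; Lucene's ^ operator
    boosts a clause's contribution to the relevance score.
    """
    clauses = [f"{field}:{term}^{weight:.1f}" for term, weight in weighted_terms]
    return " OR ".join(clauses)
```

A query built from only the statistically interesting terms, boosted by how anomalous they are, naturally sidesteps homonym noise better than matching on the whole document.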
Since then, I’ve used MemeFinder across a variety of vertical indices without having to “re-train.” All I have to do is turn it loose, and it provides me with the relevant pieces of a document that I need to match it to other documents.
And at the end of the day, that is what Entity Extraction, Machine Learning, and search are all about.