{"id":1964,"date":"2018-08-01T01:08:39","date_gmt":"2018-08-01T01:08:39","guid":{"rendered":"http:\/\/capture.ccio.us\/?p=1964"},"modified":"2018-08-01T14:43:13","modified_gmt":"2018-08-01T14:43:13","slug":"sphere-contextually-related-content","status":"publish","type":"post","link":"https:\/\/capture.club\/portal\/2018\/08\/01\/sphere-contextually-related-content\/","title":{"rendered":"Case Study: Sphere and the Case of Contextually Related Content."},"content":{"rendered":"<body><p><\/p><img decoding=\"async\" class=\"aligncenter wp-image-1967 size-medium\" src=\"http:\/\/capture.ccio.us\/wp-content\/uploads\/2018\/07\/Screenshot-2018-07-31-17.40.59-300x201.png\" alt=\"Sphere: the Origins of MemeFinder\" width=\"300\" height=\"201\" loading=\"lazy\">\n<p>We began working with related content provider Sphere over a decade ago.\u00a0 Sphere was the nexus between thousands of \u201clong-tail\u201d bloggers (E.g. private individuals) and high-volume news sites like CNN, Time and the Wall St. 
Journal.\u00a0 Personally, I would say that it was easily one of my most exciting opportunities.\u00a0 Working there, you really felt like you had your finger on the pulse of current events, because a large part of our job was ensuring that our related content matched the story appropriately.<\/p>\n<p>Ultimately this was a bit trickier than you might expect.\u00a0 Roughly 90% or more of our results were beautifully accurate.\u00a0 We were using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Lucene\">Lucene<\/a> before it was cool to do so, and it was phenomenal at assisting us in determining relevancy.\u00a0 There were, however, some types of corpora for which contextual relevancy proved difficult to determine in real time.\u00a0 Travel articles, articles heavily laden with slang, and articles slim on content were among the most notorious culprits.<\/p>\n<p>This had actually been a thorn in the company\u2019s side since its inception, and all the tuning in the world of the generated Lucene query had no effect.<\/p>\n<p>The solution to this problem was to create\u00a0<a href=\"http:\/\/capture.ccio.us\/memefinder-no-dictionaries-needed\/\">MemeFinder<\/a>.<\/p>\n<h3>Thinking outside the virtual sandbox\u2026<\/h3>\n<p>So I was tasked with solving this problem, and I wondered how I would go about it. 
The primary problem was that the entity extraction was keying on words that were <a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_true_homonyms\">homonyms<\/a>, but beyond that, it was making what we see as a very common mistake in the search industry:\u00a0 it was focusing on commonality.\u00a0 Now, on the surface, it makes sense to focus on the common aspects between two text corpora.\u00a0 <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf\">Term frequency-inverse document frequency<\/a> (TF-IDF) is an industry-standard algorithm; although somewhat arcane, it remains effective in many cases, as does its younger brother, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\">BM25<\/a>.\u00a0 BM25 ranks documents using a probabilistic relevance model.\u00a0 Both of these algorithms, both industry standards, work well in many cases; however, there are edge cases where they simply fall short.<\/p>\n<p>So when I began creating MemeFinder, I approached it from the perspective of a linguist, rather than a computer scientist.<\/p>\n<p>Did I mention that I was an English\/Sociology major long, long ago?\u00a0 It\u2019s true, and up until I wrote MemeFinder it was my dirty little secret that you just didn\u2019t talk about in technology circles.\u00a0 I\u2019ve actually been writing code since I was a kid, back in the \u201970s when most kids didn\u2019t write code, but I ultimately decided to pursue a career studying my native language.<\/p>\n<p>As it turns out, it was one of the best decisions I ever made.<\/p>\n<p>It was that ability to think like an English major that allowed me to break the language down into its various components: pronouns, modal auxiliaries, articles, prepositions and so on.\u00a0 Once I had those gathered, I discarded most of them, at least initially.\u00a0 MemeFinder also discards much of the common ontology.\u00a0 There are over 100,000 words in the English language, but only about 10,000 are used on a regular 
basis.\u00a0 That\u2019s really not a large data set, when you think about it.\u00a0 The problem, though, was the fact that it was English.\u00a0 English is the Borg of languages, absorbing whatever it finds useful and creating words where there were none before.\u00a0 Oh, and people are generally horrible at spelling and writing.\u00a0 They DO, however, take note of the important aspects of a document.\u00a0 Thus, the keys were ultimately found in punctuation, grammar and semantics.\u00a0 MemeFinder focused on the differences, rather than the commonality, and generated a <em>\u2018memeotype\u2019<\/em> from this distinctiveness, which could then be turned into a distinctive Lucene query.<\/p>\n<p>Parsing the document and generating the query only represented part of the solution, though.\u00a0 The other aspects were speed and resource consumption.<\/p>\n<p>That is: it had to be fast and lean.<\/p>\n<p>Sphere processed several hundred million requests per day.\u00a0 There was really no time to spare in parsing documents.\u00a0 MemeFinder\u2019s architecture is both thread-safe and largely static.\u00a0 This means that once instantiated, it runs extremely fast, processing an average 600-1000 word article in under 100 milliseconds.<\/p>\n<p>Yet speed and linguistic ninja tricks aside, perhaps its most distinctive aspect was its entity extraction, and the manner in which it compared corpora.<\/p>\n<h3>The Paradox of Related Content and Diminishing Returns<\/h3>\n<p>One thing you might notice about the typical search engine:\u00a0 there is a \u201csweet spot\u201d of query terms, where too few return irrelevant documents, and too many generally return nothing closely related.\u00a0 MemeFinder is different in this respect because it was designed to compare two bodies of text, rather than match a fragment.\u00a0 Thus the more query terms MemeFinder has to work with, the more precise the results.<\/p>\n<p>MemeFinder was put into 
production at Sphere and solved the contextual relevance issue.<\/p>\n<\/body>","protected":false},"excerpt":{"rendered":"<p>We began working with related content provider Sphere over a decade ago.\u00a0 Sphere was the nexus between thousands of \u201clong-tail\u201d bloggers (E.g. private individuals) and high-volume news sites like CNN, Time and the Wall St. Journal.\u00a0 Personally, I would say that it was easily one of my most exciting opportunities.\u00a0 Working there, you really felt [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"pagelayer_contact_templates":[],"_pagelayer_content":"","footnotes":""},"categories":[],"tags":[],"class_list":["post-1964","post","type-post","status-publish","format-standard","hentry"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts\/1964","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/comments?post=1964"}],"version-history":[{"count":0,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts\/1964\/revisions"}],"wp:attachment":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/media?parent=1964"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/categories?post=1964"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/tags?post=1964"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}