{"id":1430,"date":"2018-04-26T15:48:04","date_gmt":"2018-04-26T15:48:04","guid":{"rendered":"http:\/\/capture.ccio.us\/?p=1430"},"modified":"2018-04-26T15:48:04","modified_gmt":"2018-04-26T15:48:04","slug":"making-lucidworks-fusion-work-custom-parsing-index-pipelines","status":"publish","type":"post","link":"https:\/\/capture.club\/portal\/2018\/04\/26\/making-lucidworks-fusion-work-custom-parsing-index-pipelines\/","title":{"rendered":"Making Lucidworks Fusion Work For You: Custom Parsing and Index Pipelines"},"content":{"rendered":"<body><p><\/p>\n<div><img decoding=\"async\" class=\"alignnone size-full wp-image-1434\" src=\"http:\/\/capture.ccio.us\/wp-content\/uploads\/2018\/04\/index-pipeline.png\" alt=\"index-pipeline\" width=\"1200\" height=\"300\" loading=\"lazy\"><\/div>\n<p>Out-of-the-box, <a href=\"https:\/\/lucidworks.com\" target=\"_blank\">Lucidworks Fusion<\/a>\u00ae does a great many tasks remarkably well. Every now and then, however, you come across an issue that may take a little extra effort to index. What I\u2019ll describe below, in this particular case, is a way to circumvent the Fusion parser and spin up your own custom <a href=\"https:\/\/doc.lucidworks.com\/fusion-pipeline-javadocs\/2.4\/com\/lucidworks\/apollo\/common\/pipeline\/PipelineDocument.html#getFields-java.lang.String-\" target=\"_blank\">PipelineDocument<\/a> in an <a href=\"https:\/\/doc.lucidworks.com\/fusion\/2.4\/REST_API_Reference\/Index-Pipelines-API.html#submit-a-set-of-documents-to-an-index-pipeline\" target=\"_blank\">Index Pipeline Stage<\/a> in Fusion.<br>\nSo, first things first. We need to override the Fusion parser framework. 
We\u2019ll do this by changing two things in our POST to the index pipeline.<br>\nFirst, we\u2019ll change the <b>Content-Type<\/b> header<br>\nFrom:<\/p>\n<pre>application\/vnd.lucidworks-document<\/pre>\n<p>To:<\/p>\n<pre>application\/octet-stream<\/pre>\n<p>This tells Fusion that what we\u2019re sending is a raw stream, instead of a PipelineDocument JSON object. This is critical, because otherwise the parser will try to process your input itself, which we want to avoid in this case.<br>\nNext, we set the \u2018skipParsing\u2019 query param to \u2018true\u2019. Our POST URL will then look like this:<\/p>\n<pre>pipelines\/A_Nested_Documents\/collections\/_test_\/index?echo=true&amp;skipParsing=true<\/pre>\n<div><img decoding=\"async\" class=\"alignnone size-full wp-image-1442\" src=\"http:\/\/capture.ccio.us\/wp-content\/uploads\/2018\/04\/Screenshot-2018-04-26-10.14.52.png\" alt=\"screenshot-2018-04-26-10-14-52\" width=\"965\" height=\"209\" loading=\"lazy\"><\/div>\n<p>You\u2019ll notice that I\u2019ve also added \u2018echo=true\u2019. This means a successful request will return 200 rather than 204 (No Content). This is entirely optional. You can also add \u2018simulate=true\u2019 if you want to test your documents without actually adding them to the collection.<\/p>\n<h3>Now to the code\u2026<\/h3>\n<p>From the code side, the first thing you\u2019ll want to do is create a new Index Pipeline in Fusion. 
You\u2019ll give it a name (ID) and then you\u2019ll add a JavaScript stage.<\/p>\n<div><img decoding=\"async\" class=\"alignnone size-full wp-image-1444\" src=\"http:\/\/capture.ccio.us\/wp-content\/uploads\/2018\/04\/Screenshot-2018-04-26-10.40.34.png\" alt=\"screenshot-2018-04-26-10-40-34\" width=\"900\" height=\"448\" loading=\"lazy\"><\/div>\n<p>Next, you\u2019ll want to declare the Java classes you\u2019ll be using:<br>\n<code><br>\nif (doc !== null &amp;&amp; doc.getFirstFieldValue(\"_raw_content_\") !== null) {<br>\nvar ex = java.lang.Exception;<br>\nvar pipelineDoc = com.lucidworks.apollo.common.pipeline.PipelineDocument;<br>\nvar outdocs = java.util.ArrayList;<br>\nvar ObjectMapper = org.codehaus.jackson.map.ObjectMapper;<br>\nvar SerializationConfig = org.codehaus.jackson.map.SerializationConfig;<br>\nvar String = java.lang.String;<br>\nvar e = java.lang.Exception;<br>\nvar base64 = java.util.Base64;<br>\nvar ArrayList = java.util.ArrayList;<br>\nvar mapper = new ObjectMapper();<br>\nvar pretty = true;<br>\nvar result = new String(\"\");<br>\noutdocs = new ArrayList();<br>\npipelineDoc = new com.lucidworks.apollo.common.pipeline.PipelineDocument();<\/code><br>\nNotice that I check that the doc is not \u2018null\u2019 and that the _raw_content_ field is present BEFORE my declarations. 
This is because there is no reason to declare anything if the document isn\u2019t there, or your content isn\u2019t there.<br>\n<code><br>\ntry {<br>\nvar rawlist = doc.getFieldValues(\"_raw_content_\");<br>\nvar raw = rawlist[0];<br>\n\/\/ logger.info(\"Raw class: \" + raw.getClass().getSimpleName());<br>\n\/\/ will be a byte[]<br>\nvar stdin = new String(raw, \"UTF8\"); \/\/ turn the stream into a string.<br>\nvar json = JSON.parse(stdin); \/\/ parse with the JavaScript JSON parser.<br>\n\/\/ logger.info(\"JSON: \" + stdin);<br>\nmapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);<br>\nvar children = json[0].fields; \/\/ get fields for the parent.<br>\n\/\/ set our PipelineDocument ID<br>\npipelineDoc.setId(json[0].id);<br>\nlogger.info(\"document ID: \" + doc.getId());<\/code><br>\nHere we\u2019re taking the UTF-8 encoded byte array and turning it into a java.lang.String, then using the native JSON parser to create a JavaScript object. In this case, there was an issue with the Java Jackson JSON parser and polymorphic documents, so we wanted to avoid the parser and handle it ourselves. At the end of it all, we set the ID for our new PipelineDocument.<br>\nNow comes the heavy lifting part. We have our JSON object, and we\u2019re going to plug it into our PipelineDocument. In this case, we\u2019re adding child documents denoted by the \u201c_childDocuments_\u201d field. You can store these in a variety of ways, so you\u2019ll want to consider that before you set out to code your project. In this instance, I\u2019m embedding the PipelineDocuments into my parent. You could also create separate documents and bind them together using the Taxonomy API in Fusion, but I digress.<br>\n<code><br>\nif (children !== null) { <\/code><br>\n\/**<br>\n* Here is the heavy lifting. Now that we have a JSON object,<br>\n* we will transform it into a PipelineDocument. 
This<br>\n* document will be fashioned after our collection schema.<br>\n*\/<br>\nfor (var i = 0; i<br>\n} catch (ex) {<br>\nlogger.error(ex.getLocalizedMessage());<br>\n}<br>\n\/\/ if all goes well, we\u2019ll return our new doc right here.<br>\nlogger.info(\"RETURN BRANCH 1\");<br>\noutdocs.add(pipelineDoc);<br>\nreturn outdocs;<br>\n} else {<br>\nlogger.info(\"Doc ID was null\");<br>\n\/\/ if something was wrong with the doc, or it was null, we return here.<br>\nreturn doc;<br>\n}<br>\n\/\/ otherwise, we return here.<br>\nreturn doc;<br>\n}<\/p>\n<p>And there it is! Your document has now been indexed into Fusion\/Solr.<br>\nYou can find the complete code for this post here: <a href=\"https:\/\/github.com\/kmcowan\/Lucidworks_Fusion_NashornJS\/blob\/master\/CustomParseNestedObjectIndexPipelineStage.js\" target=\"_blank\">CustomIndexStageParsing<\/a>, and sample JSON here: <a href=\"https:\/\/github.com\/kmcowan\/Lucidworks_Fusion_NashornJS\/blob\/master\/nested_children.json\" target=\"_blank\">Nested.json<\/a><br>\nHappy Searching!<\/p>\n<\/body>","protected":false},"excerpt":{"rendered":"<p>Out-of-the-box, Lucidworks Fusion\u00ae does a great many tasks remarkably well. Every now and then, however, you come across an issue that may take a little extra effort to index. 
What I\u2019ll describe below, in this particular case, is a way to circumvent the Fusion parser and spin up your own custom PipelineDocument in an Index [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"pagelayer_contact_templates":[],"_pagelayer_content":"","footnotes":""},"categories":[],"tags":[],"class_list":["post-1430","post","type-post","status-publish","format-standard","hentry"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts\/1430","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/comments?post=1430"}],"version-history":[{"count":0,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts\/1430\/revisions"}],"wp:attachment":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/media?parent=1430"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/categories?post=1430"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/tags?post=1430"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}