{"id":1575,"date":"2018-05-03T22:07:43","date_gmt":"2018-05-03T22:07:43","guid":{"rendered":"http:\/\/capture.ccio.us\/?p=1575"},"modified":"2018-05-03T22:07:43","modified_gmt":"2018-05-03T22:07:43","slug":"extracting-values-element-attributes-using-jsoup-javascript-stage","status":"publish","type":"post","link":"https:\/\/capture.club\/portal\/2018\/05\/03\/extracting-values-element-attributes-using-jsoup-javascript-stage\/","title":{"rendered":"Extracting Values from Element Attributes using Jsoup and a JavaScript Stage"},"content":{"rendered":"<body><p>While Fusion comes with built-in Jsoup selector functionality, it is limited in its extraction capability. If you want to extract attribute values, in particular values that contain special characters or spaces, you\u2019ll need to create a custom JavaScript stage and implement the extraction there.<\/p>\n<h2>To accomplish this:<\/h2>\n<p>1) Create a custom JavaScript stage and place it directly after the Apache Tika Parser. In the Apache Tika Parser stage, make sure that both \u201cReturn parsed content as XML or HTML\u201d and \u201cReturn original XML and HTML instead of Tika XML Output\u201d are checked.<br>\n2) Add your code. For the purposes of this article, I\u2019ve created the following example. 
Depending on what you\u2019re trying to accomplish, your code may vary:<\/p>\n<blockquote>\n<pre>function(doc){\n    var File = java.io.File;\n    var Iterator = java.util.Iterator;\n    var Jsoup = org.jsoup.Jsoup;\n    var Document = org.jsoup.nodes.Document;\n    var Element = org.jsoup.nodes.Element;\n    var Elements = org.jsoup.select.Elements;\n    var content = doc.getFirstFieldValue(\"body\");\n    var jdoc = org.jsoup.nodes.Document;\n    var e = java.lang.Exception;\n    var div = org.jsoup.nodes.Element;\n    var img = org.jsoup.nodes.Element;\n    var iter = java.util.Iterator;\n    var divs = org.jsoup.select.Elements;\n    try {\n        jdoc = Jsoup.parse(content);\n        divs = jdoc.select(\"div\");\n        iter = divs.iterator();\n        div = null; \/\/ stays null unless a matching div is found\n        while (iter.hasNext()) {\n            var candidate = iter.next();\n            if (candidate.attr(\"id\").equals(\"featured-img\")) {\n                div = candidate; \/\/ keep only the matching div\n                break;\n            }\n        }\n        if (div != null) {\n            img = div.child(0);\n            logger.info(\"SRC: \" + img.attr(\"src\"));\n            logger.info(\"ORIG FILE: \" + img.attr(\"data-orig-file\"));\n            doc.addField(\"post_image\", img.attr(\"src\") + \" | \" + img.attr(\"data-orig-file\"));\n        } else {\n            logger.warn(\"Div was null\");\n        }\n    } catch (e) {\n        logger.error(e);\n    }\n    return doc;\n}\n<\/pre>\n<\/blockquote>\n<p>Let\u2019s break down what is happening here:<br>\n1) Declare the Java classes to be used.<\/p>\n<blockquote>\n<pre>\nvar File = java.io.File;\nvar Iterator = java.util.Iterator;\nvar Jsoup = org.jsoup.Jsoup;\nvar Document = org.jsoup.nodes.Document;\nvar Element = org.jsoup.nodes.Element;\nvar Elements = org.jsoup.select.Elements;\n<\/pre>\n<\/blockquote>\n<p>2) Next, declare the JavaScript variables to be used. 
<strong>Note that we assign the content variable to be the content pulled by the Apache Tika Parser.<\/strong><\/p>\n<blockquote>\n<pre>var content = doc.getFirstFieldValue(\"body\");\nvar jdoc = org.jsoup.nodes.Document; \/\/ named jdoc so it does not shadow the pipeline document, doc\nvar e = java.lang.Exception;\nvar div = org.jsoup.nodes.Element;\nvar img = org.jsoup.nodes.Element;\nvar iter = java.util.Iterator;\nvar divs = org.jsoup.select.Elements;\n<\/pre>\n<\/blockquote>\n<p>3) Next, we pull the \u201cdiv\u201d elements out and look for one with an ID of \u201cfeatured-img.\u201d Once we find it, we store it in div, \u2018break\u2019 out of the iteration, and move on. <strong>Note: I\u2019m using this type of example to illustrate how to work with element attribute values that contain special characters or spaces. Jsoup\u2019s selector syntax doesn\u2019t really play well with these types of key names.<\/strong><\/p>\n<blockquote>\n<pre>jdoc = Jsoup.parse(content); \/\/ parse the document\ndivs = jdoc.select(\"div\"); \/\/ select all the 'div' elements\niter = divs.iterator(); \/\/ get an iterator for the list\ndiv = null; \/\/ stays null unless a matching div is found\nwhile (iter.hasNext()) { \/\/ iterate over the elements\n    var candidate = iter.next();\n    if (candidate.attr(\"id\").equals(\"featured-img\")) { \/\/ if we find a match, assign and move on.\n        div = candidate;\n        break;\n    }\n}\n<\/pre>\n<\/blockquote>\n<p>4) Finally, we set the values in the document. 
I\u2019ve added some extra logging here, which can ultimately be removed.<\/p>\n<blockquote>\n<pre>if (div != null) {\n    img = div.child(0); \/\/ get the image element\n    logger.info(\"SRC: \" + img.attr(\"src\"));\n    logger.info(\"ORIG FILE: \" + img.attr(\"data-orig-file\"));\n    doc.addField(\"post_image\", img.attr(\"src\") + \" | \" + img.attr(\"data-orig-file\")); \/\/ set the values in the PipelineDocument\n} else {\n    logger.warn(\"Div was null\");\n}\n<\/pre>\n<\/blockquote>\n<p><strong>And that\u2019s all there is to it! Happy Extracting!<\/strong><\/p>\n<\/body>","protected":false},"excerpt":{"rendered":"<p>While Fusion comes with built-in Jsoup selector functionality, it is limited in its extraction capability. If you want to extract attribute values, in particular values that contain special characters or spaces, you\u2019ll need to create a custom JavaScript stage and implement the extraction there. 
To accomplish this: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"pagelayer_contact_templates":[],"_pagelayer_content":"","footnotes":""},"categories":[],"tags":[],"class_list":["post-1575","post","type-post","status-publish","format-standard","hentry"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts\/1575","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/comments?post=1575"}],"version-history":[{"count":0,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/posts\/1575\/revisions"}],"wp:attachment":[{"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/media?parent=1575"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/categories?post=1575"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/capture.club\/portal\/wp-json\/wp\/v2\/tags?post=1575"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}