Tuesday, May 1, 2018

Using the XML SerDe in Hive for Exploding Nested XML Elements

     This article will give a detailed account of how to install and use the XML Serde, by Dmitry Vasilenkoin Hive located in github here.

     Detailed instructions for the installation can be found here, but I'm going to take you through the step I specifically used.

1. Download the latest version of the XML SerDe jar from here.
2. Pick a directory on the linux OS, where the Hive server is running on, and upload the jar to it. In this example I used /usr/lib/hive/lib .
3. Change the owner of this dir to hive: chown /usr/lib/hive/lib hive:hive
4. Make sure the jar is executable: chmod +x /usr/lib/hive/lib/hivexmlserde-
5. Set the "Hive Auxiliary JARs Directory" Hive configuration in Cloudera manager to /usr/lib/hive/lib
6. There may be sentry settings, if you are using it, that need to be made and instructions on how to do that are in the link provided.

     The XML SerDe allows you to query an XML file as if it was a relational table. This is a perfect example of the term "schema on read" that gets tossed around when discussing the benefits of Hadoop. In fact, you can have many Hive tables that reference a single XML document, each with a different view of that file.
Say we had an XML file that looks like this:

     Now from looking at this data we see a few elements that can be separated out into their own tables:  <QUESTION><CUSTOMERLEVELDATA><ANALYSIS><DEMOGRAPHICS><HCAHPS><COMMENTS>

     First thing we need to do is create Hive tables that will form the basis for all of these different elements. Let's use the answers submitted in the XML doc as the basis for our table. This will be the base for building views for these elements :<CUSTOMERLEVELDATA><ANALYSIS><DEMOGRAPHICS><HCAHPS>

DROP TABLE IF EXISTS raw_answers_xml;
CREATE EXTERNAL TABLE sampledb.raw_answers_xml( 
customerleveldata array< 
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
WITH serdeproperties ( "column.xpath.customerleveldata"="/DATA_EXPORT/CUSTOMERLEVELDATA" ) 
stored AS 
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
location "/sources/sampledb/raw" 
tblproperties ( "xmlinput.start"="<DATA_EXPORT>", "xmlinput.end"="</DATA_EXPORT>" 

     The Hive table we want to create here is called raw_answers_xml and encompasses all of the mentioned elements. You'll notice that we translate the <CUSTOMERLEVELDATA> element into an array, a struct for the child elements <SURVEY_ID><CLIENT_ID><SERVICE><RECDATE><DISDATE>, an array for the
<DEMOGRAPHICS> element, a struct for it's child element <RESPONSE> and an array for another child element <HCAHPS> and its relevant child elements. I'd like to point out a few things here that should standout. The property column.xpath.customerleveldata takes xpath queries which you can use to traverse the XML document. In this instance we want to pull data in elements starting at <DATA_EXPORT><CUSTOMERLEVELDATA>. The next properties to touch upon are xmlinput.start and xmlinput.end. These tell the SerDe what should be the start and end elements in the XML document for capturing data. So make sure what ever elements you want to pull into the Hive table are within these start and end points.

     The first element we'll look at is the <CUSTOMERLEVELDATA> element, since this element contains non-repeating child elements (XML Array) that we want to break out into its own table (we'll start off simple) i.e. :


     This will form the basis for our answers_survey view, which will be our normalized tuple for containing unique instances of a survey response. The view we build to represent this part of the XML document will be called answers_survey and the code to generate it is:

DROP VIEW IF EXISTS answers_survey;
CREATE VIEW answers_survey AS 
 SELECT main_cols.survey_id, 
        Unix_timestamp( Concat(main_cols.recdate,' 00:00:00' ) )*1000 AS recdate,
        Unix_timestamp( Concat(main_cols.disdate,' 00:00:00' ) )*1000 AS disdate,
        input__file__name AS sourcefile
 sampledb.raw_answers_xml xml 
lateral VIEW explode(xml.customerleveldata) atable   
lateral VIEW inline (array(atable.atable)) ex_cols 
lateral VIEW inline (array(ex_cols.customerleveldata)) main_cols; 

     This view traverses the array declared in the raw_answers_xml table and explodes it so we can view the data in rows. But we're not done. All the explode does is handle the array, we still have to deal with the underlying structs. These are represented in JSON format, the first being the customerleveldata struct. To parse inside of this we use an inline to then access the struct underneath it. Then use another inline on that struct to get our columns and project them in our view.
Now that we showed a simple example, lets jump to one where we have repeating rows for a single survey. We will use the same base table raw_answers_xml and this time tackle the answers given under the <DEMOGRAPHICS> element. This view will be named answers_demographics and the code for it is as follows:

DROP VIEW IF EXISTS answers_demographics;
CREATE VIEW IF NOT EXISTS answers_demographics AS
  SELECT main_cols.survey_id,
         input__file__name AS sourcefile
  FROM   sampledb.raw_answers_xml xml 

lateral VIEW explode(xml.customerleveldata) atable AS atable 
lateral VIEW inline (array(atable.atable)) AS ex_cols 
lateral VIEW inline (array(ex_cols.customerleveldata)) AS main_cols 
lateral VIEW explode(main_cols.demographics) dtable AS dtable 
lateral VIEW inline (array(dtable.dtable)) AS d_cols 
lateral VIEW inline (array(d_cols.response)) AS r_cols;

     In this view we are traversing 2 arrays, the top level array for <CUSTOMERLEVELDATA> and the array for <DEMOGRAPHICS>. So we first explode <CUSTOMERLEVELDATA> then parse out the struct, which contains the survey_id that we can use to join to the answers_survey view we created in the previous example. Then we explode <DEMOGRAPHICS> to get to the struct to parse out all of the answers given for demographics.

     So those are 2 examples of how to parse out tables from the XML document using the SerDe. One of the tables representing a single instance of a survey, the other being multiple answers given for a survey. The best thing about this is we can derive all of our tables/views from the XML file(s) itself with transformations on the fly and without having to materialize on the cluster.

No comments:

Post a Comment