Saturday, July 1, 2017

Creating a Sqoop Metastore in Postgres

     If you are using sqoop to load data from your source system relational databases into Hadoop, you probably want a way to track the deltas. This way you won't have to do a full load every time or build some delta mechanism on your own. The sqoop metastore provides a way for your sqoop jobs to store a value so that the next time the job kicks off, it will load from after that value. So if you were loading based on some incremental value such as an auto increment key, and the last value you loaded was 15, the next time the job ran it would add a autoincrementkey>15 to the where clause when selecting. 

    If your Hadoop vendor is Cloudera, they will want you to use a Hyper SQL database for your sqoop metastore. It's ok and all, but if you already have MySQL or Postgres installed on your cluster, its just 1 more database for you to maintain. If you don't want to use Hyper SQL as your metastore, and have Postgres installed, you can follow the steps in this post for getting your sqoop metastore up and running on Postgres. (Just keep in mind, that if something happens, Cloudera will not help you since this currently not supported).




  Log into the CLI of the server running Postgres and su as postgres. 

   su postgres

  Then create a database that will act as our sqoop metastore.

   psql -c "create database sqoop"
  
   While still in the Postgres shell, create a new Role to operate under: 

  CREATE ROLE sqoop WITH PASSWORD 'test';

    Change the owner of the sqoop database to sqoop:


     ALTER DATABASE sqoop OWNER TO sqoop;

         Make sure the role can login:

         ALTER ROLE sqoop WITH LOGIN;

         Verify:
    
     \list
    
        Now connect to the sqoop database:

          \connect sqoop
      
          Create a table:

CREATE TABLE SQOOP_ROOT (
      VERSION INT,
      PROPNAME VARCHAR(128) NOT NULL,
      PROPVAL VARCHAR(256),
     CONSTRAINT SQOOP_ROOT_UNQ UNIQUE (VERSION, PROPNAME)
     );

          Insert storage data into new table:

            INSERT INTO SQOOP_ROOT
        VALUES(
          NULL,
          'sqoop.hsqldb.job.storage.version',
          '0';

        Insert job info into new table:

      INSERT INTO SQOOP_ROOT
VALUES(
    0,
    'sqoop.hsqldb.job.info.table',
    'SQOOP_SESSIONS'
          );
         Verify:

       SELECT * FROM SQOOP_ROOT;
    
       
              Create table SQOOP_SESSIONS. 
     
       CREATE TABLE SQOOP_SESSIONS(
         JOB_NAME VARCHAR(64) ,
         PROPNAME VARCHAR(128) ,
         PROPVAL VARCHAR(1024) ,
         PROPCLASS VARCHAR(32) ,
        CONSTRAINT SQOOP_SESSIONS_unq UNIQUE
(JOB_NAME ,
PROPNAME ,
PROPCLASS)
          );
           
        Grant table perms:

        GRANT ALL PRIVILEGES ON SQOOP_ROOT TO sqoop;
        GRANT ALL PRIVILEGES ON SQOOP_SESSIONS TO sqoop;
''
          Verify port postgres is running on (In this example it was running on 5432):

SELECT * FROM pg_settings WHERE name = 'port';


          Log into Cloudera Manager and go into the Sqoop 1 Client:

Figure 1. Cloudera Manager Sqoop Client
        


         In the Sqoop 1 Client screen, click on configuration and filter on:
       
         Sqoop 1 Client Client Advanced Configuration Snippet (Safety Valve) for sqoop-conf/sqoop-site.xml

          In that section add these 4 configs: 

                    Name: sqoop.metastore.client.enable.autoconnect
                   Value:  false
                   Description: If true, Sqoop will connect to a local metastore for job                                                   management when no other metastore arguments are provided.

                   Name: sqoop.metastore.client.autoconnect.url
                   Value: jdbc:postgresql://<your postgres server>:5432/sqoop
                   Description: Connection to the sqoop metastore database

                   Name: sqoop.metastore.client.autoconnect.username
                   Value: sqoop
                   Description: Sqoop metastore database user name

                   Name: sqoop.metastore.client.autoconnect.password
                   Value: test
                   Description: Sqoop metastore database password

Figure 2. Sqoop Client Config 

        Save changes and go back to Cloudera Manager's main screen. You should see a new icon next to the Sqoop 1 Client:

Figure 3. Stale Sqoop Config

         Click that and then click deploy changes on the next screen. You should now be good to go. Make sure a postgres JDBC driver exists in: /var/lib/sqoop on whatever server is running your sqoop jobs.  


        Create a test job to make sure everything is working. Here is a sample:

    sqoop job --create sample_job --meta-connect "jdbc:postgresql://<postgres server>:5432/sqoop?user=sqoop&password=test" -- import --connect jdbc:oracle:thin://@<oracle server>:1521/servicename --username user --password pass
--table SUPPORT.APC_CODE --as-avrodatafile --target-dir /staging/hpm_support/apc_code --check-column mycolumn --incremental append --last-value 0 m 1


     Verify job was created:

   select * from SQOOP_SESSIONS;


      To run the sample job:

       sqoop job --exec sample_job --meta-connect "jdbc:postgresql://<postgres server>:5432/sqoop?user=sqoop&password=test"


     Remove sample job:

   sqoop job --delete sample_job --meta-connect "jdbc:postgresql://<postgres server>:5432/sqoop?user=sqoop&password=test"


     Your Postgres metastore should now be all set to go!

   









    

No comments:

Post a Comment