Apache Sqoop

Introduction to Sqoop –

Apache Sqoop is a tool to efficiently transfer bulk data between apache hadoop and structured datastores like RDBMS or EDWs. Sqoop can be used to import data from Hadoop and related eco-systems like Hive. Also, Sqoop can be used to import data from RDBMS and data marts.

To get the data into HDFS from RDBMS or vice versa, is a tedious task which is solved mostly by writing scripts. To solve this problem, we have Apache sqoop which has the capability to move data in and around the HDFS. This uses map reduce in the background and is also fault tolerant. We can achieve different types of data transfer like full load, partial load, incremental load with different types of file types and compression.

Using Apache Sqoop –

Connect to mysql DB –

>sqoop list-databases –connect jdbc:mysql://ms.towardsbetterworld.com –username retail_user –password dummy@123

To list tables in Sqoop –

>sqoop list-tables –connect jdbc:mysql://ms.towardsbetterworld.com/retail_db –username retail_user –password dummy@123

Running queries on source DB(eval) –

#you can use eval to run queries on source DB to see the result prior to doing anything.
>sqoop eval –connect jdbc:mysql://towardsbetterworld.com/retail_db –username retail_user –password dummy@123 –query “SELECT * FROM orders LIMIT 10”

Sqoop Import(using target-dir) –

#The below sqoop command will connect to the mysql database retail_db and will import the orders table at the target-dir location specified.
>sqoop import –connect jdbc:mysql://ms.towardsbetterworld.com:3306/retail_db –username retail_user –password dummy@123
–table orders –target-dir /usr/sourabh/sqoop_import/retail_db
/orders

Sqoop Import(using warehouse-dir) –

#When using the warehouse-dir, what you do is provide the base directory where the import should take place.
>sqoop import –connect jdbc:mysql://ms.towardsbetterworld.com:3306/retail_db –username retail_user –password dummy@123
–table orders –warehouse-dir /usr/sourabh/sqoop_import/retail_db

Sqoop import(deleting existing data) –

#delete-target-dir will delete the existing data in that table
>sqoop import \
–connect “jdbc:mysql://towardsbetterworld.com:3306/retail_db” \
–username retail_user \
–password dummy@123 \
–table order_items \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–delete-target-dir

Sqoop import(appending to existing data) –

#using –append will append data to the table
>sqoop import \
–connect “jdbc:mysql://towardsbetterworld.com:3306/retail_db” \
–username retail_user \
–password dummy@123 \
–table order_items \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–append

Sqoop import(importing as different file formats) –

#you can import in different file formats using
#1)– as-parquetfile
#2)– as-sequencefile
#3)– as-textfile
#4)– as-avrodatafile
>sqoop import \
–connect “jdbc:mysql://towardsbetterworld.com:3306/retail_db” \
–username retail_user \
–password dummy@123 \
–table order_items \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–delete-target-dir
— as-avrodatafile

Sqoop import(compression) –

#core-site.xml specifies the compression algorithm to be used which you can change in there.
>sqoop import \
–connect “jdbc:mysql://towardsbetterworld.com:3306/retail_db” \
–username retail_user \
–password dummy@123 \
–table order_items \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–delete-target-dir \
— compress \
— compression-codec org.apache.hadoop.io.compress.SnappyCodec

Sqoop Import(number of mappers) –

#When we want to change the number of mappers to be used, we can use num-mappers to change that.
>sqoop import \
–connect “jdbc:mysql://towardsbetterworld.com:3306/retail_db” \
–username retail_user \
–password dummy@123 \
–table order_items \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–delete-target-dir
–num-mappers 8
#this changes the number of mappers to 8

Sqoop Import(import specific columns) –

#First you need to find out the column names and to do that you can directly run the describe on that table via sqoop
>sqoop eval \
–connect jdbc:mysql://towardsbetterworld.com:3306/retail_db \
–username retail_user \
–password dummy@123 \
-e “DESCRIBE customers”

#Now you can mention the column names from that table which you want to import
>sqoop import \
–connect jdbc:mysql://towardsbetterworld.com:3306/retail_db \
–username retail_user \
–password dummy@123 \
–table customers \
–columns customer_id,customer_fname,customer_lname,\
customer_street,customer_city,customer_state,customer_zipcode \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–delete-target-dir

Sqoop Import(using where clause) –

#When you want to import data on some specific conditions and you want to use the where clause
>sqoop import –connect jdbc:mysql://towardsbetterworld.com:3306/retail_db –username retail_user –password dummy@123 –table orders –warehouse-dir /user/sourabh/sqoop_import/retail_db –delete-target-dir –where “order_status IN(‘COMPLETE’,’CLOSED’) AND order_date LIKE ‘2013-08%'”

Sqoop Import(import all tables) –

#If you want to import all tables from a DB
>sqoop import-all-tables \
–connect jdbc:mysql://towardsbetterworld.com:3306/retail_db \
–username retail_user \
–password dummy@123 \
–warehouse-dir /user/sourabh/sqoop_import/retail_db \
–autoreset-to-one-mapper \
–num-mappers 2

Sqoop(create hive table) –

#The below sqoop command will create a Hive table in the database retail
>create-hive-table will create an empty table
sqoop create-hive-table \
–connect jdbc:mysql://towardsbetterworld.com:3306/retail_db \
–username retail_user \
–password dummy@123 \
–table orders \
–hive-database retail

Sqoop import (in hive table) –

#Import in a hive table orders in retail db
>sqoop import \
–connect jdbc:mysql://towardsbetterworld.com:3306/retail_db \
–username retail_user \
–password itversity \
–table orders \
–hive-import \
–hive-database retail