Since I found very few practice questions available on the internet for the CCA 175 (Hadoop & Spark Developer) exam, I have put together a six-question practice exam, with solutions provided in the comment section and reproduced at the end of this post. If you can complete it in under an hour, you are ready to attempt the CCA 175 exam; otherwise you need more practice.
Prerequisites for the questions:
Import data from the orders table in Parquet file format and save it to HDFS path /user/rj_example/parquet_data/orders
Import data from the customers table and save it to HDFS path /user/rj_example/data/customers
Import data from the customers table in Avro file format and save it to HDFS path /user/rj_example/avro_data/customers
Import data from the categories table and save it to HDFS path /user/rj_example/data/categories
Import data from the products table and save it to HDFS path /user/rj_example/data/products with '\t' as the field separator
Create local directory 'rj_example'
Copy data from /user/rj_example/data/products to the local directory 'rj_example'
Create Hive database rajan_cca175_problems
STEPS for these prerequisites:
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --as-parquetfile --target-dir /user/rj_example/parquet_data/orders
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers --delete-target-dir --target-dir /user/rj_example/data/customers
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers --as-avrodatafile --target-dir /user/rj_example/avro_data/customers
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera \
--table categories \
--target-dir /user/rj_example/data/categories
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera \
--table products \
--target-dir /user/rj_example/data/products \
--fields-terminated-by '\t'
mkdir rj_example
hadoop fs -get /user/rj_example/data/products rj_example/
hive -e "CREATE DATABASE IF NOT EXISTS rajan_cca175_problems;"
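Optionally, verify the imports before starting (these checks are not part of the original prerequisites):
hadoop fs -ls /user/rj_example/parquet_data/orders
hadoop fs -ls /user/rj_example/avro_data/customers
hadoop fs -ls /user/rj_example/data
ls rj_example/products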
____________________________________________________________________________
PROBLEM 1:
Import a joined dataset from the orders and order_items tables of the retail_db database with the following constraints:
output should generate only two files
columns should be separated by a tab
columns required: order_id, order_item_order_id, order_date, order_status, order_item_subtotal
fields enclosed with "
save data to HDFS path: /user/rj_example/prob1/output
PROBLEM 2:
Data is stored at HDFS paths /user/rj_example/parquet_data/orders and /user/rj_example/avro_data/customers.
Save the data in the following format:
customer_fname$$customer_lname$$order_count
Output should be stored as text files with gzip compression at HDFS location /user/rj_example/prob2/output.
PROBLEM 3:
Create a MySQL table products_rj in any database with the same schema as the retail_db.products table.
Export data from the local path rj_example/products to the MySQL table products_rj.
PROBLEM 4:
Take data from HDFS path /user/rj_example/data/products and process it as follows.
Get the product count in each product category, along with the minimum product price in each category.
Output columns: category_name, minimum_product_price, product_count
Output should be stored as Parquet files with gzip compression.
Output should generate 3 files.
Output should be sorted by product_count in descending order.
Categories data is stored at /user/rj_example/data/categories.
Store the output at HDFS path /user/rj_example/prob4/output.
PROBLEM 5:
Store data from HDFS path /user/rj_example/data/customers into Hive table rj_problem5 in database rajan_cca175_problems.
Output should be in Avro file format with snappy compression. The location of this Hive table should be /user/rj_example/hive/customers.
Also store the source file data at /user/rj_example/prob5/output in JSON file format with gzip compression.
PROBLEM 6:
Take data from HDFS path /user/rj_example/parquet_data/orders and process only the COMPLETE order status data as below.
Output format: order_id| order_date| order_status
order_date should be stored in the format dd-mm-yyyy.
Save this data to HDFS path /user/rj_example/prob6/output in text file format with no compression.
____________________________________________________________________________
Problem Solutions:
# PROBLEM 1:
# -m 2 produces exactly two output files; --enclosed-by '"' wraps every field in double quotes
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera \
--query "select order_id, order_item_order_id, order_date, order_status, order_item_subtotal from orders o join order_items oi ON o.order_id=oi.order_item_order_id where \$CONDITIONS" \
--fields-terminated-by '\t' \
-m 2 \
--enclosed-by '"' \
--target-dir /user/rj_example/prob1/output \
--split-by order_item_order_id
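A quick sanity check (optional, not part of the original solution) that exactly two quoted, tab-separated part files were produced:
hadoop fs -ls /user/rj_example/prob1/output
hadoop fs -cat /user/rj_example/prob1/output/part-m-* | head -5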
# PROBLEM 2:
# read orders (Parquet) and customers (Avro)
o=sqlContext.read.load("/user/rj_example/parquet_data/orders","parquet")
c=sqlContext.read.load("/user/rj_example/avro_data/customers","com.databricks.spark.avro")
# right outer join keeps every customer, even those without orders
joined=o.join(c,o['order_customer_id']==c['customer_id'],"right_outer")
joined.show()
joined.registerTempTable("joined")
# count(order_id) instead of count(1) so customers with no orders report 0
finalDF=sqlContext.sql("select concat(customer_fname,'$$',customer_lname,'$$', count(order_id)) from joined group by customer_fname,customer_lname")
finalDF.rdd.map(lambda x: x[0]).saveAsTextFile("/user/rj_example/prob2/output","org.apache.hadoop.io.compress.GzipCodec")
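Note: reading and writing com.databricks.spark.avro assumes the spark-avro package is on the classpath. If it is not already bundled in your environment, it can be added when launching the shell, for example (the version below is an assumption for Spark 1.x with Scala 2.10; match it to your Spark/Scala versions):
pyspark --packages com.databricks:spark-avro_2.10:2.0.1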
# PROBLEM 3:
# in MySQL: create an empty table with the same column definitions as retail_db.products
create table rajan.products_rj as select * from retail_db.products limit 0;
# copy the local data back to HDFS so sqoop-export can read it
hadoop fs -put rj_example/products /user/rj_example/prob_3_data
sqoop-export --connect jdbc:mysql://localhost/rajan --username root --password cloudera \
--table products_rj \
--export-dir /user/rj_example/prob_3_data \
--input-fields-terminated-by '\t'
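An optional check that the export landed: compare row counts between the HDFS export directory and MySQL (credentials and table names as above):
hadoop fs -cat /user/rj_example/prob_3_data/part-m-* | wc -l
mysql -u root -pcloudera -e "select count(*) from rajan.products_rj"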
# PROBLEM 4:
# products: id, category id, price; cast price to float so min() compares numerically rather than as a string
p=sc.textFile("/user/rj_example/data/products").map(lambda x: (x.split("\t")[0],x.split("\t")[1],float(x.split("\t")[4]))).toDF(["product_id","product_category_id","product_price"])
c=sc.textFile("/user/rj_example/data/categories").map(lambda x: (x.split(",")[0],x.split(",")[2])).toDF(["category_id","category_name"])
p.registerTempTable("p")
c.registerTempTable("c")
finalDF=sqlContext.sql("select c.category_name,p1.minimum_product_price, p1.product_count from c join (select product_category_id,min(product_price) as minimum_product_price, count(1) as product_count from p group by product_category_id)p1 ON p1.product_category_id=c.category_id order by p1.product_count desc")
# gzip-compressed Parquet, coalesced to 3 output files
sqlContext.setConf("spark.sql.parquet.compression.codec","gzip")
finalDF.coalesce(3).write.save("/user/rj_example/prob4/output","parquet")
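As an optional read-back check, the Parquet output can be reloaded to confirm the columns and the descending sort; hadoop fs -ls /user/rj_example/prob4/output should also list three part files:
check=sqlContext.read.load("/user/rj_example/prob4/output","parquet")
check.show(5)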
# PROBLEM 5:
df=sc.textFile("/user/rj_example/data/customers").map(lambda x: x.split(",")).toDF(["customer_id","customer_fname","customer_lname","customer_email","customer_password","customer_street","customer_city","customer_state","customer_zipcode"])
df.show()
# the problem asks for snappy compression with Avro
sqlContext.setConf("spark.sql.avro.compression.codec","snappy")
# create the table inside the rajan_cca175_problems database at the required location
sqlContext.sql("use rajan_cca175_problems")
df.write.mode("overwrite").format("com.databricks.spark.avro").saveAsTable("rj_problem5",path="/user/rj_example/hive/customers")
# JSON output with gzip compression
df.toJSON().saveAsTextFile("/user/rj_example/prob5/output","org.apache.hadoop.io.compress.GzipCodec")
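Optionally, the gzip JSON output can be read straight back (Spark handles the .gz extension transparently):
sqlContext.read.json("/user/rj_example/prob5/output").show(5)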
# PROBLEM 6:
df=sqlContext.read.load("/user/rj_example/parquet_data/orders","parquet")
# keep only COMPLETE orders
df1=df.filter(df['order_status']=='COMPLETE')
df1.show()
df2=df1.select('order_id','order_date','order_status')
df2.show()
df2.registerTempTable("df")
# sqoop stores order_date in Parquet as epoch milliseconds: convert it to a date, then reformat yyyy-mm-dd as dd-mm-yyyy over the full dataset (no limit)
finalDF=sqlContext.sql("select order_id, concat(substr(order_date,9,2),'-',substr(order_date,6,2),'-',substr(order_date,1,4)),order_status from (select order_id, to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_date ,order_status from df)a")
finalDF.map(lambda x: '| '.join(map(str, x))).saveAsTextFile("/user/rj_example/prob6/output")
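A final spot check (optional): each output line should look like order_id| dd-mm-yyyy| COMPLETE.
hadoop fs -cat /user/rj_example/prob6/output/part-* | head -5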