
CCA 175 Preparation with these 6 practice questions - Try to solve them in 1 hour

Since I found very few practice questions available on the internet for the CCA 175 - Hadoop & Spark Developer exam, I have set a 6-question practice exam, with solutions provided in the comment section. If you can complete it in less than 1 hour, then consider applying for the CCA 175 exam; otherwise, you need more practice.

Prerequisites for the questions:
import data from the orders table in Parquet file format and save it to HDFS path: /user/rj_example/parquet_data/orders
import data from the customers table and save it to HDFS path: /user/rj_example/data/customers
import data from the customers table in Avro file format and save it to HDFS path: /user/rj_example/avro_data/customers
import data from the categories table and save it to HDFS path: /user/rj_example/data/categories
import data from the products table and save it to HDFS path: /user/rj_example/data/products, with '\t' as the fields separator
create a local directory 'rj_example'
copy data from /user/rj_example/data/products to the local directory 'rj_example'
create a Hive database rajan_cca175_problems

STEPS for these prerequisites:
sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table orders --as-parquetfile --target-dir /user/rj_example/parquet_data/orders

sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers --delete-target-dir --target-dir /user/rj_example/data/customers

sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers --as-avrodatafile --target-dir /user/rj_example/avro_data/customers

sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera \
--table categories \
--target-dir /user/rj_example/data/categories

sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera \
--table products \
--target-dir /user/rj_example/data/products \
--fields-terminated-by '\t'

mkdir rj_example
hadoop fs -get /user/rj_example/data/products rj_example/
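
The last prerequisite, creating the Hive database, can be done from the Hive CLI or with a one-liner like the sketch below (assuming the hive command is available on the gateway node):

hive -e "CREATE DATABASE IF NOT EXISTS rajan_cca175_problems;"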

____________________________________________________________________________
PROBLEM 1:
Import a joined dataset from the orders and order_items tables of the retail_db database with the following constraints:
output should generate only two files
columns should be separated by tab
columns required: order_id, order_item_order_id, order_date, order_status, order_item_subtotal
each field should be enclosed in double quotes (")
save data to HDFS path: /user/rj_example/prob1/output


PROBLEM 2:
Data is stored at HDFS paths /user/rj_example/parquet_data/orders and /user/rj_example/avro_data/customers.
Save the data in the following format:
customer_fname$$customer_lname$$order_count
output should be stored as text files with gzip compression at HDFS location /user/rj_example/prob2/output

PROBLEM 3:
Create a MySQL table products_rj in any database with the same schema as the retail_db.products table.
Export data from the local path rj_example/products to the MySQL table products_rj.

PROBLEM 4:
Take data from HDFS path /user/rj_example/data/products and process it as follows:
get the product count in each product category, and also the minimum product price in each category.
output columns: category_name, minimum_product_price, product_count
output should be stored as Parquet files with gzip compression.
output should generate 3 files.
output should be sorted by product_count in descending order.
categories data is stored at the /user/rj_example/data/categories location.
Store output at HDFS path /user/rj_example/prob4/output

PROBLEM 5:
Store data from HDFS path /user/rj_example/data/customers into Hive table rj_problem5 in database rajan_cca175_problems.
Output should be in Avro file format with Snappy compression. The location of this Hive table should be /user/rj_example/hive/customers.
Also store the source file data to /user/rj_example/prob5/output in JSON file format with gzip compression.

PROBLEM 6:
Take data from HDFS path /user/rj_example/parquet_data/orders and process the orders with COMPLETE order status as below.
Output format: order_id| order_date| order_status
order_date should be stored in the format dd-mm-yyyy
Save this data to HDFS path /user/rj_example/prob6/output in text file format with no compression.





Comments

  1. Problem Solutions:
    # PROBLEM 1:
    sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera \
    --query "select order_id, order_item_order_id, order_date, order_status, order_item_subtotal from orders o join order_items oi ON o.order_id=oi.order_item_order_id where \$CONDITIONS" \
    --fields-terminated-by '\t' \
    -m 2 \
    --enclosed-by '"' \
    --target-dir /user/rj_example/prob1/output \
    --split-by order_item_order_id
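
    # Optional check (not part of the solution above; run from a terminal): the
    # output directory should contain exactly two part files, tab-separated and
    # quote-enclosed.
    hdfs dfs -ls /user/rj_example/prob1/output
    hdfs dfs -cat /user/rj_example/prob1/output/part-m-* | head -5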


    # PROBLEM 2:
    o=sqlContext.read.load("/user/rj_example/parquet_data/orders","parquet")
    c=sqlContext.read.load("/user/rj_example/avro_data/customers","com.databricks.spark.avro")
    joined=o.join(c,o['order_customer_id']==c['customer_id'],"right_outer")
    joined.show()
    joined.registerTempTable("joined")

    # count(order_id) instead of count(1) so customers with no orders get an order_count of 0
    finalDF=sqlContext.sql("select concat(customer_fname,'$$',customer_lname,'$$', count(order_id)) from joined group by customer_fname,customer_lname")
    finalDF.rdd.map(lambda x: x[0]).saveAsTextFile("/user/rj_example/prob2/output","org.apache.hadoop.io.compress.GzipCodec")
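
    # Optional check (run from a terminal): 'hdfs dfs -text' decompresses the gzip
    # part files, so the customer_fname$$customer_lname$$order_count lines can be
    # inspected directly.
    hdfs dfs -text /user/rj_example/prob2/output/part-* | head -5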

    # PROBLEM 3
    create table rajan.products_rj as select * from retail_db.products limit 0;

    sqoop-export --connect jdbc:mysql://localhost/rajan --username root --password cloudera \
    --table products_rj \
    --export-dir /user/rj_example/prob_3_data \
    --input-fields-terminated-by '\t'
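
    # Note: sqoop-export reads from HDFS, so the local copy made in the
    # prerequisites presumably has to be pushed back to HDFS first, e.g. into the
    # export dir used above:
    hadoop fs -put rj_example/products /user/rj_example/prob_3_data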


    # PROBLEM 4:
    p=sc.textFile("/user/rj_example/data/products").map(lambda x: (x.split("\t")[0],x.split("\t")[1],x.split("\t")[4])).toDF(["product_id","product_category_id","product_price"])
    c=sc.textFile("/user/rj_example/data/categories").map(lambda x: (x.split(",")[0],x.split(",")[2])).toDF(["category_id","category_name"])
    p.registerTempTable("p")
    c.registerTempTable("c")

    # cast product_price (read as a string) to float so min() compares numerically, not lexically
    finalDF=sqlContext.sql("select c.category_name,p1.minimum_product_price, p1.product_count from c join (select product_category_id,min(cast(product_price as float)) as minimum_product_price, count(1) as product_count from p group by product_category_id)p1 ON p1.product_category_id=c.category_id order by p1.product_count desc")
    sqlContext.setConf("spark.sql.parquet.compression.codec","gzip")
    finalDF.coalesce(3).write.save("/user/rj_example/prob4/output","parquet")
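
    # Optional check (run from a terminal): with the codec set above, the 3 part
    # files should carry a .gz.parquet suffix.
    hdfs dfs -ls /user/rj_example/prob4/output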

    # PROBLEM 5:
    df=sc.textFile("/user/rj_example/data/customers").map(lambda x: x.split(",")).toDF(["customer_id","customer_fname","customer_lname","customer_email","customer_password","customer_street","customer_city","customer_state","customer_zipcode"])
    df.show()
    # the problem asks for Snappy-compressed Avro output
    sqlContext.setConf("spark.sql.avro.compression.codec","snappy")
    # switch to the target database so the table is created in rajan_cca175_problems
    sqlContext.sql("use rajan_cca175_problems")
    df.write.mode("overwrite").format("com.databricks.spark.avro").saveAsTable("rj_problem5",path="/user/rj_example/hive/customers")

    df.toJSON().saveAsTextFile("/user/rj_example/prob5/output","org.apache.hadoop.io.compress.GzipCodec")
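
    # Optional check (run from a terminal, assuming the Hive CLI is available):
    # confirm the table's database, Avro format and location, and peek at the
    # gzip-compressed JSON output.
    hive -e "describe formatted rajan_cca175_problems.rj_problem5"
    hdfs dfs -text /user/rj_example/prob5/output/part-* | head -2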


    # PROBLEM 6:
    df=sqlContext.read.load("/user/rj_example/parquet_data/orders","parquet")
    df1=df.filter(df['order_status']=='COMPLETE')
    df1.show()
    df2=df1.select('order_id','order_date','order_status')
    df2.show()
    df2.registerTempTable("df")
    # convert the epoch-millis order_date to dd-mm-yyyy for every COMPLETE order
    finalDF=sqlContext.sql("select order_id, concat(substr(order_date,9,2), '-',substr(order_date,6,2),'-',substr(order_date,1,4)),order_status from (select order_id, to_date(from_unixtime(cast(order_date/1000 as bigint))) as order_date ,order_status from df)a")
    finalDF.map(lambda x: '| '.join(map(str, x))).saveAsTextFile("/user/rj_example/prob6/output")
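
    # Optional check (run from a terminal): each line should look like
    # order_id| dd-mm-yyyy| COMPLETE.
    hdfs dfs -cat /user/rj_example/prob6/output/part-* | head -5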




