Q. If you run a SELECT * query in Hive, why does it not run MapReduce?
This is an optimization. The hive.fetch.task.conversion property lets Hive convert simple queries into a FETCH task, avoiding the latency of MapReduce overhead. For plain SELECT, FILTER, and LIMIT queries, Hive skips MapReduce and reads the data directly with a FETCH task, so the query executes without launching a MapReduce job.
By default the property is set to "minimal", which optimizes only SELECT *, FILTER on partition columns, and LIMIT queries, whereas the other value, "more", optimizes SELECT, FILTER, and LIMIT queries in general (including TABLESAMPLE and virtual columns).
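As a quick sketch, the property can be set from the Hive session itself (the table name below is hypothetical):

```sql
-- Allow eligible queries to run as a direct FETCH task (no MapReduce job)
SET hive.fetch.task.conversion=more;

-- Simple projection with a LIMIT: no aggregation or join,
-- so this is served by a FETCH task rather than a MapReduce job
SELECT * FROM employees LIMIT 10;

-- Adding an aggregation disqualifies the query from fetch conversion,
-- so this one is compiled into a MapReduce job
SELECT COUNT(*) FROM employees;
```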
Q. If you run Hive as a server, what are the available mechanisms for connecting to it from an application?
You can connect to the Hive server in the following ways:
1. Thrift Client: Using Thrift you can call Hive commands from various programming languages, e.g. C++, Java, PHP, Python, and Ruby.
2. JDBC Driver: Hive supports a Type 4 (pure Java) JDBC driver.
3. ODBC Driver: Hive supports the ODBC protocol.
Q. What is a SerDe? Why do you use it? What are the different SerDe formats?
The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.
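For example, a SerDe is typically declared in the table DDL. A minimal sketch using the built-in OpenCSVSerde (the table and column names are hypothetical):

```sql
-- Hypothetical table whose records are parsed by OpenCSVSerde:
-- the Deserializer splits each CSV line into columns at query time,
-- and the Serializer writes rows back out as CSV on INSERT
CREATE TABLE web_logs (
  ip      STRING,
  request STRING,
  status  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;
```

Note that OpenCSVSerde exposes every column as STRING regardless of the declared types, so casts are needed for numeric work.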
Q. How do you start the Hive Thrift server?
To start the Thrift server, use the command below:
$ hive --service hiveserver
(In current Hive releases the original HiveServer has been removed in favor of HiveServer2, started with $ hive --service hiveserver2.)
Q. What types of input formats are supported in Hive?
A file format is the way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside the file.
TEXTFILE is a widely used input/output format in Hadoop. If we define a table as TEXTFILE, Hive can load data such as CSV (Comma Separated Values), tab- or space-delimited text, and JSON data.
SEQUENCEFILE: Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs for each record. Sequence files are splittable binary files, and a common use is to combine two or more smaller files into a single sequence file.
RCFILE stands for Record Columnar File, another binary file format, which offers a high compression rate by storing data column-wise. RCFILEs are flat files consisting of binary key/value pairs and share many similarities with SEQUENCEFILE, but RCFILE stores the columns of a table in a columnar layout. It is used when we want to perform operations on multiple rows at a time.
ORC stands for Optimized Row Columnar, which stores data more efficiently than the other file formats. ORC can reduce the size of the original data by up to 75%; as a result, the speed of data processing also increases.
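The file format is chosen when the table is created. A sketch covering the formats above (table and column names are hypothetical):

```sql
-- Plain text, the default storage format
CREATE TABLE logs_text (line STRING)
STORED AS TEXTFILE;

-- Splittable binary key-value pairs
CREATE TABLE logs_seq (line STRING)
STORED AS SEQUENCEFILE;

-- Record Columnar File
CREATE TABLE logs_rc (line STRING)
STORED AS RCFILE;

-- Optimized Row Columnar, with ZLIB compression enabled
CREATE TABLE logs_orc (line STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```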
Q. What is HCatalog?
HCatalog is a table and storage management layer for Hadoop:
It assists integration with other tools and supplies read and write interfaces for Pig, Hive, and MapReduce.
It provides a shared schema and data types for Hadoop tools, so you do not have to explicitly define the data structures in each program.
It exposes table information through a REST interface (WebHCat) so that you can build custom tools and applications to interact with Hadoop data structures and access data externally.
It also integrates with Sqoop, a tool designed to transfer data back and forth between Hadoop and relational databases such as SQL Server and Oracle.
It provides APIs and a web-service wrapper for accessing metadata in the Hive metastore.