Building a Hive SerDe, Part 1

What is a SerDe?


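SerDe is short for "Serializer and Deserializer".
Hive uses a SerDe whenever it reads or writes data. On read, the Deserializer turns each record into a row object; on write, the Serializer turns a row object back into a record: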
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

Source: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe
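
To make the two paths concrete, here is a minimal sketch in plain Java with no Hive dependency. The class name CsvRoundTrip and the comma-delimited format are assumptions made for illustration only; a real SerDe works with Writable values and ObjectInspectors rather than plain strings and lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a SerDe. This is NOT a Hive interface; it only
// mirrors the two directions shown in the flow above.
class CsvRoundTrip {

    // Read path: serialized value --> Deserializer --> row object.
    static List<String> deserialize(String value) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return new ArrayList<>(Arrays.asList(value.split(",", -1)));
    }

    // Write path: row object --> Serializer --> serialized value.
    static String serialize(List<String> row) {
        return String.join(",", row);
    }

    public static void main(String[] args) {
        List<String> row = deserialize("1,alice,seoul");
        System.out.println(row);            // prints [1, alice, seoul]
        System.out.println(serialize(row)); // prints 1,alice,seoul
    }
}
```

The file format (InputFileFormat / OutputFileFormat) decides how records are split out of a file; the SerDe only decides how one record maps to one row and back.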


SerDe Template

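This is a SerDe template published by Cloudera. The deserialize and serialize methods (highlighted in red in the original post) are the Deserializer and the Serializer.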

package org.serde;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TemplateSerde implements SerDe {
    private StructTypeInfo rowTypeInfo;
    private ObjectInspector rowOI;
    private List<String> colNames;
    private List<Object> row = new ArrayList<Object>();

    /**
     * An initialization function used to gather information about the table.
     * Typically, a SerDe implementation will be interested in the list of
     * column names and their types. That information will be used to help
     * perform actual serialization and deserialization of data.
     */
    @Override
    public void initialize(Configuration conf, Properties tbl)
        throws SerDeException {
      // Get a list of the table's column names.
      String colNamesStr = tbl.getProperty(Constants.LIST_COLUMNS);
      colNames = Arrays.asList(colNamesStr.split(","));

      // Get a list of TypeInfos for the columns. This list lines up with
      // the list of column names.
      String colTypesStr = tbl.getProperty(Constants.LIST_COLUMN_TYPES);
      List<TypeInfo> colTypes = TypeInfoUtils.getTypeInfosFromTypeString(colTypesStr);

      rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(colNames, colTypes);
      rowOI = TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(rowTypeInfo);
    }

    /**
     * This method does the work of deserializing a record into Java objects
     * that Hive can work with via the ObjectInspector interface.
     */
    @Override
    public Object deserialize(Writable blob) throws SerDeException {
      row.clear();
      // Do work to turn the fields in the blob into a set of row fields
      return row;
    }

    /**
     * Return an ObjectInspector for the row of data
     */
    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
      return rowOI;
    }

    /**
     * Unimplemented
     */
    @Override
    public SerDeStats getSerDeStats() {
      return null;
    }

    /**
     * Return the class that stores the serialized data representation.
     */
    @Override
    public Class<? extends Writable> getSerializedClass() {
      return Text.class;
    }

    /**
     * This method takes an object representing a row of data from Hive, and
     * uses the ObjectInspector to get the data for each column and serialize
     * it.
     */
    @Override
    public Writable serialize(Object obj, ObjectInspector oi)
        throws SerDeException {
      // Take the object and transform it into a serialized representation
      return new Text();
    }
}
Source: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
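
The template leaves the bodies of deserialize and serialize empty. As a rough sketch of what could go there, the plain-Java class below fills them in for a tab-delimited record. Everything in it is an assumption for illustration: the class name DelimitedRowSketch, the tab delimiter, the handling of only int, bigint and string, and the use of String and List in place of Hive's Text and ObjectInspector. It also assumes that the table properties carry the column names under "columns" (comma-separated) and the types under "columns.types" (colon-separated), and it splits the type string naively, which would not work for complex types; the template uses TypeInfoUtils for that reason.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical, Hive-free sketch of the template's three main steps.
class DelimitedRowSketch {
    private List<String> colNames;
    private List<String> colTypes;
    // Reused for every record, like the `row` field in the template.
    private final List<Object> row = new ArrayList<>();

    // Mirrors initialize(): read column names and types from table properties.
    void initialize(Properties tbl) {
        colNames = Arrays.asList(tbl.getProperty("columns").split(","));
        // Assumption: primitive types only, so a plain split on ':' is enough.
        colTypes = Arrays.asList(tbl.getProperty("columns.types").split(":"));
        if (colNames.size() != colTypes.size()) {
            throw new IllegalArgumentException("column names and types differ in length");
        }
    }

    // Mirrors deserialize(): turn one tab-delimited line into row fields.
    List<Object> deserialize(String line) {
        row.clear();
        String[] fields = line.split("\t", -1);
        for (int i = 0; i < colNames.size(); i++) {
            // Design choice of this sketch: missing or empty fields become null.
            if (i >= fields.length || fields[i].isEmpty()) {
                row.add(null);
                continue;
            }
            switch (colTypes.get(i)) {
                case "int":    row.add(Integer.valueOf(fields[i])); break;
                case "bigint": row.add(Long.valueOf(fields[i]));    break;
                default:       row.add(fields[i]);                  break;
            }
        }
        return row;
    }

    // Mirrors serialize(): join the row's fields back into one line.
    String serialize(List<Object> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) sb.append('\t');
            Object v = values.get(i);
            if (v != null) sb.append(v); // null is written as an empty field
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Properties tbl = new Properties();
        tbl.setProperty("columns", "id,name");
        tbl.setProperty("columns.types", "int:string");

        DelimitedRowSketch serde = new DelimitedRowSketch();
        serde.initialize(tbl);
        List<Object> r = serde.deserialize("7\tkim");
        System.out.println(r);
        System.out.println(serde.serialize(r));
    }
}
```

Because the row list is cleared and reused on every call, a caller that needs to keep a row must copy it before deserializing the next record. In the real template, serialize would walk the fields through the ObjectInspector passed in instead of assuming a List.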

