Tuesday, February 5, 2013

Avro Tips

Apache Avro is a compressed, schema-based binary file format designed for Hadoop. As of Hive 0.9.1 and 0.10, Hive can read Avro files directly. Cloudera also back-ported Avro compatibility into its Cloudera Hive 0.9.0.

Avro is highly structured, supporting nested records and arrays, which makes it a good match for Hive, whose HiveQL syntax adopted SQL:1999-style records/structures and arrays.

Avro is a great space-saving alternative to JSON, especially since Apache Pig cannot read gz-compressed JSON.

This is a collection of tips on using Avro -- most of them can be found here and there on the web, but here they all are in one place.

Schema tips


Below is an example schema that shows nested records, nullable fields, a required sub-record, an optional sub-record, and a nullable array of strings.
{"name": "event", "type": "record", "fields": [
  {"name": "logLevel", "type": ["null", "int"], "default": null},
  {"name": "details", "type": 
    {"type": "record", "name": "detailstype", "fields": [
      {"name": "eventname", "type": ["null", "string"], "default": null},
    ]}
  },
  {"name": "metadata", "type": ["null",
    {"type": "record", "name": "metadatatype", "fields": [
      {"name": "referrer", "type": "string"},
    ]}]
  },
  {"name": "messages",   "type": ["null",
    {"type": "array", "name": "messagestype", "items": 
      {"name": "message", "type": "string"}
    }],
    "default": null
  }
]}
  • Declaring a 1-to-1 sub-record requires the verbose style above, where the field's "type" is itself declared as a full schema object of "type": "record".
  • In order for a field to be nullable, it must be declared as a UNION that includes the type "null", using the JSON array syntax shown above.
  • In order to avoid having to explicitly set a field when writing out an Avro record, the field must be declared with "default": null.
  • Combining the two points above: to declare a field both nullable and defaulting to null, the "null" in the UNION must appear first, because it is the first type in a UNION list that determines the data type of the default value, and null is an invalid default for every data type (e.g. "string", "int", etc.) except the "null" type itself.
  • The above applies to sub-records as well. In the example above, "details" is a required sub-record (with an optional field "eventname"), whereas "metadata" is an optional sub-record (which, when present, has a required field "referrer").
  • When writing out sub-records using the "generic" API rather than the "specific" API, the Java code to instantiate an empty sub-record is
    GenericData.Record childdatum =
        new GenericData.Record(schema.getField(fieldname).schema());
    However, if the sub-record is optional (nullable), then the field's schema is a UNION, and digging past the UNION makes the Java code
    GenericData.Record childdatum =
        new GenericData.Record(schema.getField(fieldname).schema().getTypes().get(1));
    In both cases, the childdatum is populated and added to the parentdatum via
    childdatum.put("aChildStringField", "A string value");
    parentdatum.put(fieldname, childdatum);
    A complete write example putting these pieces together appears after this list.
  • Appending to existing Avro files in HDFS is not supported, even though appending to existing Avro files on the local filesystem and appending to existing non-Avro HDFS files are both supported. The open one-year-old unassigned JIRA ticket for such a feature is AVRO-1035.
  • To view a .avro file:
    java -jar avro-tools-1.7.3.jar tojson mydata.avro
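Putting the pieces above together, below is a minimal sketch of writing one record of the example schema with the Avro 1.7 generic API. The file paths and field values are hypothetical, and the schema above is assumed to be saved as event.avsc:
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteEvent {
  public static void main(String[] args) throws Exception {
    // Parse the example schema from a file (path is hypothetical).
    Schema schema = new Schema.Parser().parse(new File("event.avsc"));

    // Required sub-record: the field's schema is the record schema itself.
    GenericData.Record details =
        new GenericData.Record(schema.getField("details").schema());
    details.put("eventname", "pageview");

    // Optional sub-record: the field's schema is a UNION of ["null", record],
    // so dig past the UNION to position 1 for the record schema.
    GenericData.Record metadata = new GenericData.Record(
        schema.getField("metadata").schema().getTypes().get(1));
    metadata.put("referrer", "http://example.com");

    GenericData.Record event = new GenericData.Record(schema);
    event.put("logLevel", 3);
    event.put("details", details);
    event.put("metadata", metadata);
    // "messages" is nullable with "default": null, so it can be left unset.

    // Write the record out to an Avro container file.
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("mydata.avro"));
    writer.append(event);
    writer.close();
  }
}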

Hive using Avro


Example table definition:
CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc');
[UPDATE 2013-07-18: The above Hive table creation TBLPROPERTIES no longer works with Hive 0.11. Instead one must use SERDEPROPERTIES.]
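For Hive 0.11, below is a sketch of the SERDEPROPERTIES variant, using the same hypothetical schema URL as above:
CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc')
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';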
  • Although the Apache documentation gives examples like 'avro.schema.url'='file:///home/...', that doesn't actually work for any non-trivial Hive query, because the map/reduce tasks don't have access to the local filesystem.
  • It's not required to specify LOCATION for where the external Hive data will be stored. If it is not specified, then Hive creates the directory /user/hive/warehouse/<tablename> in HDFS.
  • It is possible to arbitrarily dump a bunch of Avro files into a Hive table directory, and Hive will automatically read them all when executing a query.
  • Not specific to Avro, but it is possible to specify one or more levels of PARTITION for a Hive table. Each level of PARTITION corresponds to another level of directory underneath /user/hive/warehouse/ (e.g. for a two-level PARTITION one might have a directory /user/hive/warehouse/mytable/datestring=2013-02-01/useridfirstchar=2). Using the point above, one can dump multiple Avro files into one such leaf directory of a PARTITIONed table.
    However, one cannot merely create partition directories in /user/hive/warehouse/<tablename>, because the partition names also need to be added to the Hive metastore, which can be done with HiveQL via either the Hive CLI or JDBC:
    ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (ds='2013-02-01')
    Also, note that contrary to examples in the Apache documentation, "date" is not a valid partition name because it is a reserved word. "ds" (for "date string") is a popular alternative. A sketch of a partitioned table definition appears after this list.
  • Hive supports SQL:1999-style nested structures and nested arrays out of the box, and so is a good match for highly structured Avro schemas. E.g., below is a valid Hive query against the example schema and CREATE TABLE from above:
    SELECT details.eventname, messages[0] FROM mytable
    Hive also has functions for array handling, such as explode(), which returns array elements from a single column as multiple result-set rows (see the sketch after this list). https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-explode
  • Not specific to Avro, but SHOW CREATE TABLE is not supported in Hive until 0.10. To do so in Hive 0.9, download HiveShowCreateTable.jar from https://issues.apache.org/jira/browse/HIVE-967 and then
    hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D table.name=mytable 2>/dev/null
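As promised above, here is a sketch of a two-level PARTITIONed variant of the example table; the table name and partition values are hypothetical, and the schema URL is the same as above:
CREATE EXTERNAL TABLE mytable_partitioned
  PARTITIONED BY (ds STRING, useridfirstchar STRING)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc');
-- Register each leaf directory with the metastore after dumping Avro files into it:
ALTER TABLE mytable_partitioned ADD IF NOT EXISTS
  PARTITION (ds='2013-02-01', useridfirstchar='2');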
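And a sketch of explode() via LATERAL VIEW against the messages array of the example schema (the msgtable and msg aliases are arbitrary):
SELECT logLevel, msg
FROM mytable LATERAL VIEW explode(messages) msgtable AS msg;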
