Avro

Avro is an open, row-based data serialization format from the Apache ecosystem. It stores records together with their schema, which makes it compact and self-describing, and it is common in data pipelines and in streaming systems such as Kafka.

Also known as: .avro file, Apache Avro, row-based data format

  • Row-based, schema-carrying data format
  • Compact and self-describing; common with Kafka
  • Row-oriented, in contrast to column-oriented Parquet

How Avro works

An Avro file packs records in a compact binary form and embeds the schema (defined in JSON) that describes their fields. Because the schema travels with the data, any reader knows how to interpret the records.
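As a rough illustration of what "compact binary form" means, the sketch below hand-encodes one record using Avro's binary rules for two types only: a long is zigzag-encoded and written as a variable-length integer, and a string is its byte length (as a long) followed by UTF-8 bytes. The User schema and the helper names are hypothetical, chosen for this example. The sketch does not write a real .avro container file; in practice a library such as fastavro or the official avro package would handle that.

```python
import json

# Hypothetical schema for illustration. Avro schemas are written in JSON.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "long"}
  ]
}
""")


def encode_long(n: int) -> bytes:
    """Zigzag-encode n, then write it as a base-128 variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small negatives become small positives
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: emit a continuation byte
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Write the UTF-8 byte length as a long, followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Only the two types used by the example schema are supported here.
ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema order; field names are not written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


user = {"name": "Ada", "age": 30}
payload = encode_record(SCHEMA, user)
as_json = json.dumps(user).encode("utf-8")
print(len(payload), "bytes as Avro binary,", len(as_json), "bytes as JSON")
```

The encoded bytes hold values only. Without the schema a reader could not tell where one field ends and the next begins, which is why the schema has to travel with the data.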

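Avro is row-oriented, which suits writing and streaming whole records. It also handles schema evolution well: a reader resolves the schema the data was written with against the schema it expects, so fields that were removed, or added with a default value, do not break older or newer readers, and some type changes (such as int to long) are promoted automatically.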
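The core of that resolution can be sketched in a few lines. This is a simplified model that works on already-decoded records (plain dicts) rather than on bytes, and it ignores type promotion, aliases, and nested types. The schemas and the resolve function are hypothetical names for this example, not part of any Avro library.

```python
# Schema the data was written with (hypothetical).
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
    ],
}

# Schema the reader expects: "age" was dropped, "email" was added with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": "unknown"},
    ],
}


def resolve(writer_schema: dict, reader_schema: dict, record: dict) -> dict:
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists on both sides: take the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is new to the reader: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Writer fields the reader does not declare (here "age") are simply ignored.
    return resolved


print(resolve(WRITER_SCHEMA, READER_SCHEMA, {"name": "Ada", "age": 30}))
```

Giving new fields a default is what keeps a change compatible in this model: without one, the reader has nothing to fill in for records written before the field existed.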

Avro vs Parquet and JSON

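Compared with JSON, Avro is usually far more compact, because the binary encoding does not repeat field names in every record, and it enforces the types declared in the schema. Compared with Parquet, Avro is row-based and better suited to streaming and writes, while Parquet is column-based and better suited to analytical reads.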
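The row versus column distinction can be pictured with plain Python structures. This is only a conceptual sketch of the two layouts using made-up sample data; it does not reflect the actual on-disk encoding of either format.

```python
# Row-oriented layout: each record is stored whole, one after another.
rows = [
    {"name": "Ada", "age": 30},
    {"name": "Grace", "age": 41},
]

# Column-oriented layout: all values of one field are stored together.
columns = {
    "name": [row["name"] for row in rows],
    "age": [row["age"] for row in rows],
}

# Appending or streaming a record is one operation in the row layout...
rows.append({"name": "Alan", "age": 36})

# ...but touches every column in the column layout.
for field, value in {"name": "Alan", "age": 36}.items():
    columns[field].append(value)

# An analytical read over one field only needs that column.
average_age = sum(columns["age"]) / len(columns["age"])
print(average_age)
```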

Avro is a binary format read by data tools, not a text file you edit by hand. Avro and Parquet frequently appear together in the same data platform.
