Reference

Parquet

Parquet is an open columnar storage format for large datasets, widely used in data analytics and big-data tools. By storing data column by column with compression, it makes analytical queries fast and files much smaller than CSV.

Also known as: .parquet file, Apache Parquet, columnar storage

  • Open columnar storage format for analytics
  • Compresses well; typically smaller and faster to query than CSV
  • Binary; read with data tools, not a text editor

Why columnar storage matters

A CSV file stores data row by row. Parquet stores it column by column, so a query that needs only a few columns can skip the rest, and the similar values that sit together in a column compress very well.
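The idea can be sketched with a toy in-memory model. This is not the Parquet file layout, only an illustration of why a column-oriented arrangement helps; the sample records and the helper names `to_columns` and `run_length_encode` are made up for this example. Run-length encoding is used here because it is one of the encodings Parquet applies to column data.

```python
from itertools import groupby

# Hypothetical sample data: four records with three fields each.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 12.5},
    {"id": 3, "country": "FR", "amount": 7.0},
    {"id": 4, "country": "FR", "amount": 3.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# A query that needs only "amount" touches one list and ignores the others.
total = sum(columns["amount"])

# Repeated values sit next to each other in a column, so they encode compactly.
encoded_country = run_length_encode(columns["country"])

print(total)
print(encoded_country)
```

In the row-oriented list, summing `amount` would mean visiting every record and every field boundary; in the column-oriented dict, the same query reads a single list.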

That design makes Parquet a common default for analytics engines and data lakes. It is a binary format, not human-readable, and is meant to be read by data tools rather than opened in a text editor.
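Because the format is binary, a quick way to tell whether a file is likely Parquet is to look at its marker bytes rather than its text. The sketch below assumes the standard layout in which a Parquet file begins and ends with the 4-byte magic `PAR1`; the helper name `looks_like_parquet` is hypothetical, and the check is only a heuristic, not a validation of the file's contents.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the 4-byte Parquet magic."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        if f.tell() < 8:
            # Too short to hold both the leading and trailing markers.
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A CSV file passed to this helper would fail the check, since it starts with ordinary text such as a header row.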

Parquet vs CSV and other formats

Compared with CSV, Parquet files are typically far smaller for the same data and much faster to query, while also preserving column data types. The trade-off is that you need a library or tool to read them.

Within the big-data world, Parquet is column-oriented while Avro is row-oriented; the two are often used together in data pipelines, typically with Avro for writing records one at a time and Parquet for analytical storage.
