How does gzip (.gz) work?

Content on WhatAnswers is provided "as is" for informational purposes. While we strive for accuracy, we make no guarantees. Content is AI-assisted and should not be used as professional advice.

Last updated: April 8, 2026

Quick Answer: Gzip (gz) is a file compression format and software application created by Jean-loup Gailly and Mark Adler in 1992, based on the DEFLATE algorithm. It typically achieves compression ratios of 2:1 to 3:1 for text files, reducing file sizes by 50-70%. Gzip works by replacing repeated strings with pointers to previous occurrences and using Huffman coding for efficient bit representation. It remains widely used for web content delivery, software distribution, and log file compression due to its balance of speed and compression efficiency.
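The ratio claims above are easy to check with Python's standard-library `gzip` module. The sample text below is an illustrative assumption; real-world ratios depend on the content, and highly repetitive input compresses far better than the typical 2:1 to 3:1.

```python
import gzip

# Deliberately repetitive text; realistic HTML/CSS/JS lands closer to 2:1-3:1.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

compressed = gzip.compress(text)  # DEFLATE stream with a gzip header and trailer
ratio = len(text) / len(compressed)

print(f"original: {len(text)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {ratio:.1f}:1")

# Round-trip: decompression restores the original bytes exactly.
assert gzip.decompress(compressed) == text
```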

Overview

Gzip (GNU zip) is a widely used file compression format and software application that originated in the early 1990s as part of the GNU Project. Created by Jean-loup Gailly and Mark Adler in 1992, gzip was developed as a free software replacement for the proprietary compress program, which used the LZW algorithm and was subject to patent restrictions. The name "gzip" stands for GNU zip, reflecting its origins in Richard Stallman's GNU Project. Gzip quickly became popular in Unix-like systems and was standardized in RFC 1952 in 1996. The format uses the .gz file extension and is commonly used for compressing single files, though it's often combined with tar to create .tar.gz archives for multiple files. By 2023, gzip remained one of the most common compression formats on the web, with approximately 40% of websites using it for HTTP compression according to W3Techs surveys. Its longevity stems from its open-source nature, good compression ratios, and widespread support across operating systems and applications.

How It Works

Gzip operates using the DEFLATE compression algorithm, which combines two techniques: LZ77 (Lempel-Ziv 1977) and Huffman coding. First, the LZ77 algorithm scans the input data for repeated sequences of bytes. When it finds a match to a previous occurrence (within a 32KB sliding window), it replaces the repeated sequence with a pair of numbers: a distance back to the previous occurrence and the length of the match. This process eliminates redundancy in the data. Next, Huffman coding takes the resulting stream of literals and length-distance pairs and assigns variable-length codes to each symbol, with more frequent symbols receiving shorter codes. The algorithm builds optimal prefix codes based on symbol frequencies, creating a binary tree structure. Gzip typically uses two Huffman trees: one for literals and lengths, and another for distances. The compressed output includes these code trees followed by the encoded data. During decompression, gzip reverses this process: it reads the Huffman trees, decodes the bitstream back into literals and pointers, then uses the LZ77 pointers to reconstruct the original data. The entire process operates on a per-file basis, with each file compressed independently.
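The LZ77 step described above can be sketched as a greedy match search over a sliding window. This toy version is for illustration only: it uses a naive linear scan where real DEFLATE implementations use hash chains, and it omits the Huffman coding stage entirely.

```python
def lz77_tokens(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Greedy LZ77: emit literal bytes or (distance, length) back-references."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window (up to 32KB back) for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 258):          # DEFLATE caps match length at 258
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("match", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz77_decode(tokens) -> bytes:
    """Reverse the LZ77 step: expand back-references against prior output."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):   # byte-by-byte copy handles overlapping matches
                out.append(out[-dist])
    return bytes(out)
```

For input like `b"abcabcabcabcxyz"`, the encoder emits three literals followed by a single `("match", 3, 9)` back-reference, which is the redundancy elimination the paragraph describes; in real gzip, those tokens would then be Huffman-coded using the two trees (literals/lengths and distances).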

Why It Matters

Gzip's significance extends across multiple domains due to its efficiency and ubiquity. On the web, gzip compression reduces bandwidth usage by 50-70% for text-based content like HTML, CSS, and JavaScript, leading to faster page loads and reduced hosting costs. Major web servers like Apache and Nginx include built-in gzip support, and content delivery networks (CDNs) use it extensively. In software distribution, gzip compresses source code archives and package repositories, saving storage space and download time. System administrators rely on gzip for log file compression, where it can reduce multi-gigabyte logs to manageable sizes for archival. The format's open specification and lack of patent restrictions enabled widespread adoption in open-source software; the same DEFLATE algorithm, implemented in the zlib library by gzip's own authors, also underpins the ZIP and PNG formats. While newer algorithms like Brotli and Zstandard offer better compression ratios, gzip remains the baseline due to its universal support, making it essential for backward compatibility and systems with limited computational resources.
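The log-archival use case above takes only a few lines with Python's `gzip.open`. The file name and log contents here are illustrative assumptions; the output is the same on-disk format that running `gzip app.log` would produce.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Hypothetical log file in a scratch directory (name and contents are made up).
workdir = Path(tempfile.mkdtemp())
log = workdir / "app.log"
log.write_text("level=INFO msg=started\n" * 10_000)

# Compress app.log -> app.log.gz, streaming so large logs never load into memory.
gz_path = log.with_name(log.name + ".gz")
with open(log, "rb") as src, gzip.open(gz_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

print(f"{log.stat().st_size} bytes -> {gz_path.stat().st_size} bytes")

# gzip.open transparently decompresses on read, so archived logs stay searchable.
with gzip.open(gz_path, "rt") as fh:
    print(fh.readline().strip())  # level=INFO msg=started
```

Because log lines are highly repetitive, this kind of input is where gzip's multi-gigabyte-to-manageable reductions come from.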

Sources

  1. Wikipedia (CC BY-SA 4.0)
