Data Representation

How Computers Encode Everything as Binary

Introduction: Everything is Binary

Text, images, music, videos, programs, databases—everything stored on a computer is ultimately represented as sequences of 1s and 0s. This guide explores how abstract concepts like letters, colors, and sounds are encoded into binary data that computers can store and process.

This is perhaps the most profound aspect of digital computing: the same binary foundation that performs arithmetic can also represent a symphony, a photograph, or this very sentence you're reading. The key is encoding—agreed-upon schemes that map real-world information to binary patterns.

The Universal Principle

Any information that can be described precisely can be encoded in binary. The challenge isn't whether it's possible—it's choosing encodings that are efficient, compatible, and preserve the information we care about.

  • 149,813 Unicode characters defined
  • 64 bits in a double-precision float
  • 16.7 million colors in a 24-bit image
  • 44,100 audio samples per second in CD quality

Integer Representation

Integers are the simplest data type to represent, because a plain binary number read directly is an unsigned integer. The important details are the fixed sizes used and the byte ordering in memory.

Standard Integer Sizes

Size     | Bits | Unsigned Range        | Signed Range              | Common Names
1 byte   | 8    | 0 to 255              | -128 to 127               | byte, char, int8
2 bytes  | 16   | 0 to 65,535           | -32,768 to 32,767         | short, int16, word
4 bytes  | 32   | 0 to ~4.3 billion     | -2.1B to 2.1B             | int, int32, dword
8 bytes  | 64   | 0 to ~18 quintillion  | -9.2×10¹⁸ to 9.2×10¹⁸     | long, int64, qword
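
The "common names" vary by language and platform; in C, the fixed-width types from <stdint.h> map onto these sizes exactly. A minimal sketch (standard C99) that prints a few of the ranges above:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>   /* PRId32, PRId64, PRIu64 format macros */

    int main(void) {
        printf("int8_t  : %d to %d\n", INT8_MIN, INT8_MAX);      /* -128 to 127 */
        printf("int16_t : %d to %d\n", INT16_MIN, INT16_MAX);    /* -32,768 to 32,767 */
        printf("int32_t : %" PRId32 " to %" PRId32 "\n", INT32_MIN, INT32_MAX);
        printf("int64_t : %" PRId64 " to %" PRId64 "\n", INT64_MIN, INT64_MAX);
        printf("uint64_t: 0 to %" PRIu64 "\n", UINT64_MAX);      /* ~18 quintillion */
        return 0;
    }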

Endianness: Byte Order Matters

When a multi-byte integer is stored in memory, which byte comes first? This is called endianness, and different systems make different choices.

Example: Storing 0x12345678 in Memory

    Value: 0x12345678 (4 bytes)

    Big-Endian (most significant byte first):
    Address:   0x00  0x01  0x02  0x03
    Content:    12    34    56    78

    Little-Endian (least significant byte first):
    Address:   0x00  0x01  0x02  0x03
    Content:    78    56    34    12

    Same value, different byte order in memory!
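
You can see this directly by copying an integer's raw bytes into an array and printing them. A minimal C sketch (the output comment assumes a little-endian x86/x64 machine):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t value = 0x12345678;
        uint8_t bytes[4];
        memcpy(bytes, &value, sizeof value);   /* copy the raw in-memory byte order */

        for (int i = 0; i < 4; i++)
            printf("offset %d: %02X\n", i, bytes[i]);

        /* Little-endian (x86/x64): 78 56 34 12.  Big-endian: 12 34 56 78. */
        return 0;
    }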
                        
Endianness    | Systems Using It                         | Advantage
Big-Endian    | Network protocols, SPARC, older Macs     | Human-readable hex dumps
Little-Endian | x86, x64, ARM (usually), modern devices  | Efficient for some operations
Bi-Endian     | ARM (configurable), PowerPC              | Flexibility

Network Byte Order

Network protocols use big-endian (called "network byte order"). When sending data between systems with different endianness, programmers must convert using functions like htons() (host to network short) and ntohl() (network to host long).
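
Here is a small C sketch of that conversion using the POSIX byte-order functions from <arpa/inet.h>; on a little-endian host the byte order changes, while on a big-endian host the calls are effectively no-ops.

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>   /* htons, htonl, ntohs, ntohl (POSIX) */

    int main(void) {
        uint16_t port = 8080;
        uint32_t addr = 0x12345678;

        uint16_t port_net = htons(port);   /* host to network short (16-bit) */
        uint32_t addr_net = htonl(addr);   /* host to network long  (32-bit) */

        printf("host order:    port=0x%04X  addr=0x%08X\n", port, addr);
        printf("network order: port=0x%04X  addr=0x%08X\n", port_net, addr_net);

        /* Converting back recovers the original host-order values */
        printf("round trip:    port=%u  addr=0x%08X\n",
               ntohs(port_net), ntohl(addr_net));
        return 0;
    }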

Floating-Point Numbers

Integers can't represent fractions or very large/small numbers efficiently. Floating-point representation solves this using a form of scientific notation in binary.

Why We Need Floating-Point

Consider representing 0.1 in binary. There's no exact finite representation—it's a repeating fraction! And how would you represent both 0.0000001 and 100000000 in a fixed number of bits? Fixed-point fails; we need floating-point.

Scientific Notation Parallel

    Decimal Scientific Notation:
    6.022 × 10²³  (Avogadro's number)

    Components:
    - Sign: positive
    - Mantissa (significand): 6.022
    - Base: 10
    - Exponent: 23

    Binary Floating-Point:
    1.101 × 2^5 = 1.625 × 32 = 52

    Same concept, base 2 instead of 10!
                        

IEEE 754 Standard

The IEEE 754 standard defines how floating-point numbers are stored. It's used by virtually all modern CPUs.

Format           | Total Bits | Sign | Exponent | Mantissa | Approx. Range
Half (float16)   | 16         | 1    | 5        | 10       | ±6.5×10⁴
Single (float32) | 32         | 1    | 8        | 23       | ±3.4×10³⁸
Double (float64) | 64         | 1    | 11       | 52       | ±1.8×10³⁰⁸

32-bit Float Layout

    Bit:  31 30--------23 22-----------------------0
          S  EEEEEEEE     MMMMMMMMMMMMMMMMMMMMMMM

    S = Sign (0 = positive, 1 = negative)
    E = Exponent (8 bits, biased by 127)
    M = Mantissa (23 bits, with implicit leading 1)

    Example: -13.625 in 32-bit float

    1. Convert to binary: 13.625 = 1101.101
    2. Normalize: 1.101101 × 2³
    3. Sign: 1 (negative)
    4. Exponent: 3 + 127 = 130 = 10000010
    5. Mantissa: 10110100000000000000000 (drop leading 1)

    Result: 1 10000010 10110100000000000000000
            = 0xC15A0000
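
You can verify this by reinterpreting the float's raw bits as an integer. A short C sketch (it assumes the platform uses IEEE 754 single precision, which all mainstream CPUs do):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = -13.625f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);           /* reinterpret the raw bits */

        uint32_t sign     = bits >> 31;           /* bit 31 */
        uint32_t exponent = (bits >> 23) & 0xFF;  /* bits 30-23 */
        uint32_t mantissa = bits & 0x7FFFFF;      /* bits 22-0 */

        printf("bits     = 0x%08X\n", bits);      /* 0xC15A0000 */
        printf("sign     = %u\n", sign);          /* 1 (negative) */
        printf("exponent = %u (unbiased: %d)\n", exponent, (int)exponent - 127);  /* 130, 3 */
        printf("mantissa = 0x%06X\n", mantissa);  /* 0x5A0000 = 1011010... */
        return 0;
    }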
                        

Special Values: Infinity and NaN

IEEE 754 reserves certain bit patterns for special values:

Value        | Exponent | Mantissa | Meaning
Zero         | All 0s   | All 0s   | ±0.0
Denormalized | All 0s   | Non-zero | Very small numbers
Infinity     | All 1s   | All 0s   | ±∞
NaN          | All 1s   | Non-zero | Not a Number (0/0, √-1)

Precision and Rounding Errors

Floating-point numbers cannot represent all real numbers exactly. This leads to rounding errors that can accumulate:

The Famous 0.1 + 0.2 Problem

    In most programming languages:
    0.1 + 0.2 = 0.30000000000000004

    Why? 0.1 in binary is:
    0.0001100110011001100110011... (repeating)

    It can't be stored exactly in finite bits!

    This affects all binary floating-point systems.
    Solutions: Use integers (cents instead of dollars),
               or decimal floating-point libraries.
                        

Never Compare Floats for Equality

Instead of if (a == b), use if (abs(a - b) < epsilon) where epsilon is a small tolerance. Exact equality rarely works due to accumulated rounding errors.
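
A minimal C illustration of both the problem and the tolerance-based comparison; the epsilon of 1e-9 is an arbitrary choice for this sketch, and real code should pick a tolerance suited to the magnitudes involved.

    #include <stdio.h>
    #include <math.h>

    /* Compare within an absolute tolerance (the simplest form of the idea) */
    static int nearly_equal(double a, double b, double epsilon) {
        return fabs(a - b) < epsilon;
    }

    int main(void) {
        double sum = 0.1 + 0.2;

        printf("0.1 + 0.2 = %.17f\n", sum);                 /* 0.30000000000000004 */
        printf("sum == 0.3 -> %d\n", sum == 0.3);           /* 0 (false!) */
        printf("nearly_equal(sum, 0.3, 1e-9) -> %d\n",
               nearly_equal(sum, 0.3, 1e-9));               /* 1 (true) */
        return 0;
    }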

Text and Character Encoding

Text is represented by assigning numeric codes to characters. The history of character encoding is a journey from simple 7-bit codes to the universal Unicode standard.

ASCII: The Foundation

ASCII (American Standard Code for Information Interchange) dates to 1963 and uses 7 bits to encode 128 characters:

Range    | Characters            | Examples
0-31     | Control characters    | NUL, TAB, LF, CR
32-47    | Punctuation, symbols  | Space ! " # $ % & ' ( )
48-57    | Digits                | 0 1 2 3 4 5 6 7 8 9
58-64    | More punctuation      | : ; < = > ? @
65-90    | Uppercase letters     | A B C ... X Y Z
91-96    | Brackets, etc.        | [ \ ] ^ _ `
97-122   | Lowercase letters     | a b c ... x y z
123-126  | More symbols          | { | } ~
127      | Delete                | DEL

ASCII Patterns Worth Knowing

    Useful ASCII relationships:

    'A' = 65 = 0x41 = 01000001
    'a' = 97 = 0x61 = 01100001
    Difference: 32 (bit 5)

    'a' - 'A' = 32
    To uppercase: char & 0xDF (clear bit 5)
    To lowercase: char | 0x20 (set bit 5)

    '0' = 48 = 0x30
    '9' = 57 = 0x39
    Character to digit: char - '0'
    Digit to character: digit + '0'
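
The same relationships, written out as a runnable C snippet; note that the bit tricks assume the input really is an ASCII letter or digit.

    #include <stdio.h>

    int main(void) {
        char upper = 'a' & 0xDF;   /* clear bit 5: 'a' (0x61) -> 'A' (0x41) */
        char lower = 'A' | 0x20;   /* set bit 5:   'A' (0x41) -> 'a' (0x61) */
        int  digit = '7' - '0';    /* character '7' (0x37) -> integer 7 */
        char ch    = 3 + '0';      /* integer 3 -> character '3' (0x33) */

        printf("'a' & 0xDF = %c\n", upper);   /* A */
        printf("'A' | 0x20 = %c\n", lower);   /* a */
        printf("'7' - '0'  = %d\n", digit);   /* 7 */
        printf("3 + '0'    = %c\n", ch);      /* 3 */
        return 0;
    }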
                        

Extended ASCII and Code Pages

The 8th bit allows 128 more characters (128-255), but different systems used this differently:

  • ISO-8859-1 (Latin-1): Western European characters (é, ñ, ü)
  • ISO-8859-5: Cyrillic characters
  • Windows-1252: Microsoft's Western European encoding
  • Code Page 437: DOS box-drawing characters

This created a mess: the same byte could represent different characters on different systems!

Unicode: The Universal Standard

Unicode assigns a unique code point to every character in every writing system. Currently over 149,000 characters from 161 scripts.

Unicode Code Points

    Format: U+XXXX (hexadecimal)

    U+0041 = A (Latin uppercase A)
    U+03B1 = α (Greek lowercase alpha)
    U+4E2D = 中 (Chinese character "middle")
    U+1F600 = 😀 (grinning face emoji)
    U+1F4BB = 💻 (laptop emoji)

    Unicode planes:
    U+0000 - U+FFFF   = BMP (Basic Multilingual Plane)
    U+10000 - U+10FFFF = Supplementary planes (emoji, etc.)
                        

UTF-8: The Web's Encoding

UTF-8 is a variable-length encoding that stores Unicode code points in 1-4 bytes. It's backwards-compatible with ASCII and is the dominant encoding on the web.

Code Point Range    | Bytes | Byte Pattern
U+0000 - U+007F     | 1     | 0xxxxxxx
U+0080 - U+07FF     | 2     | 110xxxxx 10xxxxxx
U+0800 - U+FFFF     | 3     | 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF  | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: Encoding € (Euro sign) in UTF-8

    € = U+20AC

    1. Convert to binary: 0010 0000 1010 1100

    2. U+20AC is in range U+0800-U+FFFF, so 3 bytes

    3. Pattern: 1110xxxx 10xxxxxx 10xxxxxx

    4. Fill in the bits:
       Split the 16 bits as 4 + 6 + 6:
       0010 | 000010 | 101100

       1110 0010  10 000010  10 101100
         E    2      8    2     A    C

    5. UTF-8 encoding: E2 82 AC (3 bytes)
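
The byte-pattern table above translates almost directly into code. Here is a compact C sketch that encodes one code point into UTF-8; to keep it short it does no validation (it will not reject surrogates or values above U+10FFFF).

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point as UTF-8; returns the number of bytes written. */
    static int utf8_encode(uint32_t cp, uint8_t out[4]) {
        if (cp <= 0x7F) {                       /* 1 byte: 0xxxxxxx */
            out[0] = (uint8_t)cp;
            return 1;
        } else if (cp <= 0x7FF) {               /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp <= 0xFFFF) {              /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else {                                /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
    }

    int main(void) {
        uint8_t buf[4];
        int n = utf8_encode(0x20AC, buf);       /* U+20AC, the Euro sign */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);            /* prints: E2 82 AC */
        printf("\n");
        return 0;
    }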
                        

UTF-16 and UTF-32

Encoding | Bytes per Char | Used By                   | Notes
UTF-8    | 1-4            | Web, Linux, macOS         | ASCII compatible
UTF-16   | 2 or 4         | Windows, Java, JavaScript | Efficient for Asian text
UTF-32   | 4              | Internal processing       | Fixed width, simple indexing

The BOM (Byte Order Mark)

Some files start with U+FEFF to indicate encoding and endianness. In UTF-8: EF BB BF. In UTF-16 BE: FE FF. In UTF-16 LE: FF FE. Many tools strip or ignore the BOM.
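
A small C sketch of how a tool might sniff those byte patterns at the start of a file buffer (checking only the three BOMs mentioned here):

    #include <stdio.h>
    #include <stddef.h>

    /* Return a label for a BOM at the start of the buffer, or NULL if none found. */
    static const char *detect_bom(const unsigned char *buf, size_t len) {
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) return "UTF-8";
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return "UTF-16 BE";
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return "UTF-16 LE";
        return NULL;
    }

    int main(void) {
        unsigned char file_start[] = { 0xEF, 0xBB, 0xBF, 'h', 'i' };
        const char *bom = detect_bom(file_start, sizeof file_start);
        printf("BOM: %s\n", bom ? bom : "none");   /* prints: BOM: UTF-8 */
        return 0;
    }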

Image Representation

Digital images are grids of tiny colored dots called pixels. Each pixel's color is stored as binary data.

Raster Images: Pixels and Color

A raster image is a 2D array of pixels. Each pixel has a color value:

24-bit Color (True Color)

    Each pixel: 3 bytes (24 bits)

    Red:   8 bits (0-255)
    Green: 8 bits (0-255)
    Blue:  8 bits (0-255)

    Example: Bright orange pixel
    R=255, G=165, B=0
    Hex: FF A5 00
    Binary: 11111111 10100101 00000000

    Total colors: 256 × 256 × 256 = 16,777,216

    Image size calculation:
    1920 × 1080 × 3 bytes = 6,220,800 bytes ≈ 6 MB uncompressed
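
A short C sketch that packs that orange pixel into a single 24-bit value and repeats the frame-size arithmetic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Pack bright orange (R=255, G=165, B=0) into one 24-bit value */
        uint8_t r = 255, g = 165, b = 0;
        uint32_t pixel = ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
        printf("pixel = 0x%06X\n", pixel);              /* 0xFFA500 */

        /* Uncompressed size of one 1920x1080 frame at 3 bytes per pixel */
        uint64_t frame_bytes = 1920ULL * 1080 * 3;
        printf("1080p frame: %llu bytes (~%.1f MB)\n",
               (unsigned long long)frame_bytes, frame_bytes / 1e6);  /* 6,220,800 bytes */
        return 0;
    }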
                        

Color Depth and Palettes

Depth  | Colors        | Usage
1-bit  | 2             | Black and white
8-bit  | 256           | GIF, indexed color
16-bit | 65,536        | High color, older games
24-bit | 16.7 million  | True color, standard photos
32-bit | 16.7M + alpha | True color with transparency
48-bit | 281 trillion  | Professional photography

Image Compression

Lossless Compression (PNG, GIF)

Reduces file size without losing any data. Uses patterns like run-length encoding (RLE) and dictionary compression. Perfect for screenshots, logos, and graphics with solid colors.

Lossy Compression (JPEG)

Discards information humans can't easily perceive. Divides image into 8×8 blocks, applies DCT (Discrete Cosine Transform), and quantizes coefficients. Great for photographs but causes artifacts in text/sharp edges.

Audio Representation

Sound is a continuous wave of pressure changes. To store it digitally, we must convert it to discrete samples.

Sampling and Quantization

Digital Audio Parameters

    Sample Rate: How often we measure the wave
    - CD quality: 44,100 Hz (44.1 kHz)
    - DVD audio: 48,000 Hz
    - High-res: 96,000 Hz or 192,000 Hz

    Bit Depth: Precision of each sample
    - CD quality: 16-bit (65,536 levels)
    - Professional: 24-bit (16.7 million levels)

    Channels:
    - Mono: 1 channel
    - Stereo: 2 channels
    - 5.1 Surround: 6 channels

    CD Audio data rate:
    44,100 samples/sec × 16 bits × 2 channels
    = 1,411,200 bits/sec ≈ 176 KB/sec
    = 10.6 MB/minute uncompressed
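
The same arithmetic as a tiny C sketch; change the parameters to estimate other formats.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t sample_rate = 44100;   /* samples per second (CD quality) */
        uint32_t bit_depth   = 16;      /* bits per sample */
        uint32_t channels    = 2;       /* stereo */

        uint64_t bits_per_sec  = (uint64_t)sample_rate * bit_depth * channels;
        double   bytes_per_sec = bits_per_sec / 8.0;
        double   mb_per_min    = bytes_per_sec * 60 / 1e6;

        printf("%llu bits/sec\n", (unsigned long long)bits_per_sec);  /* 1,411,200 */
        printf("%.0f bytes/sec (~176 KB/sec)\n", bytes_per_sec);      /* 176,400 */
        printf("%.1f MB per minute uncompressed\n", mb_per_min);      /* ~10.6 */
        return 0;
    }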
                        

Nyquist-Shannon Theorem

To accurately capture a frequency, you must sample at more than twice that frequency. Human hearing goes up to ~20 kHz, so CD's 44.1 kHz sample rate captures all audible frequencies (with margin for filtering).

Audio Formats

Format | Type         | Typical Bitrate | Notes
WAV    | Uncompressed | 1411 kbps (CD)  | Large files, perfect quality
FLAC   | Lossless     | ~900 kbps       | Perfect quality, smaller
MP3    | Lossy        | 128-320 kbps    | Widely compatible
AAC    | Lossy        | 128-256 kbps    | Better than MP3 at same rate
Opus   | Lossy        | 64-256 kbps     | Excellent quality/size ratio

Video Representation

Video is a sequence of images (frames) displayed rapidly, typically with synchronized audio.

Uncompressed Video Data Rates

    1080p video at 30 fps:

    Per frame: 1920 × 1080 × 3 bytes = 6.2 MB
    Per second: 6.2 MB × 30 = 186 MB
    Per minute: 186 × 60 = 11.2 GB
    Per hour: 11.2 × 60 = 672 GB

    This is why video compression is essential!

    H.264 compressed 1080p:
    ~5 Mbps = 0.625 MB/sec = 37.5 MB/min
    Compression ratio: ~300:1
                        

Video Compression Techniques

  • Intra-frame (I-frames): Complete images, compressed like JPEG
  • Predictive (P-frames): Store differences from previous frame
  • Bidirectional (B-frames): Differences from both previous and next
  • Motion compensation: Track moving objects between frames

Container vs. Codec

Container (MP4, MKV, AVI): Wraps video, audio, subtitles, metadata together. Codec (H.264, H.265, VP9): The actual compression algorithm. An MP4 file might contain H.264 video with AAC audio.

Structured Data Formats

Beyond primitive data types, complex data structures are also stored as binary (or text-based) formats.

Binary Formats

  • Protocol Buffers (protobuf): Google's efficient binary serialization
  • MessagePack: Like JSON but binary, smaller and faster
  • BSON: Binary JSON, used by MongoDB
  • SQLite: Self-contained database in a single file

Text Formats

  • JSON: Human-readable, widely used for APIs
  • XML: More verbose, supports schemas
  • YAML: Human-friendly configuration files
  • CSV: Simple tabular data

Same Data in Different Formats

    Data: {name: "Alice", age: 30, active: true}

    JSON (39 bytes):
    {"name":"Alice","age":30,"active":true}

    MessagePack (25 bytes):
    83 A4 6E 61 6D 65 A5 41 6C 69 63 65
    A3 61 67 65 1E A6 61 63 74 69 76 65 C3

    Protocol Buffers (~15 bytes):
    Depends on schema, very compact

    Trade-off: Binary is smaller/faster,
    text is human-readable/debuggable.
                        

Practice Problems

Problem 1: How is 0x12345678 stored in little-endian?

Solution:
    Little-endian stores least significant byte first:

    Address:   0    1    2    3
    Content:  78   56   34   12

    The bytes are reversed compared to how we write the number.
                            

Problem 2: Encode "Hi" in UTF-8

Solution:
    'H' = U+0048 = ASCII 72 = 0x48
    'i' = U+0069 = ASCII 105 = 0x69

    Both are under U+007F, so 1 byte each:

    UTF-8: 48 69 (2 bytes)
    Binary: 01001000 01101001
                            

Problem 3: What's the file size of a 1-minute 48kHz stereo 24-bit audio?

Solution:
    48,000 samples/sec × 24 bits × 2 channels × 60 sec
    = 48,000 × 3 bytes × 2 × 60
    = 17,280,000 bytes
    = 17.28 MB uncompressed
                            

Problem 4: Why can't 0.1 + 0.2 equal 0.3 exactly?

Solution:
    0.1 in binary is a repeating fraction:
    0.0001100110011001100... (repeats forever)

    Like 1/3 in decimal = 0.333... can't be exact,
    0.1 in binary can't be stored exactly.

    When you add two approximations:
    ≈0.1 + ≈0.2 ≈ 0.30000000000000004

    The tiny errors compound.
                            

Summary

You've now explored the fascinating world of data representation—how computers encode the infinite variety of information we work with into simple binary patterns.

  • Integers are stored in fixed-size binary with attention to endianness
  • Floating-point uses sign-exponent-mantissa format (IEEE 754)
  • Text evolved from ASCII to Unicode, with UTF-8 as the web standard
  • Images are grids of color values, compressed with various algorithms
  • Audio samples analog waves at regular intervals
  • Video combines image sequences with audio, heavily compressed

Understanding data representation helps you work with file formats, debug encoding issues, optimize storage, and appreciate the elegant simplicity underlying all digital media.