Data Representation

How Computers Encode Everything as Binary

Introduction: Everything is Binary

Text, images, music, videos, programs, databases—everything stored on a computer is ultimately represented as sequences of 1s and 0s. This guide explores how abstract concepts like letters, colors, and sounds are encoded into binary data that computers can store and process.

This is perhaps the most profound aspect of digital computing: the same binary foundation that performs arithmetic can also represent a symphony, a photograph, or this very sentence you're reading. The key is encoding—agreed-upon schemes that map real-world information to binary patterns.

The Universal Principle

Any information that can be described precisely can be encoded in binary. The challenge isn't whether it's possible—it's choosing encodings that are efficient, compatible, and preserve the information we care about.

  • 149,813 Unicode characters defined
  • 64 bits in a double-precision float
  • 16.7 million colors in a 24-bit image
  • 44,100 audio samples per second in CD quality

Integer Representation

Integers are the simplest data type to represent, because a plain binary number read directly is an unsigned integer. The important details are the fixed sizes used and the byte ordering in memory.

Standard Integer Sizes

Size     | Bits | Unsigned Range        | Signed Range              | Common Names
1 byte   | 8    | 0 to 255              | -128 to 127               | byte, char, int8
2 bytes  | 16   | 0 to 65,535           | -32,768 to 32,767         | short, int16, word
4 bytes  | 32   | 0 to ~4.3 billion     | -2.1B to 2.1B             | int, int32, dword
8 bytes  | 64   | 0 to ~18 quintillion  | -9.2×10¹⁸ to 9.2×10¹⁸     | long, int64, qword
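
The "common names" vary by language and platform; in C, the fixed-width types from <stdint.h> map onto these sizes exactly. A minimal sketch (standard C99) that prints a few of the ranges above:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>   /* PRId32, PRId64, PRIu64 format macros */

    int main(void) {
        printf("int8_t  : %d to %d\n", INT8_MIN, INT8_MAX);      /* -128 to 127 */
        printf("int16_t : %d to %d\n", INT16_MIN, INT16_MAX);    /* -32,768 to 32,767 */
        printf("int32_t : %" PRId32 " to %" PRId32 "\n", INT32_MIN, INT32_MAX);
        printf("int64_t : %" PRId64 " to %" PRId64 "\n", INT64_MIN, INT64_MAX);
        printf("uint64_t: 0 to %" PRIu64 "\n", UINT64_MAX);      /* ~18 quintillion */
        return 0;
    }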

Endianness: Byte Order Matters

When a multi-byte integer is stored in memory, which byte comes first? This is called endianness, and different systems make different choices.

Example: Storing 0x12345678 in Memory

    Value: 0x12345678 (4 bytes)

    Big-Endian (most significant byte first):
    Address:   0x00  0x01  0x02  0x03
    Content:    12    34    56    78

    Little-Endian (least significant byte first):
    Address:   0x00  0x01  0x02  0x03
    Content:    78    56    34    12

    Same value, different byte order in memory!
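
You can see this directly by copying an integer's raw bytes into an array and printing them. A minimal C sketch (the output comment assumes a little-endian x86/x64 machine):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t value = 0x12345678;
        uint8_t bytes[4];
        memcpy(bytes, &value, sizeof value);   /* copy the raw in-memory byte order */

        for (int i = 0; i < 4; i++)
            printf("offset %d: %02X\n", i, bytes[i]);

        /* Little-endian (x86/x64): 78 56 34 12.  Big-endian: 12 34 56 78. */
        return 0;
    }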
                        
Endianness    | Systems Using It                         | Advantage
Big-Endian    | Network protocols, SPARC, older Macs     | Human-readable hex dumps
Little-Endian | x86, x64, ARM (usually), modern devices  | Efficient for some operations
Bi-Endian     | ARM (configurable), PowerPC              | Flexibility

Network Byte Order

Network protocols use big-endian (called "network byte order"). When sending data between systems with different endianness, programmers must convert using functions like htons() (host to network short) and ntohl() (network to host long).
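
Here is a small C sketch of that conversion using the POSIX byte-order functions from <arpa/inet.h>; on a little-endian host the byte order changes, while on a big-endian host the calls are effectively no-ops.

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>   /* htons, htonl, ntohs, ntohl (POSIX) */

    int main(void) {
        uint16_t port = 8080;
        uint32_t addr = 0x12345678;

        uint16_t port_net = htons(port);   /* host to network short (16-bit) */
        uint32_t addr_net = htonl(addr);   /* host to network long  (32-bit) */

        printf("host order:    port=0x%04X  addr=0x%08X\n", port, addr);
        printf("network order: port=0x%04X  addr=0x%08X\n", port_net, addr_net);

        /* Converting back recovers the original host-order values */
        printf("round trip:    port=%u  addr=0x%08X\n",
               ntohs(port_net), ntohl(addr_net));
        return 0;
    }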

Floating-Point Numbers

Integers can't represent fractions or very large/small numbers efficiently. Floating-point representation solves this using a form of scientific notation in binary.

Why We Need Floating-Point

Consider representing 0.1 in binary. There's no exact finite representation—it's a repeating fraction! And how would you represent both 0.0000001 and 100000000 in a fixed number of bits? Fixed-point fails; we need floating-point.

Scientific Notation Parallel

    Decimal Scientific Notation:
    6.022 × 10²³  (Avogadro's number)

    Components:
    - Sign: positive
    - Mantissa (significand): 6.022
    - Base: 10
    - Exponent: 23

    Binary Floating-Point:
    1.101 × 2^5 = 1.625 × 32 = 52

    Same concept, base 2 instead of 10!
                        

IEEE 754 Standard

The IEEE 754 standard defines how floating-point numbers are stored. It's used by virtually all modern CPUs.

Format           | Total Bits | Sign | Exponent | Mantissa | Approx. Range
Half (float16)   | 16         | 1    | 5        | 10       | ±6.5×10⁴
Single (float32) | 32         | 1    | 8        | 23       | ±3.4×10³⁸
Double (float64) | 64         | 1    | 11       | 52       | ±1.8×10³⁰⁸

32-bit Float Layout

    Bit:  31 30--------23 22-----------------------0
          S  EEEEEEEE     MMMMMMMMMMMMMMMMMMMMMMM

    S = Sign (0 = positive, 1 = negative)
    E = Exponent (8 bits, biased by 127)
    M = Mantissa (23 bits, with implicit leading 1)

    Example: -13.625 in 32-bit float

    1. Convert to binary: 13.625 = 1101.101
    2. Normalize: 1.101101 × 2³
    3. Sign: 1 (negative)
    4. Exponent: 3 + 127 = 130 = 10000010
    5. Mantissa: 10110100000000000000000 (drop leading 1)

    Result: 1 10000010 10110100000000000000000
            = 0xC15A0000
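
You can verify this by reinterpreting the float's raw bits as an integer. A short C sketch (it assumes the platform uses IEEE 754 single precision, which all mainstream CPUs do):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = -13.625f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);           /* reinterpret the raw bits */

        uint32_t sign     = bits >> 31;           /* bit 31 */
        uint32_t exponent = (bits >> 23) & 0xFF;  /* bits 30-23 */
        uint32_t mantissa = bits & 0x7FFFFF;      /* bits 22-0 */

        printf("bits     = 0x%08X\n", bits);      /* 0xC15A0000 */
        printf("sign     = %u\n", sign);          /* 1 (negative) */
        printf("exponent = %u (unbiased: %d)\n", exponent, (int)exponent - 127);  /* 130, 3 */
        printf("mantissa = 0x%06X\n", mantissa);  /* 0x5A0000 = 1011010... */
        return 0;
    }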
                        

Special Values: Infinity and NaN

IEEE 754 reserves certain bit patterns for special values:

Value        | Exponent | Mantissa | Meaning
Zero         | All 0s   | All 0s   | ±0.0
Denormalized | All 0s   | Non-zero | Very small numbers
Infinity     | All 1s   | All 0s   | ±∞
NaN          | All 1s   | Non-zero | Not a Number (0/0, √-1)

Precision and Rounding Errors

Floating-point numbers cannot represent all real numbers exactly. This leads to rounding errors that can accumulate:

The Famous 0.1 + 0.2 Problem

    In most programming languages:
    0.1 + 0.2 = 0.30000000000000004

    Why? 0.1 in binary is:
    0.0001100110011001100110011... (repeating)

    It can't be stored exactly in finite bits!

    This affects all binary floating-point systems.
    Solutions: Use integers (cents instead of dollars),
               or decimal floating-point libraries.
                        

Never Compare Floats for Equality

Instead of if (a == b), use if (abs(a - b) < epsilon) where epsilon is a small tolerance. Exact equality rarely works due to accumulated rounding errors.
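
A minimal C illustration of both the problem and the tolerance-based comparison; the epsilon of 1e-9 is an arbitrary choice for this sketch, and real code should pick a tolerance suited to the magnitudes involved.

    #include <stdio.h>
    #include <math.h>

    /* Compare within an absolute tolerance (the simplest form of the idea) */
    static int nearly_equal(double a, double b, double epsilon) {
        return fabs(a - b) < epsilon;
    }

    int main(void) {
        double sum = 0.1 + 0.2;

        printf("0.1 + 0.2 = %.17f\n", sum);                 /* 0.30000000000000004 */
        printf("sum == 0.3 -> %d\n", sum == 0.3);           /* 0 (false!) */
        printf("nearly_equal(sum, 0.3, 1e-9) -> %d\n",
               nearly_equal(sum, 0.3, 1e-9));               /* 1 (true) */
        return 0;
    }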

Text and Character Encoding

Text is represented by assigning numeric codes to characters. The history of character encoding is a journey from simple 7-bit codes to the universal Unicode standard.

ASCII: The Foundation

ASCII (American Standard Code for Information Interchange) dates to 1963 and uses 7 bits to encode 128 characters:

Range    | Characters            | Examples
0-31     | Control characters    | NUL, TAB, LF, CR
32-47    | Punctuation, symbols  | Space ! " # $ % & ' ( )
48-57    | Digits                | 0 1 2 3 4 5 6 7 8 9
58-64    | More punctuation      | : ; < = > ? @
65-90    | Uppercase letters     | A B C ... X Y Z
91-96    | Brackets, etc.        | [ \ ] ^ _ `
97-122   | Lowercase letters     | a b c ... x y z
123-126  | More symbols          | { | } ~
127      | Delete                | DEL

ASCII Patterns Worth Knowing

    Useful ASCII relationships:

    'A' = 65 = 0x41 = 01000001
    'a' = 97 = 0x61 = 01100001
    Difference: 32 (bit 5)

    'a' - 'A' = 32
    To uppercase: char & 0xDF (clear bit 5)
    To lowercase: char | 0x20 (set bit 5)

    '0' = 48 = 0x30
    '9' = 57 = 0x39
    Character to digit: char - '0'
    Digit to character: digit + '0'
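
The same relationships, written out as a runnable C snippet; note that the bit tricks assume the input really is an ASCII letter or digit.

    #include <stdio.h>

    int main(void) {
        char upper = 'a' & 0xDF;   /* clear bit 5: 'a' (0x61) -> 'A' (0x41) */
        char lower = 'A' | 0x20;   /* set bit 5:   'A' (0x41) -> 'a' (0x61) */
        int  digit = '7' - '0';    /* character '7' (0x37) -> integer 7 */
        char ch    = 3 + '0';      /* integer 3 -> character '3' (0x33) */

        printf("'a' & 0xDF = %c\n", upper);   /* A */
        printf("'A' | 0x20 = %c\n", lower);   /* a */
        printf("'7' - '0'  = %d\n", digit);   /* 7 */
        printf("3 + '0'    = %c\n", ch);      /* 3 */
        return 0;
    }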
                        

Extended ASCII and Code Pages

The 8th bit allows 128 more characters (128-255), but different systems used this differently:

  • ISO-8859-1 (Latin-1): Western European characters (é, ñ, ü)
  • ISO-8859-5: Cyrillic characters
  • Windows-1252: Microsoft's Western European encoding
  • Code Page 437: DOS box-drawing characters

This created a mess: the same byte could represent different characters on different systems!

Unicode: The Universal Standard

Unicode assigns a unique code point to every character in every writing system. Currently over 149,000 characters from 161 scripts.

Unicode Code Points

    Format: U+XXXX (hexadecimal)

    U+0041 = A (Latin uppercase A)
    U+03B1 = α (Greek lowercase alpha)
    U+4E2D = 中 (Chinese character "middle")
    U+1F600 = 😀 (grinning face emoji)
    U+1F4BB = 💻 (laptop emoji)

    Unicode planes:
    U+0000 - U+FFFF   = BMP (Basic Multilingual Plane)
    U+10000 - U+10FFFF = Supplementary planes (emoji, etc.)
                        

UTF-8: The Web's Encoding

UTF-8 is a variable-length encoding that stores Unicode code points in 1-4 bytes. It's backwards-compatible with ASCII and is the dominant encoding on the web.

Code Point Range    | Bytes | Byte Pattern
U+0000 - U+007F     | 1     | 0xxxxxxx
U+0080 - U+07FF     | 2     | 110xxxxx 10xxxxxx
U+0800 - U+FFFF     | 3     | 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF  | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: Encoding € (Euro sign) in UTF-8

    € = U+20AC

    1. Convert to binary: 0010 0000 1010 1100

    2. U+20AC is in range U+0800-U+FFFF, so 3 bytes

    3. Pattern: 1110xxxx 10xxxxxx 10xxxxxx

    4. Fill in the bits:
       Split the 16 bits as 4 + 6 + 6:
       0010 | 000010 | 101100

       1110 0010  10 000010  10 101100
         E    2      8    2     A    C

    5. UTF-8 encoding: E2 82 AC (3 bytes)
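
The byte-pattern table above translates almost directly into code. Here is a compact C sketch that encodes one code point into UTF-8; to keep it short it does no validation (it will not reject surrogates or values above U+10FFFF).

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point as UTF-8; returns the number of bytes written. */
    static int utf8_encode(uint32_t cp, uint8_t out[4]) {
        if (cp <= 0x7F) {                       /* 1 byte: 0xxxxxxx */
            out[0] = (uint8_t)cp;
            return 1;
        } else if (cp <= 0x7FF) {               /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp <= 0xFFFF) {              /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else {                                /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
    }

    int main(void) {
        uint8_t buf[4];
        int n = utf8_encode(0x20AC, buf);       /* U+20AC, the Euro sign */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);            /* prints: E2 82 AC */
        printf("\n");
        return 0;
    }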
                        

UTF-16 and UTF-32

Encoding | Bytes per Char | Used By                   | Notes
UTF-8    | 1-4            | Web, Linux, macOS         | ASCII compatible
UTF-16   | 2 or 4         | Windows, Java, JavaScript | Efficient for Asian text
UTF-32   | 4              | Internal processing       | Fixed width, simple indexing

The BOM (Byte Order Mark)

Some files start with U+FEFF to indicate encoding and endianness. In UTF-8: EF BB BF. In UTF-16 BE: FE FF. In UTF-16 LE: FF FE. Many tools strip or ignore the BOM.
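
A small C sketch of how a tool might sniff those byte patterns at the start of a file buffer (checking only the three BOMs mentioned here):

    #include <stdio.h>
    #include <stddef.h>

    /* Return a label for a BOM at the start of the buffer, or NULL if none found. */
    static const char *detect_bom(const unsigned char *buf, size_t len) {
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) return "UTF-8";
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return "UTF-16 BE";
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return "UTF-16 LE";
        return NULL;
    }

    int main(void) {
        unsigned char file_start[] = { 0xEF, 0xBB, 0xBF, 'h', 'i' };
        const char *bom = detect_bom(file_start, sizeof file_start);
        printf("BOM: %s\n", bom ? bom : "none");   /* prints: BOM: UTF-8 */
        return 0;
    }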

Image Representation

Digital images are grids of tiny colored dots called pixels. Each pixel's color is stored as binary data.

Raster Images: Pixels and Color

A raster image is a 2D array of pixels. Each pixel has a color value:

24-bit Color (True Color)

    Each pixel: 3 bytes (24 bits)

    Red:   8 bits (0-255)
    Green: 8 bits (0-255)
    Blue:  8 bits (0-255)

    Example: Bright orange pixel
    R=255, G=165, B=0
    Hex: FF A5 00
    Binary: 11111111 10100101 00000000

    Total colors: 256 × 256 × 256 = 16,777,216

    Image size calculation:
    1920 × 1080 × 3 bytes = 6,220,800 bytes ≈ 6 MB uncompressed
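
A short C sketch that packs that orange pixel into a single 24-bit value and repeats the frame-size arithmetic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Pack bright orange (R=255, G=165, B=0) into one 24-bit value */
        uint8_t r = 255, g = 165, b = 0;
        uint32_t pixel = ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
        printf("pixel = 0x%06X\n", pixel);              /* 0xFFA500 */

        /* Uncompressed size of one 1920x1080 frame at 3 bytes per pixel */
        uint64_t frame_bytes = 1920ULL * 1080 * 3;
        printf("1080p frame: %llu bytes (~%.1f MB)\n",
               (unsigned long long)frame_bytes, frame_bytes / 1e6);  /* 6,220,800 bytes */
        return 0;
    }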
                        

Color Depth and Palettes

Depth  | Colors        | Usage
1-bit  | 2             | Black and white
8-bit  | 256           | GIF, indexed color
16-bit | 65,536        | High color, older games
24-bit | 16.7 million  | True color, standard photos
32-bit | 16.7M + alpha | True color with transparency
48-bit | 281 trillion  | Professional photography

Image Compression

Lossless Compression (PNG, GIF)

Reduces file size without losing any data. Uses patterns like run-length encoding (RLE) and dictionary compression. Perfect for screenshots, logos, and graphics with solid colors.

Lossy Compression (JPEG)

Discards information humans can't easily perceive. Divides image into 8×8 blocks, applies DCT (Discrete Cosine Transform), and quantizes coefficients. Great for photographs but causes artifacts in text/sharp edges.

Audio Representation

Sound is a continuous wave of pressure changes. To store it digitally, we must convert it to discrete samples.

Sampling and Quantization

Digital Audio Parameters

    Sample Rate: How often we measure the wave
    - CD quality: 44,100 Hz (44.1 kHz)
    - DVD audio: 48,000 Hz
    - High-res: 96,000 Hz or 192,000 Hz

    Bit Depth: Precision of each sample
    - CD quality: 16-bit (65,536 levels)
    - Professional: 24-bit (16.7 million levels)

    Channels:
    - Mono: 1 channel
    - Stereo: 2 channels
    - 5.1 Surround: 6 channels

    CD Audio data rate:
    44,100 samples/sec × 16 bits × 2 channels
    = 1,411,200 bits/sec ≈ 176 KB/sec
    = 10.6 MB/minute uncompressed
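
The same arithmetic as a tiny C sketch; change the parameters to estimate other formats.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t sample_rate = 44100;   /* samples per second (CD quality) */
        uint32_t bit_depth   = 16;      /* bits per sample */
        uint32_t channels    = 2;       /* stereo */

        uint64_t bits_per_sec  = (uint64_t)sample_rate * bit_depth * channels;
        double   bytes_per_sec = bits_per_sec / 8.0;
        double   mb_per_min    = bytes_per_sec * 60 / 1e6;

        printf("%llu bits/sec\n", (unsigned long long)bits_per_sec);  /* 1,411,200 */
        printf("%.0f bytes/sec (~176 KB/sec)\n", bytes_per_sec);      /* 176,400 */
        printf("%.1f MB per minute uncompressed\n", mb_per_min);      /* ~10.6 */
        return 0;
    }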
                        

Nyquist-Shannon Theorem

To accurately capture a frequency, you must sample at more than twice that frequency. Human hearing goes up to ~20 kHz, so CD's 44.1 kHz sample rate captures all audible frequencies (with margin for filtering).

Audio Formats

Format | Type         | Typical Bitrate | Notes
WAV    | Uncompressed | 1411 kbps (CD)  | Large files, perfect quality
FLAC   | Lossless     | ~900 kbps       | Perfect quality, smaller
MP3    | Lossy        | 128-320 kbps    | Widely compatible
AAC    | Lossy        | 128-256 kbps    | Better than MP3 at same rate
Opus   | Lossy        | 64-256 kbps     | Excellent quality/size ratio

Video Representation

Video is a sequence of images (frames) displayed rapidly, typically with synchronized audio.

Uncompressed Video Data Rates

    1080p video at 30 fps:

    Per frame: 1920 × 1080 × 3 bytes = 6.2 MB
    Per second: 6.2 MB × 30 = 186 MB
    Per minute: 186 × 60 = 11.2 GB
    Per hour: 11.2 × 60 = 672 GB

    This is why video compression is essential!

    H.264 compressed 1080p:
    ~5 Mbps = 0.625 MB/sec = 37.5 MB/min
    Compression ratio: ~300:1
                        

Video Compression Techniques

  • Intra-frame (I-frames): Complete images, compressed like JPEG
  • Predictive (P-frames): Store differences from previous frame
  • Bidirectional (B-frames): Differences from both previous and next
  • Motion compensation: Track moving objects between frames

Container vs. Codec

Container (MP4, MKV, AVI): Wraps video, audio, subtitles, metadata together. Codec (H.264, H.265, VP9): The actual compression algorithm. An MP4 file might contain H.264 video with AAC audio.

Structured Data Formats

Beyond primitive data types, complex data structures are also stored as binary (or text-based) formats.

Binary Formats

  • Protocol Buffers (protobuf): Google's efficient binary serialization
  • MessagePack: Like JSON but binary, smaller and faster
  • BSON: Binary JSON, used by MongoDB
  • SQLite: Self-contained database in a single file

Text Formats

  • JSON: Human-readable, widely used for APIs
  • XML: More verbose, supports schemas
  • YAML: Human-friendly configuration files
  • CSV: Simple tabular data

Same Data in Different Formats

    Data: {name: "Alice", age: 30, active: true}

    JSON (39 bytes):
    {"name":"Alice","age":30,"active":true}

    MessagePack (25 bytes):
    83 A4 6E 61 6D 65 A5 41 6C 69 63 65
    A3 61 67 65 1E A6 61 63 74 69 76 65 C3

    Protocol Buffers (~15 bytes):
    Depends on schema, very compact

    Trade-off: Binary is smaller/faster,
    text is human-readable/debuggable.
                        

Practice Problems

Problem 1: How is 0x12345678 stored in little-endian?

Solution:
    Little-endian stores least significant byte first:

    Address:   0    1    2    3
    Content:  78   56   34   12

    The bytes are reversed compared to how we write the number.
                            

Problem 2: Encode "Hi" in UTF-8

Solution:
    'H' = U+0048 = ASCII 72 = 0x48
    'i' = U+0069 = ASCII 105 = 0x69

    Both are under U+007F, so 1 byte each:

    UTF-8: 48 69 (2 bytes)
    Binary: 01001000 01101001
                            

Problem 3: What's the file size of a 1-minute 48kHz stereo 24-bit audio?

Solution:
    48,000 samples/sec × 24 bits × 2 channels × 60 sec
    = 48,000 × 3 bytes × 2 × 60
    = 17,280,000 bytes
    = 17.28 MB uncompressed
                            

Problem 4: Why can't 0.1 + 0.2 equal 0.3 exactly?

Solution:
    0.1 in binary is a repeating fraction:
    0.0001100110011001100... (repeats forever)

    Like 1/3 in decimal = 0.333... can't be exact,
    0.1 in binary can't be stored exactly.

    When you add two approximations:
    ≈0.1 + ≈0.2 ≈ 0.30000000000000004

    The tiny errors compound.
                            

Summary

You've now explored the fascinating world of data representation—how computers encode the infinite variety of information we work with into simple binary patterns.

  • Integers are stored in fixed-size binary with attention to endianness
  • Floating-point uses sign-exponent-mantissa format (IEEE 754)
  • Text evolved from ASCII to Unicode, with UTF-8 as the web standard
  • Images are grids of color values, compressed with various algorithms
  • Audio samples analog waves at regular intervals
  • Video combines image sequences with audio, heavily compressed

Understanding data representation helps you work with file formats, debug encoding issues, optimize storage, and appreciate the elegant simplicity underlying all digital media.