Introduction: Everything is Binary
Text, images, music, videos, programs, databases—everything stored on a computer is ultimately represented as sequences of 1s and 0s. This guide explores how abstract concepts like letters, colors, and sounds are encoded into binary data that computers can store and process.
This is perhaps the most profound aspect of digital computing: the same binary foundation that performs arithmetic can also represent a symphony, a photograph, or this very sentence you're reading. The key is encoding—agreed-upon schemes that map real-world information to binary patterns.
The Universal Principle
Any information that can be described precisely can be encoded in binary. The challenge isn't whether it's possible—it's choosing encodings that are efficient, compatible, and preserve the information we care about.
Integer Representation
Integers are the simplest data type to represent in binary, because a binary number already is an integer. The details that matter are size, signedness, and byte ordering.
Standard Integer Sizes
| Size | Bits | Unsigned Range | Signed Range | Common Names |
|---|---|---|---|---|
| 1 byte | 8 | 0 to 255 | -128 to 127 | byte, char, int8 |
| 2 bytes | 16 | 0 to 65,535 | -32,768 to 32,767 | short, int16, word |
| 4 bytes | 32 | 0 to ~4.3 billion | ~-2.1 billion to ~2.1 billion | int, int32, dword |
| 8 bytes | 64 | 0 to ~18 quintillion | -9.2×10¹⁸ to 9.2×10¹⁸ | long, int64, qword |
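The ranges in the table follow directly from the bit width. A minimal sketch in Python, assuming two's complement for the signed column (`int_ranges` is a name chosen here for illustration):

```python
def int_ranges(bits):
    """Return ((unsigned_min, unsigned_max), (signed_min, signed_max)) for a bit width."""
    unsigned = (0, 2**bits - 1)
    # Two's complement: one more negative value than positive values.
    signed = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return unsigned, signed


for bits in (8, 16, 32, 64):
    unsigned, signed = int_ranges(bits)
    print(f"{bits:>2} bits: unsigned {unsigned[0]} to {unsigned[1]}, "
          f"signed {signed[0]} to {signed[1]}")
```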
Endianness: Byte Order Matters
When a multi-byte integer is stored in memory, which byte comes first? This is called endianness, and different systems make different choices.
Example: Storing 0x12345678 in Memory
Value: 0x12345678 (4 bytes)
Big-Endian (most significant byte first):
Address: 0x00 0x01 0x02 0x03
Content: 12 34 56 78
Little-Endian (least significant byte first):
Address: 0x00 0x01 0x02 0x03
Content: 78 56 34 12
Same value, different byte order in memory!
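The same layout can be reproduced in a few lines of Python. This is only an illustrative sketch that uses the built-in `int.to_bytes` and `int.from_bytes` methods; the variable names are arbitrary.

```python
value = 0x12345678

# to_bytes lays the integer out in the requested byte order.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Reading the bytes back with the wrong byte order gives a different number.
misread = int.from_bytes(little, byteorder="big")
print(hex(misread))     # 0x78563412
```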
| Endianness | Systems Using It | Advantage |
|---|---|---|
| Big-Endian | Network protocols, SPARC, older Macs | Human-readable hex dumps |
| Little-Endian | x86, x64, ARM (usually), most modern devices | A value's low-order bytes stay at the same address regardless of its width |
| Bi-Endian | ARM (configurable), PowerPC | Flexibility |
Network Byte Order
Network protocols use big-endian (called "network byte order"). When sending data between systems with different endianness, programmers must convert using functions like htons() (host to network short) and ntohl() (network to host long).
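htons() and ntohl() belong to the C sockets API. As a rough Python equivalent, the standard `struct` module can pack values in network byte order with the `!` prefix. The port and address below are made-up sample values.

```python
import struct

port = 8080            # 0x1F90, a 16-bit value (a "short")
address = 0xC0A80001   # 192.168.0.1 as a 32-bit value (a "long")

# "!" selects network (big-endian) order; H = unsigned 16-bit, I = unsigned 32-bit.
wire = struct.pack("!HI", port, address)
print(wire.hex(" "))   # 1f 90 c0 a8 00 01

# The receiver unpacks with the same format, whatever its native byte order is.
decoded_port, decoded_address = struct.unpack("!HI", wire)
print(decoded_port, hex(decoded_address))
```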
Floating-Point Numbers
Integers can't represent fractions or very large/small numbers efficiently. Floating-point representation solves this using a form of scientific notation in binary.
Why We Need Floating-Point
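Consider representing 0.1 in binary. There is no exact finite representation: it is a repeating fraction, so any binary format with a fixed number of bits can only approximate it. The bigger problem is range: how would you represent both 0.0000001 and 100000000 in a fixed number of bits? A fixed-point format needs about 27 bits for the integer part and roughly 24 fractional bits before the smaller value is even distinguishable from zero. Floating-point instead stores a limited number of significant bits plus an exponent that moves the binary point.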
Scientific Notation Parallel
Decimal Scientific Notation:
6.022 × 10²³ (Avogadro's number)
Components:
- Sign: positive
- Mantissa (significand): 6.022
- Base: 10
- Exponent: 23
Binary Floating-Point:
1.101 × 2^5 = 1.625 × 32 = 52
Same concept, base 2 instead of 10!
IEEE 754 Standard
The IEEE 754 standard defines how floating-point numbers are stored. It's used by virtually all modern CPUs.
| Format | Total Bits | Sign | Exponent | Mantissa | Approx. Range |
|---|---|---|---|---|---|
| Half (float16) | 16 | 1 | 5 | 10 | ±6.5×10⁴ |
| Single (float32) | 32 | 1 | 8 | 23 | ±3.4×10³⁸ |
| Double (float64) | 64 | 1 | 11 | 52 | ±1.8×10³⁰⁸ |
32-bit Float Layout
Bit: 31 30--------23 22-----------------------0
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
S = Sign (0 = positive, 1 = negative)
E = Exponent (8 bits, biased by 127)
M = Mantissa (23 bits, with implicit leading 1)
Example: -13.625 in 32-bit float
1. Convert to binary: 13.625 = 1101.101
2. Normalize: 1.101101 × 2³
3. Sign: 1 (negative)
4. Exponent: 3 + 127 = 130 = 10000010
5. Mantissa: 10110100000000000000000 (drop leading 1)
Result: 1 10000010 10110100000000000000000
= 1100 0001 0101 1010 0000 0000 0000 0000
= 0xC15A0000
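The conversion can be checked mechanically. The sketch below uses Python's `struct` module to obtain the 32-bit pattern and then masks out the three fields; `float32_fields` is a helper name invented for this example.

```python
import struct


def float32_fields(x):
    """Split x (rounded to float32) into (sign, biased exponent, mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


sign, exponent, mantissa = float32_fields(-13.625)
print(sign, f"{exponent:08b}", f"{mantissa:023b}")
# 1 10000010 10110100000000000000000
print(struct.pack(">f", -13.625).hex())  # c15a0000
```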
Special Values: Infinity and NaN
IEEE 754 reserves certain bit patterns for special values:
| Value | Exponent | Mantissa | Meaning |
|---|---|---|---|
| Zero | All 0s | All 0s | ±0.0 |
| Denormalized | All 0s | Non-zero | Very small numbers |
| Infinity | All 1s | All 0s | ±∞ |
| NaN | All 1s | Non-zero | Not a Number (0/0, √-1) |
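These patterns can be inspected directly. The following sketch prints the float32 encoding of several special values; `float32_hex` is a hypothetical helper. The exact mantissa bits of a NaN can vary, so the code only checks the rule from the table.

```python
import math
import struct


def float32_hex(x):
    """Return the big-endian float32 encoding of x as a hex string."""
    return struct.pack(">f", x).hex()


print(float32_hex(0.0))        # 00000000
print(float32_hex(-0.0))       # 80000000
print(float32_hex(2**-149))    # 00000001 (smallest denormalized float32)
print(float32_hex(math.inf))   # 7f800000
print(float32_hex(-math.inf))  # ff800000

# NaN: exponent all 1s and a non-zero mantissa.
(nan_bits,) = struct.unpack(">I", struct.pack(">f", math.nan))
nan_exponent = (nan_bits >> 23) & 0xFF
nan_mantissa = nan_bits & 0x7FFFFF
print(nan_exponent == 0xFF, nan_mantissa != 0)  # True True

# NaN compares unequal to everything, including itself.
print(math.nan == math.nan)    # False
```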
Precision and Rounding Errors
Floating-point numbers cannot represent all real numbers exactly. This leads to rounding errors that can accumulate:
The Famous 0.1 + 0.2 Problem
In most programming languages:
0.1 + 0.2 = 0.30000000000000004
Why? 0.1 in binary is:
0.0001100110011001100110011... (repeating)
It can't be stored exactly in finite bits!
This affects all binary floating-point systems.
Solutions: Use integers (cents instead of dollars),
or decimal floating-point libraries.
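Both the problem and the two suggested workarounds are easy to reproduce. This sketch uses Python's standard `decimal` and `fractions` modules; the cent amounts are invented sample values.

```python
from decimal import Decimal
from fractions import Fraction

print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# Fraction(0.1) is the exact value of the stored double, slightly above 1/10.
stored_is_larger = Fraction(0.1) > Fraction(1, 10)
print(stored_is_larger)   # True

# Workaround 1: count in integer units (cents instead of dollars).
total_cents = 10 + 20
print(total_cents == 30)  # True

# Workaround 2: decimal arithmetic, constructed from strings rather than floats.
total = Decimal("0.1") + Decimal("0.2")
print(total == Decimal("0.3"))  # True
```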
Never Compare Floats for Equality
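Instead of if (a == b), use if (abs(a - b) < epsilon), where epsilon is a small tolerance chosen for the magnitude of the values being compared. A relative tolerance is usually safer than a fixed one. Exact equality often fails for computed values because rounding errors accumulate.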
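A short Python sketch of tolerance-based comparison follows. It also shows a case where a fixed epsilon is too strict for large values, which is one reason to prefer a relative tolerance such as the one `math.isclose` applies. The tolerance values are arbitrary choices for the example.

```python
import math

a = 0.1 + 0.2
b = 0.3

print(a == b)                            # False
print(abs(a - b) < 1e-9)                 # True
print(math.isclose(a, b, rel_tol=1e-9))  # True

# Near 1e16, neighbouring doubles are 2.0 apart, so a fixed epsilon of 1e-9
# rejects values that are as close as the format allows.
x = 1e16
y = x + 2.0
print(abs(x - y) < 1e-9)                 # False
print(math.isclose(x, y, rel_tol=1e-9))  # True
```

When comparing against zero, a relative tolerance alone never matches, so `math.isclose` also accepts an `abs_tol` argument.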
Text and Character Encoding
Text is represented by assigning numeric codes to characters. The history of character encoding is a journey from simple 7-bit codes to the universal Unicode standard.
ASCII: The Foundation
ASCII (American Standard Code for Information Interchange) dates to 1963 and uses 7 bits to encode 128 characters:
| Range | Characters | Examples |
|---|---|---|
| 0-31 | Control characters | NUL, TAB, LF, CR |
| 32-47 | Punctuation, symbols | Space ! " # $ % & ' ( ) |
| 48-57 | Digits | 0 1 2 3 4 5 6 7 8 9 |
| 58-64 | More punctuation | : ; < = > ? @ |
| 65-90 | Uppercase letters | A B C ... X Y Z |
| 91-96 | Brackets, etc. | [ \ ] ^ _ ` |
| 97-122 | Lowercase letters | a b c ... x y z |
| 123-126 | More symbols | { \| } ~ |
| 127 | Delete | DEL |
ASCII Patterns Worth Knowing
Useful ASCII relationships:
'A' = 65 = 0x41 = 01000001
'a' = 97 = 0x61 = 01100001
Difference: 32 (bit 5)
'a' - 'A' = 32
To uppercase: char & 0xDF (clear bit 5)
To lowercase: char | 0x20 (set bit 5)
'0' = 48 = 0x30
'9' = 57 = 0x39
Character to digit: char - '0'
Digit to character: digit + '0'
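The relationships above translate into one-line functions. The sketch below is in Python and the function names are made up for illustration. Note the range checks: applying the bit masks to non-letters would corrupt them (for example, `{` would turn into `[`).

```python
def to_upper(ch):
    """Uppercase an ASCII letter by clearing bit 5; other characters pass through."""
    if "a" <= ch <= "z":
        return chr(ord(ch) & 0xDF)
    return ch


def to_lower(ch):
    """Lowercase an ASCII letter by setting bit 5; other characters pass through."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) | 0x20)
    return ch


def digit_value(ch):
    """Convert an ASCII digit character to its numeric value."""
    return ord(ch) - ord("0")


def digit_char(digit):
    """Convert a value 0-9 to its ASCII digit character."""
    return chr(digit + ord("0"))


print(to_upper("a"), to_lower("A"))     # A a
print(digit_value("7"), digit_char(7))  # 7 7
```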
Extended ASCII and Code Pages
The 8th bit allows 128 more characters (128-255), but different systems used this differently:
- ISO-8859-1 (Latin-1): Western European characters (é, ñ, ü)
- ISO-8859-5: Cyrillic characters
- Windows-1252: Microsoft's Western European encoding
- Code Page 437: DOS box-drawing characters
This created a mess: the same byte could represent different characters on different systems!
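To see the problem concretely, the sketch below decodes one and the same byte with four of the code pages listed above, using codec names from Python's standard library.

```python
raw = bytes([0xE9])  # a single byte with the 8th bit set

decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "cp1252", "iso8859_5", "cp437")
}

for name, ch in decoded.items():
    print(f"{name:>9}: {ch}")
# latin-1 and cp1252 give é, iso8859_5 gives щ, cp437 gives Θ
```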
Unicode: The Universal Standard
Unicode assigns a unique code point to every character in every writing system. As of Unicode 15.0 (2022), it defines over 149,000 characters from 161 scripts.
Unicode Code Points
Format: U+XXXX (hexadecimal)
U+0041 = A (Latin uppercase A)
U+03B1 = α (Greek lowercase alpha)
U+4E2D = 中 (Chinese character "middle")
U+1F600 = 😀 (grinning face emoji)
U+1F4BB = 💻 (laptop emoji)
Unicode planes:
U+0000 - U+FFFF = BMP (Basic Multilingual Plane)
U+10000 - U+10FFFF = Supplementary planes (emoji, etc.)
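In Python, `ord` and `chr` convert between characters and code points, which makes the listing above easy to reproduce. Since each plane spans 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The `planes` dictionary is just a convenience for this example.

```python
planes = {}
for ch in "Aα中😀💻":
    code_point = ord(ch)
    planes[ch] = code_point >> 16   # 0 = BMP, 1 and up = supplementary planes
    print(f"U+{code_point:04X}  plane {planes[ch]}  {ch}")

print(chr(0x4E2D))  # 中
```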
UTF-8: The Web's Encoding
UTF-8 is a variable-length encoding that stores Unicode code points in 1-4 bytes. It's backwards-compatible with ASCII and is the dominant encoding on the web.
| Code Point Range | Bytes | Byte Pattern |
|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx |
| U+0080 - U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 - U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 - U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Example: Encoding € (Euro sign) in UTF-8
€ = U+20AC
1. Convert to binary: 0010 0000 1010 1100
2. U+20AC is in range U+0800-U+FFFF, so 3 bytes
3. Pattern: 1110xxxx 10xxxxxx 10xxxxxx
4. Fill in the bits:
0010 0000 1010 1100, regrouped as 4 + 6 + 6 bits:
0010 | 000010 | 101100
1110 0010 | 10 000010 | 10 101100
E2 | 82 | AC
5. UTF-8 encoding: E2 82 AC (3 bytes)
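The same steps can be written as a small encoder. This is a teaching sketch rather than a replacement for Python's built-in `str.encode`: `utf8_encode` is a name invented here, and it does not reject surrogate code points (U+D800 to U+DFFF) the way a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point using the byte patterns from the table."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


print(utf8_encode(0x20AC).hex(" "))                    # e2 82 ac
print(utf8_encode(0x20AC) == "€".encode("utf-8"))      # True
```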
UTF-16 and UTF-32
| Encoding | Bytes per char | Used By | Notes |
|---|---|---|---|
| UTF-8 | 1-4 | Web, Linux, macOS | ASCII compatible |
| UTF-16 | 2 or 4 | Windows, Java, JavaScript | Efficient for Asian text |
| UTF-32 | 4 | Internal processing | Fixed width, simple indexing |
The BOM (Byte Order Mark)
Some files start with U+FEFF to indicate encoding and endianness. In UTF-8: EF BB BF. In UTF-16 BE: FE FF. In UTF-16 LE: FF FE. Many tools strip or ignore the BOM.
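The size differences between the three encodings, and the BOM bytes themselves, can be checked with Python's standard codecs. The sample string is arbitrary; it mixes a 1-byte, a 3-byte, and a 4-byte UTF-8 character.

```python
import codecs

text = "A€😀"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    sizes[encoding] = len(data)
    print(f"{encoding:>9}: {len(data):>2} bytes  {data.hex(' ')}")

# The BOM byte sequences are available as constants.
print(codecs.BOM_UTF8.hex(" "))      # ef bb bf
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe

# The "utf-8-sig" codec strips a leading BOM when decoding.
with_bom = codecs.BOM_UTF8 + "Hi".encode("utf-8")
print(with_bom.decode("utf-8-sig"))  # Hi
```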
Image Representation
Digital images are grids of tiny colored dots called pixels. Each pixel's color is stored as binary data.
Raster Images: Pixels and Color
A raster image is a 2D array of pixels. Each pixel has a color value:
24-bit Color (True Color)
Each pixel: 3 bytes (24 bits)
Red: 8 bits (0-255)
Green: 8 bits (0-255)
Blue: 8 bits (0-255)
Example: Bright orange pixel
R=255, G=165, B=0
Hex: FF A5 00
Binary: 11111111 10100101 00000000
Total colors: 256 × 256 × 256 = 16,777,216
Image size calculation:
1920 × 1080 × 3 bytes = 6,220,800 bytes ≈ 6 MB uncompressed
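As a quick check of the numbers above, here is a sketch with two helper functions whose names are invented for this example: one packs an RGB pixel into 3 bytes, the other computes the uncompressed size of an image.

```python
def pack_rgb(red, green, blue):
    """Pack three 0-255 channel values into a 3-byte pixel."""
    return bytes([red, green, blue])


def raw_image_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed size of a raster image, ignoring any file header."""
    return width * height * bytes_per_pixel


orange = pack_rgb(255, 165, 0)
print(orange.hex(" ").upper())       # FF A5 00
print(256 ** 3)                      # 16777216 possible colors
print(raw_image_bytes(1920, 1080))   # 6220800
```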
Color Depth and Palettes
| Depth | Colors | Usage |
|---|---|---|
| 1-bit | 2 | Black and white |
| 8-bit | 256 | GIF, indexed color |
| 16-bit | 65,536 | High color, older games |
| 24-bit | 16.7 million | True color, standard photos |
| 32-bit | 16.7M + alpha | True color with transparency |
| 48-bit | 281 trillion | Professional photography |
Image Compression
Lossless Compression (PNG, GIF)
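Reduces file size without losing any data. PNG uses DEFLATE (LZ77 dictionary matching plus Huffman coding) and GIF uses LZW dictionary compression; simpler schemes such as run-length encoding (RLE) rest on the same idea of replacing repeated patterns with shorter codes. Well suited to screenshots, logos, and graphics with solid colors.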
Lossy Compression (JPEG)
Discards information humans can't easily perceive. Divides image into 8×8 blocks, applies DCT (Discrete Cosine Transform), and quantizes coefficients. Great for photographs but causes artifacts in text/sharp edges.
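To make the lossless idea concrete, here is a toy run-length encoder in Python. It is only an illustration of the principle: it is not the algorithm PNG, GIF, or JPEG actually use, and the function names are invented for this sketch.

```python
from itertools import groupby


def rle_encode(data):
    """Collapse runs of equal bytes into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(data)]


def rle_decode(pairs):
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)


# One row of a logo: 12 white pixels followed by 4 black pixels (8-bit grayscale).
row = bytes([255] * 12 + [0] * 4)
encoded = rle_encode(row)

print(encoded)                     # [(12, 255), (4, 0)]
print(rle_decode(encoded) == row)  # True, nothing was lost
```

Runs are long in flat-colored graphics and short in photographs, which is why this family of techniques suits screenshots and logos better than camera images.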
Audio Representation
Sound is a continuous wave of pressure changes. To store it digitally, we must convert it to discrete samples.
Sampling and Quantization
Digital Audio Parameters
Sample Rate: How often we measure the wave
- CD quality: 44,100 Hz (44.1 kHz)
- DVD audio: 48,000 Hz
- High-res: 96,000 Hz or 192,000 Hz
Bit Depth: Precision of each sample
- CD quality: 16-bit (65,536 levels)
- Professional: 24-bit (16.7 million levels)
Channels:
- Mono: 1 channel
- Stereo: 2 channels
- 5.1 Surround: 6 channels
CD Audio data rate:
44,100 samples/sec × 16 bits × 2 channels
= 1,411,200 bits/sec ≈ 176 KB/sec
= 10.6 MB/minute uncompressed
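The sketch below shows both halves of the process in Python: sampling a sine wave at fixed intervals and rounding each sample to a signed integer (quantization), then computing the storage needed for uncompressed PCM. The function names and the 11,025 Hz test tone are assumptions made for the example.

```python
import math


def sample_sine(freq_hz, sample_rate, count, bit_depth=16):
    """Sample a full-scale sine wave and quantize it to signed integers."""
    peak = 2 ** (bit_depth - 1) - 1   # 32767 for 16-bit audio
    return [
        round(peak * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(count)
    ]


def pcm_bytes(sample_rate, bit_depth, channels, seconds):
    """Size in bytes of uncompressed PCM audio."""
    return sample_rate * channels * seconds * bit_depth // 8


# A tone at exactly one quarter of the sample rate hits 0, +peak, 0, -peak.
print(sample_sine(11_025, 44_100, 4))   # [0, 32767, 0, -32767]

print(pcm_bytes(44_100, 16, 2, 1))      # 176400 bytes per second (CD audio)
print(pcm_bytes(44_100, 16, 2, 60))     # 10584000 bytes per minute
```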
Nyquist-Shannon Theorem
To accurately capture a frequency, you must sample at more than twice that frequency. Human hearing goes up to ~20 kHz, so CD's 44.1 kHz sample rate captures all audible frequencies (with margin for filtering).
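What goes wrong above the limit can be shown numerically. In this sketch, a 30 kHz cosine sampled at 44.1 kHz produces the same samples as a 14.1 kHz cosine (44,100 - 30,000), so the two tones cannot be told apart after sampling. The frequencies and the helper name are chosen purely for illustration.

```python
import math

SAMPLE_RATE = 44_100


def sample_cosine(freq_hz, count):
    """Sample a unit-amplitude cosine at SAMPLE_RATE."""
    return [math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(count)]


too_high = sample_cosine(30_000, 50)   # above the 22,050 Hz Nyquist limit
alias = sample_cosine(14_100, 50)      # 44,100 - 30,000

indistinguishable = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(too_high, alias)
)
print(indistinguishable)  # True
```

This effect is called aliasing, and it is the reason recorders filter out frequencies above half the sample rate before sampling.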
Audio Formats
| Format | Type | Typical Bitrate | Notes |
|---|---|---|---|
| WAV | Uncompressed | 1411 kbps (CD) | Large files, perfect quality |
| FLAC | Lossless | ~900 kbps | Perfect quality, smaller |
| MP3 | Lossy | 128-320 kbps | Widely compatible |
| AAC | Lossy | 128-256 kbps | Better than MP3 at same rate |
| Opus | Lossy | 64-256 kbps | Excellent quality/size ratio |
Video Representation
Video is a sequence of images (frames) displayed rapidly, typically with synchronized audio.
Uncompressed Video Data Rates
1080p video at 30 fps:
Per frame: 1920 × 1080 × 3 bytes = 6.2 MB
Per second: 6.2 MB × 30 = 186 MB
Per minute: 186 × 60 = 11.2 GB
Per hour: 11.2 × 60 = 672 GB
This is why video compression is essential!
H.264 compressed 1080p:
~5 Mbps = 0.625 MB/sec = 37.5 MB/min
Compression ratio: ~300:1
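The arithmetic above can be reproduced in a few lines of Python. The helper name is hypothetical, and the 5 Mbps figure is the approximate H.264 bitrate quoted in the example.

```python
def raw_video_bytes_per_second(width, height, fps, bytes_per_pixel=3):
    """Data rate of uncompressed video, ignoring audio and container overhead."""
    return width * height * bytes_per_pixel * fps


raw_rate = raw_video_bytes_per_second(1920, 1080, 30)
compressed_rate = 5_000_000 / 8   # 5 Mbps expressed in bytes per second

print(raw_rate)                           # 186624000 bytes per second
print(raw_rate * 3600 / 1e9)              # about 672 GB per hour
print(round(raw_rate / compressed_rate))  # 299, i.e. roughly 300:1
```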
Video Compression Techniques
- Intra-frame (I-frames): Complete images, compressed like JPEG
- Predictive (P-frames): Store differences from previous frame
- Bidirectional (B-frames): Differences from both previous and next
- Motion compensation: Track moving objects between frames
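The idea behind P-frames can be sketched with a toy example that stores only the pixels that changed since the previous frame. Real codecs work on blocks with motion vectors and transformed residuals rather than individual pixels, so treat this purely as an illustration; the function names are invented.

```python
def frame_delta(previous, current):
    """List (index, new_value) for every pixel that differs from the previous frame."""
    return [
        (index, new)
        for index, (old, new) in enumerate(zip(previous, current))
        if old != new
    ]


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the stored changes."""
    frame = list(previous)
    for index, value in delta:
        frame[index] = value
    return frame


# Two tiny 8-pixel grayscale "frames": a bright dot moves one pixel to the right.
frame_1 = [0, 0, 255, 0, 0, 0, 0, 0]
frame_2 = [0, 0, 0, 255, 0, 0, 0, 0]

delta = frame_delta(frame_1, frame_2)
print(delta)                                   # [(2, 0), (3, 255)]
print(apply_delta(frame_1, delta) == frame_2)  # True
```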
Container vs. Codec
Container (MP4, MKV, AVI): Wraps video, audio, subtitles, metadata together. Codec (H.264, H.265, VP9): The actual compression algorithm. An MP4 file might contain H.264 video with AAC audio.
Structured Data Formats
Beyond primitive data types, complex data structures are also stored as binary (or text-based) formats.
Binary Formats
- Protocol Buffers (protobuf): Google's efficient binary serialization
- MessagePack: Like JSON but binary, smaller and faster
- BSON: Binary JSON, used by MongoDB
- SQLite: Self-contained database in a single file
Text Formats
- JSON: Human-readable, widely used for APIs
- XML: More verbose, supports schemas
- YAML: Human-friendly configuration files
- CSV: Simple tabular data
Same Data in Different Formats
Data: {name: "Alice", age: 30, active: true}
JSON (39 bytes):
{"name":"Alice","age":30,"active":true}
MessagePack (25 bytes):
83 A4 6E 61 6D 65 A5 41 6C 69 63 65
A3 61 67 65 1E A6 61 63 74 69 76 65 C3
Protocol Buffers (about 11 bytes with field numbers 1-3):
Field names live in the schema, not in the message, so it is very compact
Trade-off: Binary is smaller/faster,
text is human-readable/debuggable.
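The byte counts can be verified with a short script. JSON comes from Python's standard library. For MessagePack, the sketch hand-encodes only the handful of types this record needs (`msgpack_small` is a hypothetical helper, not the real msgpack library), using the format's single-byte type tags for small maps, short strings, small integers, and booleans.

```python
import json

record = {"name": "Alice", "age": 30, "active": True}

# Compact JSON: no spaces after separators.
compact_json = json.dumps(record, separators=(",", ":")).encode("utf-8")


def msgpack_small(value):
    """Encode a tiny MessagePack subset: bool, int 0-127, short str, small dict."""
    if value is True:
        return b"\xc3"
    if value is False:
        return b"\xc2"
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                     # positive fixint
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr
    if isinstance(value, dict) and len(value) <= 15:
        out = bytes([0x80 | len(value)])           # fixmap
        for key, item in value.items():
            out += msgpack_small(key) + msgpack_small(item)
        return out
    raise ValueError("type or size outside this sketch's subset")


packed = msgpack_small(record)

print(len(compact_json), compact_json.decode())  # 39 {"name":"Alice","age":30,"active":true}
print(len(packed), packed.hex(" ").upper())      # 25 83 A4 6E 61 ... C3
```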
Practice Problems
Problem 1: How is 0x12345678 stored in little-endian?
Solution:
Little-endian stores least significant byte first:
Address: 0 1 2 3
Content: 78 56 34 12
The bytes are reversed compared to how we write the number.
Problem 2: Encode "Hi" in UTF-8
Solution:
'H' = U+0048 = ASCII 72 = 0x48
'i' = U+0069 = ASCII 105 = 0x69
Both are under U+007F, so 1 byte each:
UTF-8: 48 69 (2 bytes)
Binary: 01001000 01101001
Problem 3: What's the file size of a 1-minute 48kHz stereo 24-bit audio?
Solution:
48,000 samples/sec × 24 bits × 2 channels × 60 sec
= 48,000 × 3 bytes × 2 × 60
= 17,280,000 bytes
= 17.28 MB uncompressed
Problem 4: Why can't 0.1 + 0.2 equal 0.3 exactly?
Solution:
0.1 in binary is a repeating fraction:
0.0001100110011001100... (repeats forever)
Like 1/3 in decimal = 0.333... can't be exact,
0.1 in binary can't be stored exactly.
When you add two approximations:
≈0.1 + ≈0.2 ≈ 0.30000000000000004
The tiny errors compound.
Summary
You've now explored the fascinating world of data representation—how computers encode the infinite variety of information we work with into simple binary patterns.
- Integers are stored in fixed-size binary with attention to endianness
- Floating-point uses sign-exponent-mantissa format (IEEE 754)
- Text evolved from ASCII to Unicode, with UTF-8 as the web standard
- Images are grids of color values, compressed with various algorithms
- Audio is stored as samples of an analog wave taken at regular intervals
- Video combines image sequences with audio, heavily compressed
Understanding data representation helps you work with file formats, debug encoding issues, optimize storage, and appreciate the elegant simplicity underlying all digital media.