Reservoir Sampling: Random Sampling from an Endless Data Stream

by Kiki

Modern data often arrives as a stream: click events, sensor readings, app logs, payment transactions, and IoT telemetry. In many real systems, you cannot pause the stream, load everything into memory, and then sample it. You may not even know how many records will arrive by the end of the day. Reservoir sampling solves this exact problem: it lets you pick a simple random sample from a stream of unknown size using a small, fixed amount of memory. If you are learning streaming analytics in a data science course in Ahmedabad, reservoir sampling is one of the cleanest examples of how probability and engineering meet in practice.

Why Sampling from Streams Is Hard

Sampling from a fixed dataset is straightforward: if you have all rows in a table, you can shuffle them or use random indices. Streams change the rules:

  • Unknown total size (N): You do not know how many items will appear.
  • Memory limits: Storing every item is often impossible or too expensive.
  • One-pass processing: Many pipelines process each record once and move on.

A naive approach like “take the first k items” is biased toward early records: anything that arrives later has no chance of being chosen. Another naive approach, keeping each item independently with some fixed probability, gives a sample whose size varies from run to run and keeps growing with the stream. Reservoir sampling provides a mathematically correct way to keep exactly k items, uniformly sampled over everything seen so far.
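To make the two naive approaches concrete, here is a minimal Python sketch. The function names `first_k` and `bernoulli_sample` and the keep-probability parameter `p` are illustrative assumptions, not part of any library.

```python
import random
from itertools import islice


def first_k(stream, k):
    """Naive approach 1: keep the first k items (later items can never be chosen)."""
    return list(islice(stream, k))


def bernoulli_sample(stream, p, rng=None):
    """Naive approach 2: keep each item independently with probability p.

    The sample size is random and grows with the length of the stream.
    """
    rng = rng or random.Random()
    return [item for item in stream if rng.random() < p]


if __name__ == "__main__":
    print(first_k(range(1000), 5))
    # Sample sizes differ between runs even though p is fixed.
    print([len(bernoulli_sample(range(1000), 0.01)) for _ in range(5)])
```

Running the script a few times should show the first function always returning the same early items and the second returning lists of different lengths.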

The Core Algorithm (Reservoir of Size k)

The idea is simple: maintain a “reservoir” of k items while reading the stream.

  1. Fill the reservoir: Put the first k items from the stream into the reservoir.
  2. Process the i-th item (i starts at k+1):
    • Generate a random integer j uniformly in the range [1, i].
    • If j ≤ k, replace the j-th item in the reservoir with the new item.
    • Otherwise, discard the new item.

That is it. At any point, the reservoir holds k items that form a simple random sample of all items seen so far—meaning every item has the same probability of being included.

This algorithm is often called Algorithm R (for “reservoir”). It is widely used because it is easy to implement, fast, and memory-efficient.
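The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than a production implementation; the function name `reservoir_sample` and the optional `rng` parameter (useful for seeding) are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items chosen uniformly at random from stream.

    If the stream has fewer than k items, all of them are returned.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            # Step 1: fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Step 2: draw j uniformly from [1, i], both ends inclusive.
            j = rng.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item  # replace the j-th slot
            # otherwise the new item is discarded
    return reservoir


if __name__ == "__main__":
    # The stream is consumed once and only k items are ever held in memory.
    print(reservoir_sample(range(1_000_000), 5))
```

Because the function only iterates over `stream`, it works equally well with a generator, a file object, or any other iterable whose length is not known in advance.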

Why It Works (Correctness Intuition)

Reservoir sampling guarantees uniformity without knowing the final stream length. Here is the intuition.

  • When you first fill the reservoir with the first k items, each of those items is obviously in the reservoir.
  • When item i arrives, the algorithm selects it for the reservoir with probability k / i (because j must land in the first k positions out of i equally likely choices).
  • If the item is selected, it replaces one of the k existing items uniformly, so no position is favoured.

A key result is: after processing i items, every item among those i has probability k/i of being in the reservoir. That is exactly what you want for a uniform sample of fixed size k. This is an important concept to understand well in a data science course in Ahmedabad, especially if you plan to work with logs, real-time dashboards, or event-driven systems.
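The key result can be checked with a short induction on i. The notation is an assumption introduced here for the sketch: x_m is the m-th item of the stream and R_i is the reservoir after i items have been processed.

```latex
\begin{align*}
\Pr[x_m \in R_k] &= 1 = \frac{k}{k}
  && \text{(base case, } m \le k\text{)} \\
\Pr[x_{i+1} \in R_{i+1}] &= \frac{k}{i+1}
  && \text{(the new item is selected when } j \le k\text{)} \\
\Pr[x_m \in R_{i+1} \mid x_m \in R_i] &= 1 - \frac{k}{i+1}\cdot\frac{1}{k} = \frac{i}{i+1}
  && \text{(an old item is evicted only if its own slot is hit)} \\
\Pr[x_m \in R_{i+1}] &= \frac{k}{i}\cdot\frac{i}{i+1} = \frac{k}{i+1}
  && \text{(induction step, } m \le i\text{)}
\end{align*}
```

So if every item has probability k/i after i items, every item, old or new, has probability k/(i+1) after i+1 items.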

Practical Considerations and Common Pitfalls

Reservoir sampling is conceptually simple, but production use requires a few careful choices:

  • Random number quality: Use a reliable RNG from your language’s standard library. Poor randomness can introduce subtle bias.
  • Large streams: If i becomes very large, keep the counter in a type wide enough for it (a signed 32-bit counter overflows after about 2.1 billion items) and draw j with a call that returns an unbiased integer over the full range [1, i].
  • Sampling more than one item (k > 1): Reservoir sampling naturally supports this with the same steps. Memory stays O(k).
  • Weighted sampling: Sometimes you want more important items to be more likely included (for example, errors vs normal logs). That requires weighted reservoir sampling, which is a related but different method.
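  • Distributed streams: If data is processed across partitions (multiple machines), you typically maintain a partial reservoir and an item count per partition, then merge so that each partition contributes in proportion to the number of items it saw; simply pooling the reservoirs and resampling over-represents items from partitions that saw fewer items. Alternatively, use streaming frameworks with built-in sampling primitives.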
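The last two points can be illustrated with a hedged Python sketch. The weighted variant shown is one known method (the A-Res scheme of Efraimidis and Spirakis, which keeps the k items with the largest keys u^(1/w)), swapped in here as an example; the article does not say which weighted method it has in mind. The merge function assumes each partition reports its reservoir together with the number of items it saw, and that each reservoir holds min(k, items seen) items. All function and parameter names are hypothetical.

```python
import heapq
import random
from typing import Iterable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def weighted_reservoir_sample(
    stream: Iterable[Tuple[T, float]], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """A-Res sketch: stream yields (item, weight); heavier items are likelier to be kept."""
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap: List[Tuple[float, int, T]] = []  # min-heap on key; index breaks ties
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # assumption: non-positive weights are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, index, item))  # evict the smallest key
    return [item for _, _, item in heap]


def merge_reservoirs(
    res_a: Sequence[T], n_a: int, res_b: Sequence[T], n_b: int, k: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Merge two uniform reservoirs drawn from partitions of n_a and n_b items.

    Each output slot is taken from a partition with probability proportional
    to the number of that partition's items not yet accounted for.
    """
    rng = rng or random.Random()
    a, b = list(res_a), list(res_b)
    rng.shuffle(a)  # so that pop() removes a uniformly random remaining item
    rng.shuffle(b)
    merged: List[T] = []
    while len(merged) < k and n_a + n_b > 0:
        if rng.randrange(n_a + n_b) < n_a:
            merged.append(a.pop())
            n_a -= 1
        else:
            merged.append(b.pop())
            n_b -= 1
    return merged
```

For example, `merge_reservoirs(res_a, 30, res_b, 10, 5)` draws, on average, three times as many items from the first partition's reservoir as from the second's, which is what a uniform sample over all 40 items requires.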

Real-World Use Cases

Reservoir sampling is useful anywhere you need a representative slice of a stream:

  • Monitoring and observability: Keep a random sample of requests to inspect latency, headers, or payload shapes without logging everything.
  • Data quality checks: Sample records from an ingestion pipeline to detect schema drift, missing fields, or unusual values.
  • Online A/B experimentation: Randomly retain a fixed number of user sessions for deeper analysis while keeping storage bounded.
  • Machine learning workflows: Maintain a “training preview” dataset from a firehose of events to quickly test feature logic.

Because it is one-pass and memory bounded, reservoir sampling fits well in streaming architectures. Many learners encounter it while building real-time analytics projects in a data science course in Ahmedabad, where handling continuous data is an increasingly common expectation.

Conclusion

Reservoir sampling is a family of randomized algorithms designed for a very practical challenge: selecting a uniform random sample from a stream when the total size is unknown. With only O(k) memory and one pass through data, it produces a correct simple random sample at any point in time. If your work involves event streams, monitoring, or scalable data pipelines, reservoir sampling is a dependable tool to know—and a strong foundational topic to master in a data science course in Ahmedabad.
