Streaming algorithm

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item.

These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream in memory.

History

Though streaming algorithms had already been studied by Munro and Paterson[1] as well as Flajolet and Martin,[2] the field of streaming algorithms was first formalized and popularized in a paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as an extension of streaming algorithms that allows for a constant or logarithmic number of passes over the dataset.[1]

Models

In the data stream model, some or all of the input data that are to be operated on are not available for random access from disk or memory, but rather arrive as one or more continuous data streams.

Streams can be denoted as an ordered sequence of points (or "updates") that must be accessed in order and can be read only once or a small number of times.

Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector \mathbf{a} = (a_1, \dots, a_n) (initialized to the zero vector \mathbf{0}) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of \mathbf{a} using considerably less space than it would take to represent \mathbf{a} precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[4]

In the cash register model each update is of the form \langle i, c\rangle, so that a_i is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).

In the turnstile model each update is of the form \langle i, c\rangle, so that a_i is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no a_i at any time may be less than zero.
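
For illustration, the two update models can be captured in a few lines of code. The following minimal Python sketch is written for clarity rather than space efficiency (it stores \mathbf{a} in full, which a genuine streaming algorithm avoids) and simply applies a stream of \langle i, c\rangle updates under either model:

    from collections import defaultdict

    def apply_updates(updates, strict_turnstile=False):
        """Apply a stream of (i, c) updates to the frequency vector a.

        Cash register model: every c is a positive increment.
        Turnstile model: c may also be negative (deletions).
        A real streaming algorithm keeps only a small sketch of a, not the
        full vector; this function is only meant to illustrate the models.
        """
        a = defaultdict(int)
        for i, c in updates:
            a[i] += c
            if strict_turnstile and a[i] < 0:
                raise ValueError("strict turnstile model violated at index %r" % (i,))
        return dict(a)

    # cash register: unit insertions only
    print(apply_updates([(1, 1), (2, 1), (1, 1)]))      # {1: 2, 2: 1}
    # turnstile: insertions and deletions
    print(apply_updates([(1, 3), (1, -2), (2, 4)]))     # {1: 1, 2: 4}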

Several papers also consider the "sliding window" model. In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
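
As a toy illustration of the sliding-window model, the following Python sketch maintains the exact sum over the last w items. It stores the whole window, whereas dedicated sliding-window algorithms (such as exponential histograms) approximate such statistics in space sublinear in w:

    from collections import deque

    def windowed_sums(stream, w):
        """Yield the exact sum over the most recent w items of the stream.

        When a new item arrives, the oldest item leaves the window.  This
        exact version keeps all w items in memory and is only meant to
        illustrate the model."""
        window, total = deque(), 0
        for x in stream:
            window.append(x)
            total += x
            if len(window) > w:
                total -= window.popleft()
            yield total

    print(list(windowed_sums([1, 2, 3, 4, 5], w=3)))   # [1, 3, 6, 9, 12]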

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.

Evaluation

The performance of an algorithm that operates on data streams is measured by three basic factors:

  • The number of passes the algorithm must make over the stream.
  • The available memory.
  • The running time of the algorithm.

These algorithms have many similarities with online algorithms, since both must make decisions before all the data are available, but they are not identical. Data stream algorithms have only limited memory available, but they may be able to defer action until a group of points has arrived, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm, then the accuracy of the answer is another key factor. The accuracy is often stated as an (\epsilon,\delta) approximation, meaning that the algorithm achieves an error of less than \epsilon with probability 1-\delta.
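
One standard way of reaching such (\epsilon,\delta) guarantees is the "median trick": run independent copies of an estimator that is within error \epsilon with constant probability and report the median of the results. A minimal Python sketch, where the constant in the number of repetitions is illustrative and make_estimate stands in for any such estimator:

    import math
    import random
    import statistics

    def median_boost(make_estimate, delta):
        """Median of O(log(1/delta)) independent runs of make_estimate().

        If a single run is within the target error with probability better
        than 1/2 (say 2/3), the median is within that error with probability
        at least 1 - delta.  The constant 8 is illustrative, not tight."""
        k = max(1, math.ceil(8 * math.log(1.0 / delta)))
        if k % 2 == 0:
            k += 1                      # odd k: the median is an actual sample
        return statistics.median(make_estimate() for _ in range(k))

    # toy estimator: usually close to the true value 100, occasionally far off
    noisy = lambda: 100 + (random.gauss(0, 2) if random.random() < 0.9
                           else random.uniform(-80, 80))
    print(median_boost(noisy, delta=0.01))   # very likely close to 100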

Applications

Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.[5] They also have applications in databases, such as estimating the size of a join.

Some streaming problems

Frequency moments

The kth frequency moment of a set of frequencies \mathbf{a} is defined as F_k(\mathbf{a}) = \sum_{i=1}^n a_i^k.

The first moment F_1 is simply the sum of the frequencies (i.e., the total count). The second moment F_2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F_{\infty} is defined as the frequency of the most frequent item(s).

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.
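
As a rough illustration of the flavour of that work, the basic estimator behind the AMS sketch for F_2 maintains Z = \sum_i s(i)\,a_i for a random sign function s, so that E[Z^2] = F_2. The Python sketch below caches fully independent random signs for simplicity; the original analysis only needs a 4-wise independent hash family, which is what keeps the memory small:

    import random

    class AMSF2Sketch:
        """One basic AMS estimator for F_2: keep Z = sum_i s(i) * a_i."""
        def __init__(self, seed=0):
            self.rng = random.Random(seed)
            self.sign = {}            # cached random signs (a real sketch hashes)
            self.z = 0
        def _s(self, i):
            if i not in self.sign:
                self.sign[i] = self.rng.choice((-1, 1))
            return self.sign[i]
        def update(self, i, c=1):     # process the update <i, c>
            self.z += self._s(i) * c
        def estimate(self):
            return self.z ** 2        # E[Z^2] = F_2

    # averaging several independent estimators reduces the variance
    sketches = [AMSF2Sketch(seed=s) for s in range(64)]
    for x in [1, 2, 1, 3, 1, 2]:      # unit insertions; frequencies (3, 2, 1)
        for sk in sketches:
            sk.update(x)
    print(sum(sk.estimate() for sk in sketches) / len(sketches))
    # should land in the neighbourhood of F_2 = 3^2 + 2^2 + 1^2 = 14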

Heavy hitters

The goal is to find all elements i whose frequency a_i exceeds a given threshold T. Some notable algorithms are listed below; a sketch of one of them, the Count-Min sketch, appears after the list:

  • Karp-Papadimitriou-Shenker algorithm
  • Count-Min sketch
  • Sticky sampling
  • Lossy counting
  • Sample and Hold
  • Multi-stage Bloom filters
  • Count-sketch
  • Sketch-guided sampling
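
A minimal Count-Min sketch in Python, for the cash register model; the width and depth choices and the simple modular hash family below are common textbook choices, not the only possibilities. A point query never underestimates a_i and, with probability at least 1 - \delta, overestimates it by at most \epsilon \cdot F_1:

    import math
    import random

    class CountMinSketch:
        """Count-Min sketch for point queries under positive updates."""
        def __init__(self, eps=0.01, delta=0.01, seed=0):
            self.w = math.ceil(math.e / eps)              # width of each row
            self.d = math.ceil(math.log(1.0 / delta))     # number of rows
            self.p = (1 << 61) - 1                        # large prime for hashing
            rng = random.Random(seed)
            self.coeffs = [(rng.randrange(1, self.p), rng.randrange(self.p))
                           for _ in range(self.d)]
            self.table = [[0] * self.w for _ in range(self.d)]
        def _bucket(self, row, i):
            a, b = self.coeffs[row]
            return ((a * hash(i) + b) % self.p) % self.w
        def update(self, i, c=1):
            for r in range(self.d):
                self.table[r][self._bucket(r, i)] += c
        def query(self, i):
            return min(self.table[r][self._bucket(r, i)] for r in range(self.d))

    cms = CountMinSketch(eps=0.1, delta=0.01)
    for word in "to be or not to be".split():
        cms.update(word)
    print(cms.query("to"), cms.query("be"), cms.query("xyz"))  # 2 2 0 (never under, may over)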

Counting distinct elements

Counting the number of distinct elements in a stream (sometimes called the F_0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, D. Kane, J. Nelson and D. Woodruff found an asymptotically optimal algorithm for this problem.[6] It uses O(ε^{-2} + log d) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family where r = Ω(log(1/ε) / log log(1/ε)).
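
A simplified Flajolet-Martin style estimator in Python: each hash function tracks the largest number of trailing zero bits observed, and 2 raised to (roughly) that number estimates the distinct count. The constants and the way the independent estimates are combined are simplified relative to the original paper:

    import random
    import statistics

    def fm_distinct_estimate(stream, num_hashes=16, seed=0):
        """Rough Flajolet-Martin estimate of the number of distinct elements."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(64) for _ in range(num_hashes)]
        max_zeros = [0] * num_hashes
        for x in stream:
            for j, salt in enumerate(salts):
                h = hash((x, salt)) & ((1 << 64) - 1)
                # number of trailing zero bits of the 64-bit hash value
                tz = (h & -h).bit_length() - 1 if h else 64
                max_zeros[j] = max(max_zeros[j], tz)
        # average the exponents and correct by the Flajolet-Martin constant
        return 2 ** statistics.mean(max_zeros) / 0.77351

    # stream with 500 distinct values, each repeated 20 times
    print(round(fm_distinct_estimate(i % 500 for i in range(10000))))
    # a rough estimate: expected to be in the vicinity of 500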

Entropy

The (empirical) entropy of a set of frequencies \mathbf{a} is defined as H(\mathbf{a}) = -\sum_{i=1}^n \frac{a_i}{m}\log{\frac{a_i}{m}}, where m = \sum_{i=1}^n a_i.
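
For reference, the quantity being approximated can be computed exactly offline as below; a streaming algorithm must estimate it without storing the full frequency vector:

    import math

    def empirical_entropy(a):
        """Empirical entropy of a frequency vector a: -sum (a_i/m) log(a_i/m)."""
        m = sum(a)
        return -sum((ai / m) * math.log(ai / m) for ai in a if ai > 0)

    print(empirical_entropy([3, 2, 1]))   # entropy of the frequencies (3, 2, 1)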

Estimation of this quantity in a stream has been done by:

  • McGregor et al.
  • Do Ba et al.
  • Lall et al.
  • Chakrabarti et al.

Online learning

Learn a model (e.g. a classifier) by a single pass over a training set.
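
A minimal single-pass learner in Python, using the classic perceptron update as an example; the memory used is just the weight vector, independent of the length of the training stream:

    def perceptron_single_pass(stream, dim):
        """Train a linear classifier in one pass over (features, label) pairs,
        with labels in {-1, +1}; the weights are updated only on mistakes."""
        w = [0.0] * dim
        for x, y in stream:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
        return w

    # toy stream: the label is the sign of the first coordinate
    data = [((1.0, 0.2), 1), ((-0.8, 0.1), -1), ((0.5, -0.3), 1), ((-1.2, 0.4), -1)]
    print(perceptron_single_pass(iter(data), dim=2))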


Lower bounds

Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.

See also

Notes

References


External links

  • Princeton Lecture Notes
  • Piotr Indyk, MIT
  • Dagstuhl Workshop on Sublinear Algorithms
  • IIT Kanpur Workshop on Data Streaming
  • List of open problems in streaming (compiled by Andrew McGregor) from discussion at the IITK Workshop on Algorithms for Data Streams, 2006.
  • StreamIt - programming language and compilation infrastructure by MIT CSAIL
  • IBM Spade - Stream Processing Application Declarative Engine
  • IBM InfoSphere Streams
Tutorials and surveys
  • Data Stream Algorithms and Applications by S. Muthu Muthukrishnan
  • Stanford STREAM project survey
  • Network Applications of Bloom filters, by Broder and Mitzenmacher
  • Xu's SIGMETRICS 2007 tutorial
  • Lecture notes from Data Streams course at Barbados in 2009, by Andrew McGregor and S. Muthu Muthukrishnan
Courses
  • Dartmouth
  • MIT
  • Rice
  • Rutgers
  • University at Buffalo