Approximating data aggregations using stochastic streaming algorithms

A new approach for approximate aggregations for huge data volumes based on stochastic streaming algorithms.

January 28, 2019

Aggregating big data volumes requires expensive computing resources and considerable time to generate exact results. Examples include count distinct, quantiles, most frequent items, joins, matrix computations and graph analysis.

Many times, approximate results are acceptable, especially if the error coefficient is deterministic. There are ways to get much faster (by several orders of magnitude) to the desired result, but with less accuracy. However, having a deterministic, mathematically proven error bounds, we could extrapolate the approximation to the final result. If speed is more important than 100% accuracy, you can use a special class of algorithms called stochastic streaming algorithms (or more simple - Sketches). This class of algorithms is particularly useful for interactive data analysis and real-time processing.

Sketching is a relatively recent development in the theoretical field of Stochastic Streaming Algorithms, which deals with algorithms that can extract information from a stream of data in a single pass (sometimes called “one-touch” processing) using various randomization techniques.

Sketching is a synergistic blend of theoretical mathematics, statistics and computer science, refers to a broad range of algorithms, and has experienced a great deal of interest and growth since the mid 1990’s coinciding with the growth of the Internet and the need to process and analyze Big Data. The term sketch, with its allusion to an artist’s sketch, has become the popular term to describe these algorithms and associated data structures that implement the theory.

There are 7 types of types of Sketches and each type can be applied to a specific problem:

  1. Theta Sketches for Estimating Stream Expression Cardinalities
  2. HyperLogLog Sketches: Estimating Stream Cardinalities
  3. HyperLogLog Map Sketch: Estimating Stream Cardinalities of Key-Value Pairs
  4. Quantiles Sketches: Estimating Distributions from a Stream of Values
  5. Frequent Items Sketches: Finding the Heavy Hitter Objects from a Stream
  6. Tuple Sketches: Extending Theta Sketches to Perform Associative Analysis
  7. Sampling Sketches: Uniform Sampling of a Stream into a fixed size space

BigConnect allows sketches to be stored on Entities and Relationships. These sketches can be continually updated as new data arrives, or can be used in queries, and have numerous applications like:

Sketches can be updated whenever data is loaded, updated or removed by creating a simple DataWorker that updates the sketches stored on affected nodes accordingly.

At query time, sketches can be used as aggregation functions simply by adding custom Gremlin aggregators or Cypher special functions. Practical examples will be given in later blog posts, but should not be very difficult to implement once you know the basics of BigConnect.

Read more on how to extend BigConnect on docs.bigconnect.io

Back to Blog