A new approach to approximate aggregations over huge data volumes, based on stochastic streaming algorithms.

January 28, 2019

Aggregating big data volumes requires expensive computing resources and considerable time to generate exact results. Examples include *count distinct*, quantiles, most frequent items, joins, matrix computations and graph analysis.

In many cases, approximate results are acceptable, especially when the error is mathematically bounded. There are ways to reach the desired result much faster (**by several orders of magnitude**), at the cost of some accuracy. With deterministic, mathematically proven error bounds, we can qualify how far the approximation may lie from the exact answer. If speed is more important than 100% accuracy, you can use a special class of algorithms called **stochastic streaming algorithms (or, more simply, Sketches)**. This class of algorithms is particularly useful for interactive data analysis and real-time processing.

Sketching is a relatively recent development in the theoretical field of *Stochastic Streaming Algorithms*, which deals with algorithms that can extract information from a stream of data in a single pass (sometimes called “one-touch” processing) using various randomization techniques.
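To make the single-pass, randomized idea concrete, here is a toy K-Minimum-Values (KMV) distinct-count sketch in Python. This is an illustrative example only (not BigConnect or any production library): it keeps just the `k` smallest hash values ever seen, and from how tightly those minima are packed it estimates how many distinct items passed by.

```python
import hashlib

class KmvSketch:
    """Toy K-Minimum-Values sketch: keeps the k smallest hash fractions seen.

    If the k-th smallest hash fraction is t, the stream behaves as if
    distinct items were dropped uniformly into [0, 1) with density
    roughly (k - 1) / t, which is the cardinality estimate.
    """
    def __init__(self, k=256):
        self.k = k
        self.mins = []  # sorted list holding at most k smallest hash fractions

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-random fraction in [0, 1).
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self.mins:
            return                      # duplicate item: sketch is unchanged
        if len(self.mins) < self.k:
            self.mins.append(h)
            self.mins.sort()
        elif h < self.mins[-1]:
            self.mins[-1] = h           # replace the largest of the k minima
            self.mins.sort()

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact while under capacity
        return (self.k - 1) / self.mins[-1]
```

The sketch uses a single pass and O(k) memory regardless of stream length, and the relative error shrinks roughly as 1/√k; with k = 256 the estimate is typically within a few percent.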

Sketching is a synergistic blend of theoretical mathematics, statistics, and computer science. It refers to a broad range of algorithms and has seen a great deal of interest and growth since the mid-1990s, coinciding with the growth of the Internet and the need to process and analyze Big Data. The term *sketch*, with its allusion to an artist’s sketch, has become the popular name for these algorithms and the associated data structures that implement the theory.

There are seven types of Sketches, and each type is suited to a specific problem:

- Theta Sketches: Estimating Stream Expression Cardinalities
- HyperLogLog Sketches: Estimating Stream Cardinalities
- HyperLogLog Map Sketch: Estimating Stream Cardinalities of Key-Value Pairs
- Quantiles Sketches: Estimating Distributions from a Stream of Values
- Frequent Items Sketches: Finding the Heavy Hitter Objects from a Stream
- Tuple Sketches: Extending Theta Sketches to Perform Associative Analysis
- Sampling Sketches: Uniform Sampling of a Stream into a fixed size space
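As an example of the Frequent Items family above, here is a minimal Misra-Gries heavy-hitters sketch in Python. It is a simplified illustration of the idea, not the algorithm production libraries ship verbatim: it tracks at most `k` counters, and when a new item arrives with the table full, every counter is decremented instead.

```python
class FrequentItemsSketch:
    """Toy Misra-Gries heavy-hitters sketch with at most k counters.

    Each kept count under-estimates the true count by at most n/(k+1),
    where n is the stream length, so items whose true frequency is well
    above that threshold are guaranteed to survive in the table.
    """
    def __init__(self, k=8):
        self.k = k
        self.counters = {}

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # Table full: decrement every counter, dropping zeros.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def frequent_items(self):
        # Surviving items, most frequent (by kept count) first.
        return sorted(self.counters, key=self.counters.get, reverse=True)
```

Memory stays at O(k) no matter how long the stream runs, which is exactly the property that makes this family usable on unbounded streams.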

BigConnect allows sketches to be stored on Entities and Relationships. These sketches can be continually updated as new data arrives or used in queries, and they have numerous applications, such as:

- Using a HyperLogLogPlusPlus sketch to provide a very quick estimate of the degree of a node
- Using a Quantiles sketch to estimate the median score associated with an edge, or the 99th percentile of the scores seen on an edge
- Using a Reservoir Items sketch to store a sample of all the distinct labels associated with an edge
- Using Theta Sketches to estimate the number of distinct edges seen on a particular day, the number seen on the previous day and the overlap between the two days
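The Reservoir Items use case above rests on classic reservoir sampling. Here is a short, self-contained Python sketch of Algorithm R (illustrative, not BigConnect code): it keeps a uniform random sample of `k` items from a stream whose length is unknown in advance, in one pass and O(k) memory.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length
    (Vitter's Algorithm R): one pass, O(k) memory."""
    rng = random.Random(seed)   # fixed seed for reproducibility here
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            # Keep item i with probability k / (i + 1) by letting it
            # overwrite a random slot only when that slot index < k.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Every item in the stream ends up in the sample with the same probability k/n, even though n is never known up front, which is what makes this safe to run while data is still arriving.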

Sketches can be updated whenever data is loaded, updated or removed by creating a simple DataWorker that updates the sketches stored on affected nodes accordingly.
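This update-on-load pattern works because sketches are mergeable: a sketch built from today's partition and a sketch built from yesterday's can be combined into one that is exactly the sketch of the union. The simplified HyperLogLog below (a toy Python illustration, not BigConnect's implementation, and without the small-range corrections real libraries apply) shows why: merging is just a register-wise maximum.

```python
import hashlib

class HllSketch:
    """Simplified HyperLogLog: 2**p registers, each storing the longest
    run of leading zero bits (plus one) among items hashed to it."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def update(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max gives exactly the sketch of the union stream.
        merged = HllSketch(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        # Raw HLL estimator: harmonic mean of register values.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z
```

Because `max` is idempotent, overlapping partitions are handled for free: merging two sketches whose streams share items yields the same registers as one sketch fed the deduplicated union.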

At query time, sketches can serve as aggregation functions simply by adding custom Gremlin aggregators or special Cypher functions. Practical examples will follow in later blog posts, but they should not be very difficult to implement once you know the basics of BigConnect.

Read more about how to extend BigConnect at docs.bigconnect.io