Using Bloom Filters to Lower Cost of Large Join Jobs

Data management company <a href="http://liveramp.com/">LiveRamp</a> recently began open-sourcing some of its internal data analysis and management tools. As part of this process it released BloomJoin, a new tool for reducing the cost of MapReduce join jobs. BloomJoin is useful when you are joining two datasets where one is very large and the other is much smaller and matches only a small proportion of the records in the larger set. Normally, such a join first sorts both datasets and then reduces both. This works fairly well, but sorting the larger dataset is inefficient because most of its records will never find a match. To alleviate this, BloomJoin first applies a Bloom filter, built from the smaller target dataset, to the larger dataset. A Bloom filter is a compact, probabilistic representation of a set. Once it has been given a target set of objects, it rejects objects that are not found within the target, although it may occasionally accept an object that is not in the target (a false positive). Bloom filters hash t...
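To make the filter-then-join idea concrete, here is a minimal single-machine sketch in Python. It is not LiveRamp's BloomJoin code and does not use MapReduce. The names `BloomFilter`, `might_contain`, and `bloom_join`, the filter size, the number of hashes, and the salted SHA-256 hashing scheme are all assumptions made for illustration. A dictionary lookup stands in for the sort-and-reduce step of a real join job.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash functions (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(large, small):
    """Join two lists of (key, value) pairs, pre-filtering the large side."""
    bloom = BloomFilter()
    small_by_key = {}
    for key, value in small:
        bloom.add(key)
        small_by_key.setdefault(key, []).append(value)

    # Cheap pass: drop large-side records whose key cannot be in the small set.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # Exact join on the survivors; this also discards any false positives.
    joined = [(k, lv, sv) for k, lv in survivors for sv in small_by_key.get(k, [])]
    return joined, len(survivors)
```

Calling `bloom_join(large, small)` returns the joined rows together with the number of large-side records that survived the filter. In a real join job, only those survivors would need to be sorted and shuffled, which is where the savings would come from.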