builtin_equi_join
The builtin_equi_join operator joins two arrays on specified key-tuples from each array (ignoring NULL keys).
Differences from the plugin equi_join operator
The result array is a DataFrame.
Join-key attrs/dims must be specified by name rather than position.
The keyword parameters for specifying keys are "left_keys" and "right_keys", changed from "left_names" and "right_names".
Keys may be attributes or dimensions. Names in "left_keys" are interpreted relative to the left schema only, and similarly for "right_keys" and the right schema, so the same names may be used in left_keys and right_keys without requiring array aliases to disambiguate.
builtin_equi_join() implements the LogicalFlow API, which allows optimizations such as filter pushdown and projection insertion to eliminate as many cells and attributes as possible early in query evaluation. This is especially useful for automatically eliminating unused attributes from the builtin_equi_join's inputs.
Synopsis
builtin_equi_join(
left_array,
right_array,
left_keys:(left_keyA, left_keyB, ...),
right_keys:(right_keyA, right_keyB, ...)
[, <settings>]
);
Summary
The result array is a DataFrame.
The result has one cell for each pair (left, right) of matching cells where
the key values are non-NULL and all the left_keys are equal to the corresponding
right_keys.
In the result schema the join keys are placed first; their names are assigned from the left array, and they are nullable if nullable in either of the inputs. They are followed by the remaining left attributes, then non-key left dimensions if requested, then right attributes, then non-key right dimensions if requested.
Required Parameters
left_keys: a single attr/dim, or a bracketed list of attrs/dims from the left input
right_keys: a single attr/dim, or a bracketed list of attrs/dims from the right input
Optional Parameters
algorithm:name
: a hard override on how to perform the join; currently supported values are below; see next section for details
hash_replicate_left
: copy the entire left array to every instance and perform a hash join
hash_replicate_right
: copy the entire right array to every instance and perform a hash join
merge_left_first
: redistribute the left array by hash first, then perform either a merge or hash join
merge_right_first
: redistribute the right array by hash first, then perform either a merge or hash join
left_outer:false/true
: controls whether a cell is produced for each left cell not matching any right cell
right_outer:false/true
: controls whether a cell is produced for each right cell not matching any left cell
chunk_size:S
: specifies the chunk size for the result array
keep_dimensions:false/true
: whether the result should keep the (non-key) input dimensions, converted to attributes; defaults to false
hash_join_threshold:MB
: a threshold on the array size used to choose the algorithm; see next section for details; defaults to the merge-sort-buffer config
bloom_filter_size:bits
: the size of the bloom filters to use, in units of bits; TBD: clean this up
See the Algorithms section below for a detailed description of the algorithms.
Examples
We'll start with a couple of small arrays:
$ iquery -aq "store(apply(build(<a:string>[i=0:5,2,0], '[(null),(def),(ghi),(jkl),(mno)]', true), b, double(i)*1.1), left)"
{i} a,b
{0} null,0
{1} 'def',1.1
{2} 'ghi',2.2
{3} 'jkl',3.3
{4} 'mno',4.4
$ iquery -aq "store(apply(build(<c:string>[j=1:5,3,0], '[(def),(mno),(null),(def)]', true), d, j), right)"
{j} c,d
{1} 'def',1
{2} 'mno',2
{3} null,3
{4} 'def',4
Join the arrays on the string attribute, i.e. left.a=right.c:
$ iquery -aq "builtin_equi_join(left, right, left_keys:a, right_keys:c)"
a,b,d
'def',1.1,4
'def',1.1,1
'mno',4.4,2
Note "left.a" and "right.c" are combined into a single attribute "a" (name inherited from left). The other two attributes are "left.b" and "right.d".
Perform left, right or full outer joins. Note the order of returned results may vary:
$ iquery -aq "builtin_equi_join(left, right, left_keys:a, right_keys:c, left_outer:true)"
a,b,d
'ghi',2.2,null
'jkl',3.3,null
null,0,null
'def',1.1,4
'def',1.1,1
'mno',4.4,2
$ iquery -aq "builtin_equi_join(left, right, left_keys:a, right_keys:c, right_outer:true)"
a,b,d
'def',1.1,4
'def',1.1,1
'mno',4.4,2
null,null,3
$ iquery -aq "builtin_equi_join(left, right, left_keys:a, right_keys:c, left_outer:true, right_outer:true)"
a,b,d
'mno',4.4,2
null,0,null
null,null,3
'def',1.1,4
'def',1.1,1
'ghi',2.2,null
'jkl',3.3,null
Join on two keys: left.i = right.d (dimension to attribute) and left.a = right.c.
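A sketch of that two-key join; key lists are parenthesized, and corresponding keys are compared pairwise, so left.i pairs with right.d and left.a with right.c:
$ iquery -aq "builtin_equi_join(left, right, left_keys:(i,a), right_keys:(d,c))"
With the arrays above, only the pairing with i=1, a='def' on the left and d=1, c='def' on the right should satisfy both conditions.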
Compared to the Existing cross_join
We make a large 2D dense array:
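The original array-building command is not preserved here; a hedged sketch of building such an array (the name twod and the dimension sizes are illustrative):
$ iquery -anq "store(build(<val:double>[x=1:10000,1000,0, y=1:10000,1000,0], random()), twod)"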
Here's how to use the current cross_join to extract a strip at x=128:
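A hedged sketch of the cross_join approach, assuming the twod array above. Since cross_join joins only on dimensions, the constant 128 must first be turned into a dimension via redimension:
$ iquery -aq "consume(cross_join(twod as A, redimension(build(<x:int64>[i=0:0,1,0], 128), <i:int64>[x=128:128,1,0]) as B, A.x, B.x))"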
With equi-join, note the redimension is no longer necessary and the performance is similar:
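A hedged sketch of the equivalent builtin_equi_join; the constant can remain an attribute, so no redimension is needed:
$ iquery -aq "consume(builtin_equi_join(twod, build(<x:int64>[i=0:0,1,0], 128), left_keys:x, right_keys:x))"
Here left_keys:x names the dimension of twod, while right_keys:x names the built attribute.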
Note that left_keys and right_keys interpret the key names with respect to their corresponding input array, so we don't need to disambiguate the 'x' in left_keys and right_keys.
Here, builtin_equi_join detects that the join is on dimensions and uses a chunk filter structure to prevent irrelevant chunks from being scanned. The above is a lucky case for cross_join; as the number of attributes increases, the advantage of builtin_equi_join gets bigger. If the join is on attributes, cross_join definitely cannot keep up. Moreover, cross_join always replicates the right array, no matter how large, to every instance; this is often disastrous. builtin_equi_join will adapt well regardless of the order of arguments, in most cases.
One disadvantage at the moment is that builtin_equi_join is fully materializing. In a scenario such as:
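A hedged sketch of such a nested scenario (the array names a, b, c and the key j are illustrative):
$ iquery -aq "builtin_equi_join(builtin_equi_join(a, b, left_keys:j, right_keys:j), c, left_keys:j, right_keys:j)"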
The inner operation is evaluated first, its output is materialized, and then the outer operation is evaluated. This isn't always optimal, and in some cases nested cross_join can be faster. We hope to make this more optimal soon.
At the moment, cross_join remains useful for the Full Cartesian Product use case, i.e.:
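For example, invoked with no join-on dimensions, cross_join pairs every left cell with every right cell (using the small left and right arrays from the Examples section):
$ iquery -aq "cross_join(left, right)"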
In cases like this, the inputs are usually small (because the output squares them). Here, preserving dimensionality is often useful and the approach of replicating one of the arrays is about the best one can do.
Usage
The left and right arrays may be any SciDB arrays or operator outputs.
Specifying join-on fields (keys)
left_keys:(a,b,c)
: comma-separated dimension or attribute names from the left array
right_keys:(d,e,f)
: comma-separated dimension or attribute names from the right array
There must be an equal number of left and right keys and corresponding keys must have matching data types; dimensions are int64.
When joining on identically named keys
Names in left_keys are interpreted against the left schema only, and names in right_keys against the right schema, so identically named keys may be used on both sides without array aliases to disambiguate.
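For example, if both inputs have a key attribute named k, no aliasing is required (the arrays A and B here are illustrative):
$ iquery -aq "builtin_equi_join(A, B, left_keys:k, right_keys:k)"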
Outer joins
left_outer:false/true
: if set to true or 1, perform a left outer join: include all cells from the left array and populate with nulls where there are no corresponding cells in the right array
right_outer:false/true
: if set to true or 1, perform a right outer join: include all cells from the right array and populate with nulls where there are no corresponding cells in the left array
By default, both of these are set to false. Setting both left_outer:true and right_outer:true will result in a full outer join.
Additional filter on the output:
filter:expression
: apply an additional filter to the result.
Use any valid boolean expression over the output attributes. For example:
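A hedged sketch using the left and right arrays from the Examples section (the joined output has attributes a, b and d; exact quoting of the expression may vary by client):
$ iquery -aq "builtin_equi_join(left, right, left_keys:a, right_keys:c, filter: d > b)"
Of the three inner-join cells shown earlier, only ('def',1.1,4) satisfies d > b.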
Note that builtin_equi_join(..., filter: expression) is equivalent to filter(builtin_equi_join(...), expression), except that the operator is materializing and the former applies the filter prior to materialization. This is an efficiency improvement in cases where the join on keys increases the size of the data before filtering.
Other settings:
chunk_size:S
: chunk size for the output
keep_dimensions:false/true
: true if the output should contain all the input dimensions, converted to attributes; false is the default, meaning dimensions are retained only if they are join keys
hash_join_threshold:MB
: a threshold on the array size used to choose the algorithm; see next section for details; defaults to the merge-sort-buffer config
bloom_filter_size:bits
: the size of the bloom filters to use, in units of bits; TBD: clean this up
algorithm:name
: a hard override on how to perform the join; currently supported values are below; see next section for details
hash_replicate_left
: copy the entire left array to every instance and perform a hash join
hash_replicate_right
: copy the entire right array to every instance and perform a hash join
merge_left_first
: redistribute the left array by hash first, then perform either a merge or hash join
merge_right_first
: redistribute the right array by hash first, then perform either a merge or hash join
Result
Each array cell on the left is associated with 0 or more array cells on the right if and only if all specified keys are respectively equal: left_cell.key1 = right_cell.key1 AND left_cell.key2 = right_cell.key2 AND ...
For inner joins, the output will contain one cell for each such association using all attributes from both arrays, plus dimensions if requested. Outer joins will include all cells from the input(s) as specified and use NULLs when a matching cell cannot be found in the opposite array. A cell where any of the join-on keys are NULL will not be associated with any tuples from the opposite array; so these cells will not be present unless the join is outer. The order of the returned result is indeterminate and will vary with algorithm and number of instances.
Result Schema
The result is returned as: <join_key_0:type [NULL], join_key_1:.., left_att_0:type, left_att_1,... right_att_0...> [instance_id, value_no]
Note the join keys are placed first; their names are assigned from the left array, and they are nullable if nullable in either of the inputs. They are followed by the remaining left attributes, then non-key left dimensions if requested, then right attributes, then non-key right dimensions if requested.
Note that the result is "flattened" along [instance_id, value_no], in a manner similar to operators like grouped_aggregate, sort, stream and so on. Depending on the join keys, input chunking may or may not be easy to preserve, and it may take extra work to preserve it. The user needs to perform any redimensioning manually, post-join.
Algorithms
The operator first estimates the lower bound sizes of the two input arrays and then, absent a user override, picks an algorithm based on those sizes.
Size Estimation
It is easy to determine whether an input array is materialized (a leaf of a query or the output of a materializing operator). If so, the exact size of the array can be determined very quickly (O(number of chunks), with no disk scans). Otherwise, the operator initiates a pre-scan of just the Empty Tag attribute to find the number of non-empty cells (count) in the array. The count, multiplied by the attribute sizes, is used to estimate the total size. The pre-scan continues until either the end of the array (at the local instance) or the estimated size reaches hash_join_threshold. This ensures the pre-scan does not take too long. The per-instance pre-scan results are then gathered together with one round of message exchange between instances.
Replicate and Hash
If it is determined (or user-dictated) that one of the arrays is small enough to fit in memory on every instance, then that array is copied entirely to every instance and loaded into an in-memory hash table. The table is used to assemble a filter over the chunk positions in the other array. The other array is then read, using the filter to prevent disk scans for irrelevant chunks. Chunks that make it through the filter are joined using the hash table lookup.
Merge
If both arrays are sufficiently large, the smaller array's join keys are hashed and the hash is used to redistribute it such that each instance gets roughly an equal portion. Concurrently, a filter over chunk positions and a bloom filter over the join keys are built. The chunk and bloom filters are copied to every instance. The second array is then read - using the filters to eliminate unnecessary chunks and values - and redistributed along the same hash, ensuring co-location. Now that both arrays are colocated and their exact sizes are known, the algorithm may decide to read one of them into a hash table (if small enough) or sort both and join via a pass over two sorted sets.