filter

 The filter operator produces a result array, filtering out elements based on a supplied boolean expression.

Synopsis

filter( array, expression [ , expr_cache: BOOLEAN ] );

Summary

The filter operator filters out data from an array based on an expression over the attribute and dimension values. The result array has the same schema as the supplied array. Each cell of the result array is either empty or identical to the corresponding cell of the supplied array according to the following rules. If the expression at the cell evaluates to: 

  • TRUE, the result cell is identical to the supplied cell.
  • FALSE, the result cell is empty.
  • NULL, the result cell is empty.

The expr_cache: option can be used for manual performance tuning, described further below.

Examples

Filter an Array to Remove Outlying Values

To filter an array to remove outlying values, do the following:

  1. Create a 4×4 array:

    AFL% CREATE ARRAY m4x4<val:double>[i=0:3; j=0:3];
  2. Put values between 0 and 15 into the non-diagonal elements of m4x4 and values greater than 100 into the diagonal elements:

    AFL% store(build(m4x4,iif(i=j,100+i,i*4+j)),m4x4); 


    The output is:

    [
    [(100),(1),(2),(3)],
    [(4),(101),(6),(7)],
    [(8),(9),(102),(11)],
    [(12),(13),(14),(103)]
    ] 
  3. Filter all values of 100 or greater out of m4x4:

    AFL% filter(m4x4,val<100);  


    The output is:

    [
    [(),(1),(2),(3)],
    [(4),(),(6),(7)],
    [(8),(9),(),(11)],
    [(12),(13),(14),()]
    ] 

Use a Regular Expression to Filter the List of Available SciDB Operators

To use a regular expression to filter the list of available SciDB operators, and return the operators containing
the letter 'q', do the following:

AFL% filter(list('operators'),regex(name,'(.*)q(.*)')); 


The output is:

name,library
'quantile','scidb'
'uniq','scidb' 

Hint on Filtering by Dimension

Since 2017, SciDB's filter() operator is significantly improved over previous releases, when the expression involves dimension ranges. Query processing will focus on data in the dimension ranges. So the speedup over previous releases is more if the specified dimension ranges cover a smaller portion of the space.

In order to receive full benefit of the optimization, structure your expression in the form of ((dimension_conjunctions or ...) and other_conditions). For instance, let your array have two dimensions x and y; in order to filter for records in the first or the third quadrants, and with some other conditions, write your expression as: (((x>0 and y>0) or (x<0 and y<0)) and other_conditions).

Performance Tuning with the expr_cache: Option

When the SciDB execution engine pulls attributes through a filter operator, it can either re-evaluate the filter expression for each chunk or tile of the attribute, or it can cache the result of the expression evaluation for the first attribute and refer to the cache when pulling all other attributes.  The cost of evaluating a filter() operation depends on the choice of expression evaluation strategy, the nature of the filter expression, and the number of attributes in the input array.  In the current release, the query optimizer is not yet capable of making the best strategy choice for all queries.  The default behavior is to evaluate expressions per-attribute for input arrays with at most filter-cache-attr-threshold attributes (a configuration parameter), and to use cached expression evaluation for input arrays with more attributes.

The expr_cache: option forces the choice of filter expression evaluation strategy on a per-operator basis, regardless of the number of input attributes.   If set to true, expression evaluations will be cached.  If set to false, the expression will be re-evaluated as each attribute is pulled.  If unspecified, the behavior is as the default described above.