Window Function in SQL is an OLAP functionality that provides ranking, cumulative computation, and partitioning aggregation. Many commercial RDMBS such like Oracle, MS SQL Server and DB2 have implemented part of this specification, while open source RDMBS including PostgreSQL, MySQL and Firebird doesn't yet. To implement this functionality on PostgreSQL not only helps many users move from those RDBMS to PostgreSQL but encourages OLAP applications such as BI (Business Inteligence) to analyze large data set.
The first proposal: http://archives.postgresql.org/pgsql-hackers/2008-06/msg00380.php
The subsequent discussion: http://archives.postgresql.org/pgsql-hackers/2008-07/msg00232.php
patch v05: http://umitanuki.net/pgsql/wfv05/window_functions.patch.20080917
patch v05 applied source git: http://git.postgresql.org/git/~davidfetter/window_functions/.git
sample SQL: http://umitanuki.net/pgsql/wfv05/sample.sql
Below is a description of how it is designed in the patch so far.
Below are dropped features for 8.4.
The first and second topics are difficult to implement currently. Because these features require random row access, it seems that tuplestore would be able to save multiple positions to mark/restore. This is fundamental change that is over my capability. Also, user defined window functions seem to have much more to decide. I think I can't put into shape the general needs of user's window functions now. Lacking these feature, this stage looks compatible to SQLServer 2005, while Oracle and DB2 have almost full of the specification.
When posted and discussed in -hackers list, a bit confusing was about terminology. So be aware of these definitions.
An expression evaluated in a Window node, which is one of rank function, aggregate function, ntile function, lead or lag function, first or last value function, or nth value function. In a Window node, only TargetEntry that has window expression is evaluated, while other entries are evaluated some outer (scans, joins, aggs) node.
This type of function returns different or the same values row by row. Since this function needs to know and operate "current window", we will need to add new mechanism to PostgreSQL. This includes new spec function such as ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), LAG(), etc.
The rest part of window expression. This type of function scans tuples along the specified window frame, then returns the same values as long as the window frame doesn't slide. We can use aggregate function we already have and there's no need to add/introduce anything new.
This is a normal aggregate that PostgreSQL has already. "Normal" means "not windowed". In some SQL spec documents, they call it "group aggregate".
It indicates a window frame, which is represented in SQL syntax by "ROW BETWEEN...", "RANGE BETWEEN...", "CURRENT ROW...", etc. This range slides row by row in a partitoned window, thus we need to introduce some mechanism to optimize not to allocate wasting memory.
The sample table is like this.
sample=# SELECT * FROM empsalary;
depname | empno | salary | enroll_date -----------+-------+--------+------------- develop | 10 | 5200 | 2007-08-01 sales | 1 | 5000 | 2006-10-01 personnel | 5 | 3500 | 2007-12-10 sales | 4 | 4800 | 2007-08-08 sales | 6 | 5500 | 2007-01-02 personnel | 2 | 3900 | 2006-12-23 develop | 7 | 4200 | 2008-01-01 develop | 9 | 4500 | 2008-01-01 sales | 3 | 4800 | 2007-08-01 develop | 8 | 6000 | 2006-10-01 develop | 11 | 5200 | 2007-08-15 (11 rows)
Now let's throw a windowed query.
sample=# SELECT sample-# depname, sample-# empno, sample-# salary, sample-# sum(salary) OVER (PARTITION BY depname) sample-# FROM sample-# empsalary;
depname | empno | salary | sum -----------+-------+--------+------- develop | 10 | 5200 | 25100 develop | 7 | 4200 | 25100 develop | 9 | 4500 | 25100 develop | 8 | 6000 | 25100 develop | 11 | 5200 | 25100 personnel | 2 | 3900 | 7400 personnel | 5 | 3500 | 7400 sales | 3 | 4800 | 20100 sales | 1 | 5000 | 20100 sales | 4 | 4800 | 20100 sales | 6 | 5500 | 20100 (11 rows)
You may see dep_sum is the result of SUM() for each depname, and year_sum is the result of SUM() for each enrolling year, without rows aggregated.
The ranking function of window function works as:
sample=# SELECT sample-# depname, sample-# empno, sample-# salary, sample-# rank() OVER (PARTITION BY depname ORDER BY salary) sample-# FROM sample-# empsalary;
depname | empno | salary | rank -----------+-------+--------+------ develop | 7 | 4200 | 1 develop | 9 | 4500 | 2 develop | 10 | 5200 | 3 develop | 11 | 5200 | 3 develop | 8 | 6000 | 5 personnel | 5 | 3500 | 1 personnel | 2 | 3900 | 2 sales | 4 | 4800 | 1 sales | 3 | 4800 | 1 sales | 1 | 5000 | 3 sales | 6 | 5500 | 4 (11 rows)
Another example shows a use in combination with GROUP BY clause.
sample=# SELECT sample=# y, sample=# m, sample=# SUM(SUM(people)) OVER (PARTITION BY y ORDER BY m), sample=# AVG(people) sample=# FROM( sample=# SELECT sample=# EXTRACT(YEAR FROM accident_date) AS y, sample=# EXTRACT(MONTH FROM accident_date) AS m, sample=# * sample=# FROM sample=# accident sample=# )s sample=# GROUP BY y, m;
y | m | sum | avg ------+----+------+-------------------- 2005 | 1 | 1698 | 3.5161290322580645 2005 | 2 | 1698 | 4.8928571428571429 2005 | 3 | 1698 | 4.3870967741935484 2005 | 4 | 1698 | 4.7333333333333333 2005 | 5 | 1698 | 5.0967741935483871 2005 | 6 | 1698 | 5.2666666666666667 2005 | 7 | 1698 | 4.8709677419354839 2005 | 8 | 1698 | 4.7419354838709677 2005 | 9 | 1698 | 4.8000000000000000 2005 | 10 | 1698 | 4.8709677419354839 2005 | 11 | 1698 | 4.1333333333333333 2005 | 12 | 1698 | 4.5483870967741935 2006 | 1 | 1740 | 4.3870967741935484 2006 | 2 | 1740 | 4.5000000000000000 2006 | 3 | 1740 | 4.8387096774193548 2006 | 4 | 1740 | 5.0333333333333333 2006 | 5 | 1740 | 4.4838709677419355 2006 | 6 | 1740 | 4.1333333333333333 2006 | 7 | 1740 | 5.1935483870967742 2006 | 8 | 1740 | 4.7419354838709677 2006 | 9 | 1740 | 3.8333333333333333 2006 | 10 | 1740 | 6.2258064516129032 2006 | 11 | 1740 | 4.4333333333333333 2006 | 12 | 1740 | 5.3225806451612903 (24 rows)
You can put any expressions as window function's arguments or PARTITION BY/ORDER BY clause as long as it satisfies condition that normal aggregate requires.
Now WINDOW clause is shown.
sample=# SELECT depname, empno, salary, sum(salary) OVER w FROM empsalary WINDOW w AS (PARTITION BY depname);
depname | empno | salary | sum -----------+-------+--------+------- develop | 11 | 5200 | 25100 develop | 7 | 4200 | 25100 develop | 9 | 4500 | 25100 develop | 8 | 6000 | 25100 develop | 10 | 5200 | 25100 personnel | 5 | 3500 | 7400 personnel | 2 | 3900 | 7400 sales | 3 | 4800 | 14600 sales | 1 | 5000 | 14600 sales | 4 | 4800 | 14600 (10 rows)
Note that a window definition which is not referred from any function is ignored.
All above is defined in nodeWindow.c temporarily.
Some of them doesn't have trans function for optimization, which means opr_sanity check fails.
EXPLAIN SELECT sum(salary) OVER (PARTITION BY depname) AS dep_sum ,sum(salary) OVER (PARTITION BY extract(YEAR FROM enroll_date)) AS year_sum ,* FROM empsalary;
QUERY PLAN
-----------------------------------------------------------------------------------------
Window (cost=127.23..129.83 rows=1040 width=48)
-> Sort (cost=127.23..129.83 rows=1040 width=48)
Sort Key: (date_part('year'::text, (enroll_date)::timestamp without time zone))
-> Window (cost=72.52..75.12 rows=1040 width=48)
-> Sort (cost=72.52..75.12 rows=1040 width=48)
Sort Key: depname
-> Seq Scan on empsalary (cost=0.00..20.40 rows=1040 width=48)
This plan is quite ugly, because for each window a Window node is implicitly added with a Sort node. Probably all of window and sort process is packed into a Window node. For this current plan, Sort node uses Tuplesort as you expect then Window node uses Tuplestore to store each Partition tuples. This is supposed to be the worst plan. We are able to get it better somehow.
These shown below are ideas about how the window function is made up.
CREATE AGGREGATE window_func() ( sfunc = ... stype = ... wfunc = ... initcond = ) For each row we would execute the transition function (sfunc) then, if there is a window function (wfunc) then we call that to return a value for this tuple (so in that case we execute two functions per tuple in the window). If wfunc is not set then we return the transition datatype itself. http://archives.postgresql.org/pgsql-hackers/2008-07/msg00236.php
Objection: A window aggregate is same as a grouping aggregate. Also, some of window functions need full scan of rows *before* returning values.
So that would mean we don't provide a mechanism for user-defined windowed aggregate functions at all. Which solves the discussion about how to pass generic info through to them (at least long enough to get the first implementation done). http://archives.postgresql.org/pgsql-hackers/2008-07/msg00239.php
Objection: As mentioned, it hides the definition of functions from external user so that implementation is easier. However, it is odd as other function types is extensible and SQL spec may add more functions later. Some unification seems need.
Just idea, how about pass window object to a function? We'll provide
window operation API then in the function you take window object
through fcinfo:
Datum func(PG_FUNCTION_ARGS)
{
Datum v;
WindowObject w = get_window(fcinfo);
HeapTuple htup_current = window_current_row(w);
HeapTuple htup_prev = window_preceding(w, 1);
/* do something */
PG_RETURN_DATUM(v);
}
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00254.php
Objection: You should consider about the performance. Some optimization mechanism is required.
And currently, the actual design inside the patch is as:
The rough process of normal aggregate function is described as:
trans_value = initialize_aggregate() for input_rows trans_value = advance_aggregate(trans_value, input_row) result = finalize_aggregate(trans_value)
while window function is described as:
while window_frame
if frame_is_new
trans_value = initialize_aggregate();
for input_rows
if agg_has_trans_fn
trans_value = advance_aggregate(trans_value, input_row)
result = finalize_aggregate(trans_value)
preserved_pointer = fcinfo->flinfo->fn_extra
This code means final function is called multiple times so that multiple value is returned after scanning all the frame rows. For ranking system, how to know its boundary is a bit kludge using fcinfo->context. For more detail about ranking system, see nodeWindow.c.
More valuable discussion about the design starts here: http://archives.postgresql.org/pgsql-hackers/2008-09/msg00021.php
It seems that we must add something like Window object mechanism that represents a window frame, to describe logical window. At the moment there needs to be careful not to cut its performance.
test0 test1 test2 test3 test4 test5 ------------------------------------------------------------ 689.502 416.633 257.970 1195.294 954.318 1204.292 687.254 447.676 256.629 1075.342 949.711 1154.754 700.602 421.818 260.742 1105.680 926.462 1203.012 736.594 476.388 334.310 1157.818 978.861 1199.944 676.572 418.782 270.270 1060.900 909.474 1175.079 687.260 428.564 257.032 1069.013 1045.387 1275.988 700.252 429.289 263.216 1074.749 1018.968 1273.965 719.478 445.218 258.464 1087.932 1015.744 1273.637 694.865 453.737 261.286 1065.229 1039.941 1262.208 685.756 430.169 258.017 1124.795 1102.055 1297.603 ------------------------------------------------------------ 697.81 436.83 267.79 1101.68 994.09 1232.05 test0 SELECT sum(amount) OVER (PARTITION BY sector) FROM bench1; test1 SELECT amount FROM bench1 ORDER BY sector; test2 SELECT sum(amount) FROM bench1 GROUP BY sector; test3 SELECT id, amount - avg(amount) OVER (PARTITION BY sector) FROM bench1; test4 SELECT id, amount - avg FROM bench1 INNER JOIN(SELECT sector, avg(amount) FROM bench1 GROUP BY sector)t USING(sector) test5 SET enable_hashagg TO off; SELECT id, amount - avg FROM bench1 INNER JOIN(SELECT sector, avg(amount) FROM bench1 GROUP BY sector)t USING(sector)
It says the current window function is faster than sort-operated self-join and slower than hashagg-operated self-join.
use strict;
use warnings;
use File::Temp qw(tempfile);
my $home = '/usr/local/postgresql-dev';
my $psql = "$home/bin/psql -p 35432";
my $dbname = 'sample';
my @tests;
push @tests, <<_SQL;
SELECT sum(amount) OVER (PARTITION BY sector) FROM bench1;
_SQL
push @tests, <<_SQL;
SELECT amount FROM bench1 ORDER BY sector;
_SQL
push @tests, <<_SQL;
SELECT sum(amount) FROM bench1 GROUP BY sector;
_SQL
push @tests, <<_SQL;
SELECT id, amount - avg(amount) OVER (PARTITION BY sector) FROM bench1;
_SQL
push @tests, <<_SQL;
SELECT id, amount - avg FROM bench1 INNER JOIN(SELECT sector, avg(amount) FROM bench1 GROUP BY sector)t USING(sector)
_SQL
push @tests, <<_SQL;
SET enable_hashagg TO off; SELECT id, amount - avg FROM bench1 INNER JOIN(SELECT sector, avg(amount) FROM bench1 GROUP BY sector)t USING(sector)
_SQL
my @total_elapse;
my $rows = 100000;
my $ntest = 10;
&main();
exit;
sub main{
&init();
my $i = 0;
print join("\t", map { "test" . $i++ } @tests) . $/;
print "-" x 60 . $/;
for (1 .. $ntest){
my @results;
my $i = 0;
foreach my $t (@tests){
my $res = &doit($t);
push @results, $res;
$total_elapse[$i] += $res;
$i++;
}
print join("\t", @results) . $/;
}
print "-" x 60 . $/;
print join("\t", map { sprintf("%.2f", $_ / $ntest) } @total_elapse) . $/;
print $/;
$i = 0;
print join("", map { "test" . $i++ . "\t" . $_ } @tests);
}
sub init{
my $iter = &generator();
print STDERR "generating data...$/";
my ($fh, $filename) = tempfile();
for my $i (1 .. $rows){
print $fh join("\t", $iter->()) . $/;
}
close $fh;
my $create_sql = <<_SQL;
DROP TABLE IF EXISTS bench1;
CREATE TABLE bench1 (id int8, amount int8, sector int);
_SQL
my $out;
print `$psql -c '$create_sql' $dbname`;
print `echo "\\\\copy bench1 from \'$filename\'" | $psql $dbname`;
unlink $filename;
}
sub doit{
my $sql = shift;
my $out = `echo "\\timing\n\\o /dev/null\n$sql\n" | $psql $dbname`;
my @outs = split(/\n/, $out);
if($outs[$#outs] =~ /^Time: ([0-9\.]+) ms/m){
return $1;
}
die "coudn't parse" . $/;
}
sub generator{
my $i = 0;
return sub{
$i++;
return ($i, $i % 3 == 0 ? $i : -$i, $i % 1000);
}
}
this document is as of 2008/09/17, written by Hitoshi Harada (umi.tanuki@gmail.com)