|
PostgreSQL - analyzing a query to improve speed
1 March 2003
|
|
PostgreSQL is the leading open source database. Many people
consider PostgreSQL to be a competitor to mySQL.
People who say that don't know either product. PostgreSQL isn't competing with mySQL; it's competing
with Oracle. Yes, it's being quiet about it. But the users aren't.
They were Oracle users and they're coming to PostgreSQL.
Of course the skeptics will ask for proof of my claims. I don't have any. I have anecdotal evidence.
And third-hand information. Such as Bruce Momjian's recent
overseas trip report.
All the PostgreSQL versus mySQL
noise is moot. What people should be concentrating on is
PostgreSQL versus Oracle.
Put that in your DBMS and analyse it.
|
|
What's coming up
|
Anyone who has ever written a complex query knows that, despite the best of designs, queries aren't always
fast. Over the past day or two I've been struggling with the slowest part of
FreshPorts. I think I've come up with a relatively simple
solution. The trip to that solution was rather interesting. I will reproduce it below and show you
how it went from a query which took many seconds to a query which takes less than 0.01 seconds.
The first part of the article will cover a bit of FreshPorts introduction, the underlying database tables
and how they relate to each other. Then we'll get into the queries, follow through the analyzing of the
query and finally the optimized result.
This article should be useful to anyone using PostgreSQL or designing databases. It should
also prove useful to those who wonder how databases work and what it takes to make them
run quickly. It might also provide a venue for my own education: people reading this article
might know a better way to do things. Information is a two-way street.
You might want to read PostgreSQL Database Performance Tuning
first. It gives you a few ideas about database tuning, but not much on
tuning a query. PostgreSQL Hardware Performance Tuning
might also interest you.
Least of all, it should show people how PostgreSQL can do this, but mySQL can't... (that's not the
goal of the article, and I'm not going to attempt to prove it... please just accept that mySQL
can't do this type of thing...)
|
|
Explain it to me...
|
I'll be making use of the PostgreSQL EXPLAIN
command. That's found in the Reference Manual
of the PostgreSQL
online documentation.
If what appears below doesn't make sense, try reading the above references first.
EXPLAIN shows the execution plan of a statement. You get to see which tables will be used,
which indexes are involved, and how many rows are involved. Execution plans make use of statistics
gathered by the ANALYZE command (most often run as VACUUM ANALYZE). If you aren't getting the results
you want, try running VACUUM ANALYZE over the table[s] in question.
EXPLAIN ANALYZE goes one step further and actually executes the statement. This provides the actual run
time for the various parts of the query. Be warned that EXPLAIN ANALYZE will actually delete, update, or
insert any data involved in the query. Read the documentation for details.
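One way to guard against that, sketched here with a hypothetical table name, is to wrap the statement
in a transaction and roll it back; the timings are still reported, but the changes are discarded:
BEGIN;
-- the DELETE really executes, so the timings below are genuine
EXPLAIN ANALYZE
DELETE FROM some_table   -- hypothetical table
WHERE id < 1000;
-- ...but rolling back throws the deleted rows away
ROLLBACK;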
Here are a few things to look for when reading the output of EXPLAIN:
- Look for high costs
- Look for sequential scans
- Look for large numbers of rows
- Look for index scans
High costs indicate something which takes a lot of work to set up or compute. A sequential scan means the
entire table is read; perhaps an index is required. Large numbers of rows might indicate a place where
you can cut down the result set earlier in the query. Index scans are usually a good thing, as they tend
to speed up the query.
Items to note
- For small tables, an index costs more than it is worth in many queries. Indexes are only really
useful for larger data sets. Don't be upset if the query runs quickly but doesn't use an index.
- If you think an index should be used, but it isn't being touched, try a VACUUM ANALYZE on that table,
then test the query again, as sketched below.
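As a minimal sketch (using the commit_log table which appears later in this article; substitute your own table):
VACUUM ANALYZE commit_log;
EXPLAIN
SELECT *
FROM commit_log
WHERE commit_date > '2003-02-18';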
|
|
FreshPorts - what is it?
|
FreshPorts (and its sister site, FreshSource) capture
and store the commits to a source repository. They could work with any source repository,
but at present they are restricted to the FreeBSD
source tree. FreshPorts concentrates on the
Ports subtree, while FreshSource
captures and records the entire tree.
By its nature, FreshPorts is the more complex of the two sites. It was also the first.
Users create watch lists of whatever ports they wish to keep up to date with. When a commit
occurs against those ports, the user is sent an email notification. Users can also
view all commits recorded against a given port. Since a port can consist of several dozen files,
this information is very useful when tracking updates.
|
|
FreshPorts - the tables and sample data
|
The most important FreshPorts table is the commit_log table.
create table commit_log
(
id serial not null,
commit_date timestamp not null,
committer text not null,
primary key (id)
);
create index commit_log_commit_date on commit_log (commit_date);
The actual table contains more columns but the above is sufficient for demonstration purposes.
Each new commit against the FreeBSD source repository is loaded into this table. FreshPorts is only
interested in commits against the ports subtree. However, FreshSource and FreshPorts use the same database.
A second table is used to indicate commits against particular ports:
create table commit_log_ports
(
commit_log_id integer not null,
port_id integer not null,
primary key (commit_log_id, port_id)
);
alter table commit_log_ports
add foreign key (commit_log_id)
references commit_log (id) on update cascade on delete cascade;
alter table commit_log_ports
add foreign key (port_id)
references ports (id) on update cascade on delete cascade;
I will only mention the Ports table so that you know it exists and is referred to by many other tables.
The main problem at hand concerns the commit_log and
commit_log_ports tables.
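To make the relationship concrete, here is a rough sketch of how a single ports commit might be recorded
(the values are made up, and I am assuming the default sequence name created by the serial column on commit_log):
INSERT INTO commit_log (commit_date, committer)
VALUES ('2003-02-18 16:02:15', 'joe');
-- one row in commit_log_ports for each port touched by this commit
INSERT INTO commit_log_ports (commit_log_id, port_id)
VALUES (currval('commit_log_id_seq'), 1234);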
You can download the sample tables and data (859KB) used in this
article should you wish to play with the results. Please note that I have only supplied the columns necessary
to perform the queries indicated below. As such, your results will vary. NOTE: the examples were run on
version 7.3.
The tarball contains a .sql file which can be loaded into
your database with the following command:
createdb test
psql test < postgresql-analyze.sql
Feedback is appreciated.
|
|
So what's the problem?
|
FreshPorts, by default, displays the last 100 commits. If you didn't think about it, you'd think this
was the answer:
EXPLAIN ANALYZE
SELECT *
FROM commit_log
ORDER BY commit_date
LIMIT 100;
Limit (cost=0.00..48.12 rows=100 width=252) (actual time=0.06..2.40 rows=100 loops=1)
-> Index Scan using commit_log_commit_date on commit_log (cost=0.00..32975.88 rows=68533 width=252) (actual time=0.05..1.59 rows=101 loops=1)
Total runtime: 2.98 msec
(3 rows)
As you can see, it's rather fast to get the last 100 commits. Just under 3 ms. That's 0.003 seconds.
That's damn fast in my book. Mind you, there are only 68,533 rows in that table.
Note that the (3 rows) refers to the EXPLAIN ANALYZE output, not the number of rows in the
result set. Compare that to the 100 which appears in the Limit line of the output.
Unfortunately, that's the last 100 commits. Not the last 100 commits against the ports subtree.
For that, we have to use the commit_log_ports table.
EXPLAIN ANALYZE
SELECT distinct commit_log.id, commit_log.commit_date
FROM commit_log, commit_log_ports
WHERE commit_log.id = commit_log_ports.commit_log_id
ORDER BY commit_log.commit_date
LIMIT 100;
Limit (cost=13456.18..13463.68 rows=100 width=16) (actual time=4190.89..4193.10 rows=100 loops=1)
-> Unique (cost=13456.18..13938.43 rows=6430 width=16) (actual time=4190.88..4192.31 rows=101 loops=1)
-> Sort (cost=13456.18..13616.93 rows=64300 width=16) (actual time=4190.87..4191.39 rows=113 loops=1)
Sort Key: commit_log.commit_date, commit_log.id
-> Hash Join (cost=4103.81..7662.06 rows=64300 width=16) (actual time=1065.79..3150.16 rows=64300 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on commit_log_ports (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.11..434.34 rows=64300 loops=1)
-> Hash (cost=3032.33..3032.33 rows=68533 width=12) (actual time=1065.08..1065.08 rows=0 loops=1)
-> Seq Scan on commit_log (cost=0.00..3032.33 rows=68533 width=12) (actual time=0.10..645.86 rows=68544 loops=1)
Total runtime: 4195.82 msec
(10 rows)
OUCH! That's terrible. That's four seconds! Why did it take so long?
The query joins and sorts every row in the two tables, and only then takes the first one hundred rows.
That's quite inefficient, and it's far more work than we need. We can make things faster.
|
|
Don't search everything; provide a starting point
|
In this example, we're going to provide a starting place for the query. We aren't going
to do the JOIN on the whole table. We are going to limit it to the most recent commits.
Let's assume that in a given week, there are at least 100 commits against the ports tree.
We will do this by adding the following condition to the WHERE clause:
commit_log.commit_date > now() - interval '1 week'
The new query is:
EXPLAIN ANALYZE
SELECT distinct commit_log.id, commit_log.commit_date
FROM commit_log, commit_log_ports
WHERE commit_log.id = commit_log_ports.commit_log_id
AND commit_log.commit_date > now() - interval '1 week'
ORDER BY commit_log.commit_date
LIMIT 100;
Limit (cost=7425.72..7433.22 rows=100 width=16) (actual time=1238.13..1238.62 rows=21 loops=1)
-> Unique (cost=7425.72..7586.47 rows=2143 width=16) (actual time=1238.12..1238.44 rows=21 loops=1)
-> Sort (cost=7425.72..7479.31 rows=21433 width=16) (actual time=1238.11..1238.22 rows=28 loops=1)
Sort Key: commit_log.commit_date, commit_log.id
-> Hash Join (cost=3603.44..5883.85 rows=21433 width=16) (actual time=1175.11..1237.71 rows=28 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on commit_log_ports (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.10..420.66 rows=64300 loops=1)
-> Hash (cost=3546.33..3546.33 rows=22844 width=12) (actual time=419.03..419.03 rows=0 loops=1)
-> Seq Scan on commit_log (cost=0.00..3546.33 rows=22844 width=12) (actual time=0.19..418.47 rows=57 loops=1)
Filter: (commit_date > (now() - '7 days'::interval))
Total runtime: 1238.96 msec
(11 rows)
Now the query takes just over one second. That's not bad. But we can do better.
Let's now assume we know how far back in time the 100th previous port commit was.
It was on 18 February 2003, at 16:02:15.
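Where does that timestamp come from? Something like the following sketch would find it, although this
lookup is itself just as expensive as the query we are trying to avoid; it is shown only to illustrate
where the date originates:
SELECT DISTINCT commit_log.id, commit_log.commit_date
FROM commit_log, commit_log_ports
WHERE commit_log.id = commit_log_ports.commit_log_id
ORDER BY commit_log.commit_date DESC
OFFSET 99
LIMIT 1;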
EXPLAIN ANALYZE
SELECT distinct commit_log.id, commit_log.commit_date
FROM commit_log, commit_log_ports
WHERE commit_log.id = commit_log_ports.commit_log_id
AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
ORDER BY commit_log.commit_date
LIMIT 100;
Limit (cost=1562.07..1563.65 rows=21 width=16) (actual time=737.54..739.94 rows=100 loops=1)
-> Unique (cost=1562.07..1563.71 rows=22 width=16) (actual time=737.52..739.15 rows=101 loops=1)
-> Sort (cost=1562.07..1562.62 rows=219 width=16) (actual time=737.52..738.09 rows=145 loops=1)
Sort Key: commit_log.commit_date, commit_log.id
-> Hash Join (cost=181.33..1553.56 rows=219 width=16) (actual time=507.25..736.43 rows=146 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on commit_log_ports (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.10..419.82 rows=64300 loops=1)
-> Hash (cost=180.74..180.74 rows=233 width=12) (actual time=5.68..5.68 rows=0 loops=1)
-> Index Scan using commit_log_commit_date on commit_log (cost=0.00..180.74 rows=233 width=12) (actual time=0.34..4.51 rows=215 loops=1)
Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
Total runtime: 740.66 msec
(11 rows)
Why is this faster? The key is in the EXPLAIN output. In the previous example, the planner was merely
filtering. In this example, we give it a hard-and-fast starting point, as demonstrated
by the Index Cond: line. Essentially, we are providing a constant date instead of a function call.
That lets the planner use the index on commit_date and find what is needed much more quickly.
|
|
We can go faster
|
This isn't fast enough. What if I know what the last 100 port commits are and I have them in a table?
A table like this:
create table clp
(
commit_log_id integer not null,
commit_date timestamp not null,
primary key (commit_log_id)
);
alter table clp
add foreign key (commit_log_id)
references commit_log (id) on update cascade on delete cascade;
Note that this table is not included in the example tarball.
The table is named clp after the commit_log_ports
table.
We can populate the table with this query:
EXPLAIN ANALYZE
INSERT INTO clp
SELECT DISTINCT commit_log_id, commit_date
FROM commit_log, commit_log_ports
WHERE commit_log.id = commit_log_ports.commit_log_id
ORDER BY commit_log.commit_date desc
LIMIT 100;
Subquery Scan "*SELECT*" (cost=13456.18..13463.68 rows=100 width=16) (actual time=4165.77..4169.47 rows=100 loops=1)
-> Limit (cost=13456.18..13463.68 rows=100 width=16) (actual time=4165.75..4168.39 rows=100 loops=1)
-> Unique (cost=13456.18..13938.43 rows=6430 width=16) (actual time=4165.74..4167.59 rows=101 loops=1)
-> Sort (cost=13456.18..13616.93 rows=64300 width=16) (actual time=4165.73..4166.48 rows=145 loops=1)
Sort Key: commit_log.commit_date, commit_log_ports.commit_log_id
-> Hash Join (cost=4103.81..7662.06 rows=64300 width=16) (actual time=1066.31..3157.63 rows=64300 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on commit_log_ports (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.11..430.26 rows=64300 loops=1)
-> Hash (cost=3032.33..3032.33 rows=68533 width=12) (actual time=1065.58..1065.58 rows=0 loops=1)
-> Seq Scan on commit_log (cost=0.00..3032.33 rows=68533 width=12) (actual time=0.10..647.40 rows=68544 loops=1)
Total runtime: 4176.07 msec
(11 rows)
Of course, the cost of populating this table is relatively high, because it is essentially the same
expensive query we examined earlier. The ultimate goal is to maintain this table incrementally as
commits arrive. I'll explain that in a subsequent section. For now, let's just try out this table and
see how things go.
We will assume that new entries are added to this table all the time. If we query using this table:
EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
FROM commit_log, commit_log_ports, clp
WHERE commit_log.id = commit_log_ports.commit_log_id
AND clp.commit_log_id = commit_log.id
ORDER BY clp.commit_date desc
LIMIT 100;
Limit (cost=5703.16..5710.66 rows=100 width=20) (actual time=731.50..733.90 rows=100 loops=1)
-> Unique (cost=5703.16..5772.32 rows=922 width=20) (actual time=731.49..733.11 rows=100 loops=1)
-> Sort (cost=5703.16..5726.22 rows=9221 width=20) (actual time=731.48..732.05 rows=144 loops=1)
Sort Key: clp.commit_date, clp.commit_log_id
-> Hash Join (cost=3714.68..5095.91 rows=9221 width=20) (actual time=503.23..730.43 rows=144 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on commit_log_ports (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.09..417.49 rows=64300 loops=1)
-> Hash (cost=3712.18..3712.18 rows=1000 width=16) (actual time=7.03..7.03 rows=0 loops=1)
-> Nested Loop (cost=0.00..3712.18 rows=1000 width=16) (actual time=0.42..6.41 rows=100 loops=1)
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=0.10..0.82 rows=100 loops=1)
-> Index Scan using commit_log_pkey on commit_log (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
Index Cond: ("outer".commit_log_id = commit_log.id)
Total runtime: 734.72 msec
(13 rows)
Note also that to get the above time, you will have to do a VACUUM ANALYZE on
the clp table so that the query planner has
statistics for that table.
Look at where the time goes in this query: the sequential scan of commit_log_ports and the hash join
account for almost the entire runtime. We can use the same idea,
but structure the query in a different way, to improve performance.
EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
FROM clp, commit_log, commit_log_ports
WHERE clp.commit_log_id = commit_log.id
AND commit_log.id = commit_log_ports.commit_log_id
ORDER BY clp.commit_date desc
LIMIT 100;
Limit (cost=6538.88..6545.86 rows=93 width=20) (actual time=934.51..936.90 rows=100 loops=1)
-> Unique (cost=6538.88..6545.92 rows=94 width=20) (actual time=934.50..936.10 rows=100 loops=1)
-> Sort (cost=6538.88..6541.23 rows=938 width=20) (actual time=934.49..935.06 rows=144 loops=1)
Sort Key: clp.commit_date, clp.commit_log_id
-> Merge Join (cost=0.00..6492.57 rows=938 width=20) (actual time=922.63..933.37 rows=144 loops=1)
Merge Cond: ("outer".commit_log_id = "inner".commit_log_id)
-> Nested Loop (cost=0.00..3744.18 rows=1000 width=16) (actual time=0.50..6.95 rows=100 loops=1)
-> Index Scan using clp_pkey on clp (cost=0.00..52.00 rows=1000 width=12) (actual time=0.31..1.21 rows=100 loops=1)
-> Index Scan using commit_log_pkey on commit_log (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
Index Cond: ("outer".commit_log_id = commit_log.id)
-> Index Scan using clp_cli on commit_log_ports (cost=0.00..2469.87 rows=64300 width=4) (actual time=0.31..599.70 rows=64300 loops=1)
Total runtime: 937.68 msec
(12 rows)
Well, that's not quite what I wanted to do. But it does demonstrate a useful point:
rearranging the table names in the FROM clause can affect performance.
Now it's time to apply our old trick of providing a starting point. We will
restrict the commits to those after a certain time:
EXPLAIN ANALYZE
SELECT DISTINCT clp.commit_log_id, clp.commit_date
FROM clp, commit_log, commit_log_ports
WHERE commit_log.id = commit_log_ports.commit_log_id
AND clp.commit_log_id = commit_log.id
AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
ORDER BY clp.commit_date desc
LIMIT 100;
Limit (cost=299.65..299.67 rows=1 width=20) (actual time=13.27..15.66 rows=100 loops=1)
-> Unique (cost=299.65..299.67 rows=1 width=20) (actual time=13.26..14.87 rows=100 loops=1)
-> Sort (cost=299.65..299.66 rows=3 width=20) (actual time=13.25..13.81 rows=144 loops=1)
Sort Key: clp.commit_date, clp.commit_log_id
-> Nested Loop (cost=169.43..299.63 rows=3 width=20) (actual time=4.87..12.33 rows=144 loops=1)
-> Hash Join (cost=169.43..194.47 rows=3 width=16) (actual time=4.64..6.69 rows=100 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.77 rows=100 loops=1)
-> Hash (cost=168.97..168.97 rows=186 width=4) (actual time=4.52..4.52 rows=0 loops=1)
-> Index Scan using commit_log_commit_date on commit_log (cost=0.00..168.97 rows=186 width=4) (actual time=0.45..3.37 rows=215 loops=1)
Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
-> Index Scan using clp_cli on commit_log_ports (cost=0.00..38.58 rows=9 width=4) (actual time=0.02..0.03 rows=1 loops=100)
Index Cond: ("outer".commit_log_id = commit_log_ports.commit_log_id)
Total runtime: 16.44 msec
(14 rows)
Now that's fast. Very fast. In fact, it's more than 57 times faster.
|
|
Another approach
|
We now have a very fast query which will return the last 100 commits to the ports tree.
Let's now figure out how to get the list of ports associated with those commits.
EXPLAIN ANALYZE
SELECT clp.commit_log_id, clp.commit_date, ports.name
FROM clp, commit_log, commit_log_ports, ports
WHERE commit_log.id = commit_log_ports.commit_log_id
AND clp.commit_log_id = commit_log.id
AND commit_log.commit_date > '2003-02-18 16:02:15.133047-05'
AND commit_log_ports.port_id = ports.id
ORDER BY clp.commit_date desc
LIMIT 100;
Limit (cost=314.99..314.99 rows=2 width=78) (actual time=389.48..390.66 rows=100 loops=1)
-> Sort (cost=314.99..314.99 rows=3 width=78) (actual time=389.46..389.86 rows=101 loops=1)
Sort Key: clp.commit_date
-> Nested Loop (cost=169.43..314.97 rows=3 width=78) (actual time=40.67..387.90 rows=144 loops=1)
-> Nested Loop (cost=169.43..299.63 rows=3 width=24) (actual time=16.52..24.91 rows=144 loops=1)
-> Hash Join (cost=169.43..194.47 rows=3 width=16) (actual time=16.15..18.40 rows=100 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=10.55..11.38 rows=100 loops=1)
-> Hash (cost=168.97..168.97 rows=186 width=4) (actual time=5.48..5.48 rows=0 loops=1)
-> Index Scan using commit_log_commit_date on commit_log (cost=0.00..168.97 rows=186 width=4) (actual time=0.34..4.32 rows=215 loops=1)
Index Cond: (commit_date > '2003-02-18 16:02:15.133047-05'::timestamp with time zone)
-> Index Scan using clp_cli on commit_log_ports (cost=0.00..38.58 rows=9 width=8) (actual time=0.03..0.04 rows=1 loops=100)
Index Cond: ("outer".commit_log_id = commit_log_ports.commit_log_id)
-> Index Scan using ports_pkey on ports (cost=0.00..6.00 rows=1 width=54) (actual time=2.49..2.50 rows=1 loops=144)
Index Cond: ("outer".port_id = ports.id)
Total runtime: 391.55 msec
(16 rows)
That query only returned one hundred rows... but those are one hundred rows of the join, not one hundred
commits. We need a subselect here: we want to limit the result set to 100 commits, not 100 port entries.
This is the query we will use as a starting point:
EXPLAIN ANALYZE
SELECT clp.commit_log_id
FROM clp
ORDER BY clp.commit_date DESC
LIMIT 100;
Limit (cost=69.83..70.08 rows=100 width=12) (actual time=1.30..2.48 rows=100 loops=1)
-> Sort (cost=69.83..72.33 rows=1000 width=12) (actual time=1.29..1.69 rows=100 loops=1)
Sort Key: commit_date
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.68 rows=100 loops=1)
Total runtime: 3.02 msec
(5 rows)
With that, let's join it up with the commit_log table:
EXPLAIN ANALYZE
SELECT commit_log.id
FROM commit_log JOIN
(SELECT clp.commit_log_id
FROM clp
ORDER BY clp.commit_date DESC
LIMIT 100) AS TMP
ON commit_log.id = TMP.commit_log_id;
Nested Loop (cost=69.83..439.30 rows=100 width=8) (actual time=1.76..9.11 rows=100 loops=1)
-> Subquery Scan tmp (cost=69.83..70.08 rows=100 width=12) (actual time=1.38..3.50 rows=100 loops=1)
-> Limit (cost=69.83..70.08 rows=100 width=12) (actual time=1.37..2.59 rows=100 loops=1)
-> Sort (cost=69.83..72.33 rows=1000 width=12) (actual time=1.36..1.79 rows=100 loops=1)
Sort Key: commit_date
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=0.02..0.68 rows=100 loops=1)
-> Index Scan using commit_log_pkey on commit_log (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
Index Cond: (commit_log.id = "outer".commit_log_id)
Total runtime: 9.69 msec
(9 rows)
As you can see, that's still pretty fast, and still only 100 records. Now we have the rows from
commit_log which relate to the last 100 commits to the ports tree.
Let's get the ports touched by those commits.
EXPLAIN ANALYZE
SELECT commit_log.id
FROM commit_log JOIN
(SELECT clp.commit_log_id
FROM clp
ORDER BY clp.commit_date DESC
LIMIT 100) AS TMP
ON commit_log.id = TMP.commit_log_id, commit_log_ports
WHERE commit_log_ports.commit_log_id = commit_log.id;
Hash Join (cost=439.55..1810.22 rows=94 width=12) (actual time=504.90..731.12 rows=144 loops=1)
Hash Cond: ("outer".commit_log_id = "inner".id)
-> Seq Scan on commit_log_ports (cost=0.00..1048.00 rows=64300 width=4) (actual time=0.09..416.59 rows=64300 loops=1)
-> Hash (cost=439.30..439.30 rows=100 width=8) (actual time=9.64..9.64 rows=0 loops=1)
-> Nested Loop (cost=69.83..439.30 rows=100 width=8) (actual time=1.84..9.09 rows=100 loops=1)
-> Subquery Scan tmp (cost=69.83..70.08 rows=100 width=12) (actual time=1.44..3.56 rows=100 loops=1)
-> Limit (cost=69.83..70.08 rows=100 width=12) (actual time=1.43..2.63 rows=100 loops=1)
-> Sort (cost=69.83..72.33 rows=1000 width=12) (actual time=1.42..1.82 rows=100 loops=1)
Sort Key: commit_date
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=0.09..0.73 rows=100 loops=1)
-> Index Scan using commit_log_pkey on commit_log (cost=0.00..3.68 rows=1 width=4) (actual time=0.03..0.04 rows=1 loops=100)
Index Cond: (commit_log.id = "outer".commit_log_id)
Total runtime: 731.93 msec
(13 rows)
Wow! That is 75 times slower than before! Surely we can do better! I think the clue is the
rows=64300 entry. Let's try reducing that. In this particular
case, I know we're only looking at values within the clp table.
So let's create this function:
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
SELECT min(commit_log_id) AS RESULT
FROM clp;
' LANGUAGE SQL IMMUTABLE;
See also the note near the end of this article on how to make the above function faster.
Note that the above is for version 7.3, which first introduced the IMMUTABLE designation.
This function obtains the minimum commit_log_id value
from the clp table. Let's use that in our query:
EXPLAIN ANALYZE
SELECT commit_log_id, port_id FROM
(SELECT commit_log.id
FROM commit_log JOIN
(SELECT clp.commit_log_id
FROM clp
ORDER BY clp.commit_date DESC
LIMIT 100) AS TMP
ON commit_log.id = TMP.commit_log_id) AS TMP2 JOIN commit_log_ports
ON commit_log_ports.commit_log_id = TMP2.id
AND commit_log_ports.commit_log_id > Anchor_CLID();
Merge Join (cost=442.62..594.26 rows=1 width=16) (actual time=8.52..12.97 rows=141 loops=1)
Merge Cond: ("outer".id = "inner".commit_log_id)
-> Sort (cost=442.62..442.87 rows=100 width=8) (actual time=8.44..8.86 rows=100 loops=1)
Sort Key: commit_log.id
-> Nested Loop (cost=69.83..439.30 rows=100 width=8) (actual time=1.39..7.81 rows=100 loops=1)
-> Subquery Scan tmp (cost=69.83..70.08 rows=100 width=12) (actual time=1.32..3.41 rows=100 loops=1)
-> Limit (cost=69.83..70.08 rows=100 width=12) (actual time=1.31..2.51 rows=100 loops=1)
-> Sort (cost=69.83..72.33 rows=1000 width=12) (actual time=1.30..1.71 rows=100 loops=1)
Sort Key: commit_date
-> Seq Scan on clp (cost=0.00..20.00 rows=1000 width=12) (actual time=0.01..0.71 rows=100 loops=1)
-> Index Scan using commit_log_pkey on commit_log (cost=0.00..3.68 rows=1 width=4) (actual time=0.02..0.02 rows=1 loops=100)
Index Cond: (commit_log.id = "outer".commit_log_id)
-> Index Scan using clp_cli on commit_log_ports (cost=0.00..149.91 rows=39 width=8) (actual time=0.03..1.35 rows=141 loops=1)
Index Cond: (commit_log_id > 69061)
Total runtime: 13.81 msec
(15 rows)
I'd say that's fast enough for my needs. The key to this function is the IMMUTABLE
designation. It indicates that the function returns the same value for a given set of parameters. This
allows the query planner to make certain optimizations that it could not otherwise make. With this
designation, the query planner treats the function's return value as a constant, and therefore it
can make use of the available indexes.
Please refer to the CREATE FUNCTION documentation for more information.
In fact, my function isn't really IMMUTABLE. It's actually STABLE. If I change the function to stable,
the time goes up to nearly 19 seconds. If I add a < condition to the range, the time drops to just under
one second, but it never gets near to the speed of the IMMUTABLE function. Please read the CREATE FUNCTION
documentation and pick the designation which best suits your particular application.
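For reference, the STABLE variant discussed above differs only in the final keyword (a sketch, mirroring the function defined earlier):
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
   SELECT min(commit_log_id) AS RESULT
   FROM clp;
' LANGUAGE SQL STABLE;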
|
|
More about that clp table
|
For my specific needs, I'll be doing some more work on the clp table
in order to maintain it as things go along. That may turn out to be the subject of a future article.
For now, the starting point is a couple of rules which will maintain the data:
CREATE OR REPLACE RULE commit_log_ports_insert AS
ON INSERT TO commit_log_ports DO
INSERT INTO clp
SELECT new.commit_log_id,
(SELECT commit_date
FROM commit_log
WHERE id = new.commit_log_id)
WHERE NOT EXISTS
(SELECT commit_log_id
FROM clp
WHERE commit_log_id = new.commit_log_id);
CREATE OR REPLACE RULE commit_log_update AS
ON UPDATE TO commit_log DO
UPDATE clp
SET commit_date = new.commit_date
WHERE commit_log_id = new.id;
The first rule will keep populating the clp table as new ports tree commits are
added. It can handle multiple ports per commit by inserting only rows which do not already exist.
The second rule will keep the commit_date accurate should any updates
occur.
What about deletes? That's taken care of in the foreign key definition of the table. Any delete on the
primary table will delete the corresponding row from this table.
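A quick way to convince yourself the rules and the cascade are doing their job is a test like this
(the id values are made up; the commit_log_id must be an existing commit and the port_id an existing port):
-- the insert rule should copy this row into clp
INSERT INTO commit_log_ports (commit_log_id, port_id) VALUES (99999, 1234);
SELECT * FROM clp WHERE commit_log_id = 99999;
-- deleting the commit cascades through commit_log_ports and clp
DELETE FROM commit_log WHERE id = 99999;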
Here's a query which will retain only the latest 100 entries:
EXPLAIN ANALYZE
DELETE FROM clp
WHERE commit_log_id NOT IN (
    SELECT commit_log_id
    FROM clp
    ORDER BY commit_date DESC
    LIMIT 100);
Seq Scan on clp (cost=0.00..570.12 rows=50 width=6) (actual time=9.81..399.59 rows=400 loops=1)
Filter: (subplan)
SubPlan
-> Materialize (cost=5.62..5.62 rows=100 width=12) (actual time=0.02..0.38 rows=90 loops=500)
-> Limit (cost=5.37..5.62 rows=100 width=12) (actual time=6.73..7.99 rows=100 loops=1)
-> Sort (cost=5.37..5.62 rows=101 width=12) (actual time=6.72..7.14 rows=101 loops=1)
Sort Key: commit_date
-> Seq Scan on clp (cost=0.00..2.01 rows=101 width=12) (actual time=0.21..3.87 rows=500 loops=1)
Total runtime: 410.57 msec
(9 rows)
But as you can see, it takes quite some time. Luckily, with some help from the
PostgreSQL SQL mailing list, I was
able to find a better way:
EXPLAIN ANALYZE
DELETE FROM clp
WHERE commit_date < (SELECT commit_date
FROM clp
ORDER BY commit_date DESC
OFFSET 99
LIMIT 1);
Seq Scan on clp (cost=0.00..2.26 rows=34 width=6) (actual time=7.98..11.42 rows=400 loops=1)
Filter: (commit_date < $0)
InitPlan
-> Limit (cost=5.62..5.62 rows=1 width=8) (actual time=7.29..7.31 rows=1 loops=1)
-> Sort (cost=5.37..5.62 rows=101 width=8) (actual time=6.49..6.90 rows=101 loops=1)
Sort Key: commit_date
-> Seq Scan on clp (cost=0.00..2.01 rows=101 width=8) (actual time=0.21..3.73 rows=500 loops=1)
Total runtime: 15.36 msec
That's nearly 27 times faster than my original. Thanks folks.
This could be run from a perl script started by a cron job. It's not a critical task,
but it does keep things tidy. I'd also VACUUM FULL ANALYZE that table at least once a week, if not daily.
|
|
Making that function faster
|
Robert Treat just reminded me of a way to make that function
run faster. It uses an old trick I had forgotten about. The original query was:
SELECT min(commit_log_id) AS RESULT
FROM clp;
Aggregate (cost=9.25..9.25 rows=1 width=4) (actual time=5.22..5.23 rows=1 loops=1)
-> Seq Scan on clp (cost=0.00..8.00 rows=500 width=4) (actual time=0.02..3.02 rows=500 loops=1)
Total runtime: 5.31 msec
Let's try this instead:
SELECT commit_log_id
FROM clp
ORDER BY commit_log_id DESC
LIMIT 1;
Limit (cost=0.00..0.03 rows=1 width=4) (actual time=0.05..0.07 rows=1 loops=1)
-> Index Scan Backward using clp_pkey on clp (cost=0.00..12.50 rows=500 width=4) (actual time=0.04..0.05 rows=2 loops=1)
Total runtime: 0.15 msec
That new query is 35.4 times faster. This clearly demonstrates the cost of an aggregate function
versus the speed of an index scan. Mind you, it could also be argued that the aggregate functions should be able to use the index.
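Putting that trick back into the function gives something along these lines (a sketch; note the
ascending order, since Anchor_CLID() wants the minimum id, whereas the descending scan shown above
would return the maximum; the speed is the same either way):
CREATE OR REPLACE FUNCTION Anchor_CLID() RETURNS integer
AS '
   SELECT commit_log_id AS RESULT
   FROM clp
   ORDER BY commit_log_id
   LIMIT 1;
' LANGUAGE SQL IMMUTABLE;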
|
|
Is that it?
|
That's it. You've seen how adjusting the query and taking a different approach
can often improve the speed. It is not always clear how to adjust the query,
but now that you have seen a few examples, that should get you started.
|
|