Public RUM repository is available on https://github.com/postgrespro/rum. We know several bugs in RUM, hope to fix them soon.
Table pglist contains about million posts from postgres mailing lists, fts - is a fts representation of post - tsvector (subject, body_plain, author).
I created several fts indices:
1. GIN index
2. RUM index - (GIN + positions as addinfo)
3. RUM index with sent time as addinfo
4. as 3, but postings are sorted by sent time
RUM index is bigger than GIN, since it contains more information, which allow to speedup certain types of queries, namely:
2. ranking fts - index returns results in the order of position based relevance.
3. fts results ordered by sent time (latest fts results) - index returns results in order of sent time, stored in postings
4. even faster than 3, since postings are stored in the sent time order
The price for speedup is the size of index !
\d pglist Table "public.pglist" Column | Type | Modifiers ------------+-----------------------------+----------- id | integer | sent | timestamp without time zone | subject | text | author | text | body_plain | text | fts | tsvector | Indexes: "pglist_fts_gin_idx4" gin (fts) "pglist_fts_rum_idx4" rum (fts rum_tsvector_ops) "pglist_fts_ts_order_rum_idx4" rum (fts rum_tsvector_timestamp_ops, sent) WITH (attach=sent, "to"=fts, order_by_attach=t) "pglist_fts_ts_rum_idx4" rum (fts rum_tsvector_timestamp_ops, sent) WITH (attach=sent, "to"=fts) "pglist_sent_idx" btree (sent) select pg_size_pretty(sum(pg_column_size(fts))) as fts,pg_size_pretty(pg_table_size('pglist_fts_gin_idx4')) as gin, pg_size_pretty(pg_table_size('pglist_fts_rum_idx4')) as rum, pg_size_pretty(pg_table_size('pglist_fts_ts_rum_idx4')) as rum_ts, pg_size_pretty(pg_table_size('pglist_fts_ts_order_rum_idx4')) as rum_ts_order from pglist; fts | gin | rum | rum_ts | rum_ts_order ---------+--------+--------+---------+-------------- 1302 MB | 535 MB | 980 MB | 1531 MB | 1921 MB (1 row)
RUM access method was implemented as an extension using Extensible Access Methods in 9.6. GENERIC WAL in current state of art doesn't recognizes difference between pages with shifts, so RUM could generate a lot of WAL traffic, for example, for RUM index with sent time ordered by sent time CREATE INDEX generates about 186 GB of WAL (compare to 1921 MB index size), since postings now arranged not by their location on page and almost each posting generates new wal record.
I calculated WAL traffic like this:
CREATE INDEX ....
The number of postings could be obtained by this query:
select sum(length(fts)) from pglist; sum ---------- 84472729 (1 row)
So, if we assume average fullness as 75% and each posting generates traffic about half of page, then we could estimate WAL-traffic as
84472729*4096/1024/1024/1024*0.75 ~ 242 GB, which is close to 186 GB :)
Using some optimization we were able to reduce traffic to 15 GB, which is still high, compare to index size (1.9 GB). Finally, we got an idea to write WAL records after index created and we got WAL traffic 1.5GB, which is even smaller than the index.
Notice, that this valid only for CREATE INDEX, insertions to RUM will generate big traffic unless GENERIC WAL become more smart.
I think the same optimization we could apply to GiST, GIN and SP-GIST. Saving WAL-traffic on my SSD-based system had low impact on time of index creation, but I suspect it should be noticeable on real hard drives.