Inverted indexes store mappings from values within a container column (such as a JSONB
document) to the row that holds that value. They are used to speed up containment searches, e.g., "show me all of the rows from this table which have a JSON column that contains the key-value pair {"location":"NYC"}
". Inverted indexes are commonly used in document retrieval systems.
CockroachDB stores the contents of the following data types in inverted indexes:
JSONB
column, see the JSON tutorial.How do inverted indexes work?
Standard indexes work well for searches based on prefixes of sorted data. However, data types like JSONB
or arrays cannot be queried without a full table scan, since they do not adhere to ordinary value prefix comparison operators. JSONB
in particular needs to be indexed in a more detailed way than what a standard index provides. This is where inverted indexes prove useful.
Inverted indexes filter on components of tokenizable data. The JSONB
data type is built on two structures that can be tokenized:
- Objects - Collections of key-value pairs where each key-value pair is a token.
- Arrays - Ordered lists of values where every value in the array is a token.
For example, take the following JSONB
value in column person
:
{
"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"state": "NY",
"postalCode": "10021"
},
"cars": [
"Subaru",
"Honda"
]
}
An inverted index for this object would have an entry per component, mapping it back to the original object:
"firstName": "John"
"lastName": "Smith"
"age": 25
"address": "state": "NY"
"address": "postalCode": "10021"
"cars" : "Subaru"
"cars" : "Honda"
This lets you search based on subcomponents.
Creation
You can use inverted indexes to improve the performance of queries using JSONB
or ARRAY
columns. You can create them:
- At the same time as the table with the
INVERTED INDEX
clause ofCREATE TABLE
. - For existing tables with
CREATE INVERTED INDEX
. Using the following PostgreSQL-compatible syntax:
> CREATE INDEX <optional name> ON <table> USING GIN (<column>);
Selection
If a query contains a filter against an indexed JSONB
or ARRAY
column that uses any of the supported operators, the inverted index is added to the set of index candidates.
Because each query can use only a single index, CockroachDB selects the index it calculates will scan the fewest rows (i.e., the fastest). For more detail, check out our blog post Index Selection in CockroachDB.
To override CockroachDB's index selection, you can also force queries to use a specific index (also known as "index hinting") or use an inverted join hint.
Storage
CockroachDB stores indexes directly in your key-value store. You can find more information in our blog post Mapping Table Data to Key-Value Storage.
Locking
Tables are not locked during index creation thanks to CockroachDB's schema change procedure.
Performance
Indexes create a trade-off: they greatly improve the speed of queries, but slightly slow down writes (because new values have to be copied and sorted). The first index you create has the largest impact, but additional indexes only introduce marginal overhead.
Comparisons
JSONB
Inverted indexes on JSONB
columns support the following comparison operators:
- "is contained by":
<@
- "contains":
@>
"equals":
=
, but only when you've reached into the JSON document with the->
operator. For example:> SELECT * FROM a WHERE j ->'foo' = '"1"';
This is equivalent to using
@>
:> SELECT * FROM a WHERE j @> '{"foo": "1"}';
If you require comparisons using <
), <=
, etc., you can create an index on a computed column using your JSON payload, and then create a regular index on that. So if you wanted to write a query where the value of "foo" is greater than three, you would:
Create your table with a computed column:
> CREATE TABLE test ( id INT, data JSONB, foo INT AS ((data->>'foo')::INT) STORED );
Create an index on your computed column:
> CREATE INDEX test_idx ON test (foo);
Execute your query with your comparison:
> SELECT * FROM test where foo > 3;
Arrays
Inverted indexes on ARRAY
columns support the following comparison operators:
Partial inverted indexes
New in v21.1: You can create a partial inverted index, an inverted index on a subset of JSON
, ARRAY
, or geospatial container column data. Just like partial indexes that use non-container data types, create a partial inverted index by including a clause that evaluates to true on a boolean predicate, like a WHERE
clause.
CREATE TABLE test (
id INT,
data JSONB,
INVERTED INDEX idx_data(data) WHERE id > 10
);
Examples
Create a table with inverted index on a JSONB column
In this example, let's create a table with a JSONB
column and an inverted index:
> CREATE TABLE users (
profile_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
last_updated TIMESTAMP DEFAULT now(),
user_profile JSONB,
INVERTED INDEX user_details (user_profile)
);
Then, insert a few rows of data:
> INSERT INTO users (user_profile) VALUES
('{"first_name": "Lola", "last_name": "Dog", "location": "NYC", "online" : true, "friends" : 547}'),
('{"first_name": "Ernie", "status": "Looking for treats", "location" : "Brooklyn"}'),
('{"first_name": "Carl", "last_name": "Kimball", "location": "NYC", "breed": "Boston Terrier"}'
);
> SELECT *, jsonb_pretty(user_profile) FROM users;
+--------------------------------------+----------------------------------+--------------------------------------------------------------------------+------------------------------------+
| profile_id | last_updated | user_profile | jsonb_pretty |
+--------------------------------------+----------------------------------+--------------------------------------------------------------------------+------------------------------------+
| 81330a51-80b2-44aa-b793-1b8d84ba69c9 | 2018-03-13 18:26:24.521541+00:00 | {"breed": "Boston Terrier", "first_name": "Carl", "last_name": | { |
| | | "Kimball", "location": "NYC"} | |
| | | | "breed": "Boston Terrier", |
| | | | "first_name": "Carl", |
| | | | "last_name": "Kimball", |
| | | | "location": "NYC" |
| | | | } |
| 81c87adc-a49c-4bed-a59c-3ac417756d09 | 2018-03-13 18:26:24.521541+00:00 | {"first_name": "Ernie", "location": "Brooklyn", "status": "Looking for | { |
| | | treats"} | |
| | | | "first_name": "Ernie", |
| | | | "location": "Brooklyn", |
| | | | "status": "Looking for treats" |
| | | | } |
| ec0a4942-b0aa-4a04-80ae-591b3f57721e | 2018-03-13 18:26:24.521541+00:00 | {"first_name": "Lola", "friends": 547, "last_name": "Dog", "location": | { |
| | | "NYC", "online": true} | |
| | | | "first_name": "Lola", |
| | | | "friends": 547, |
| | | | "last_name": "Dog", |
| | | | "location": "NYC", |
| | | | "online": true |
| | | | } |
+--------------------------------------+----------------------------------+--------------------------------------------------------------------------+------------------------------------+
(3 rows)
Now, run a query that filters on the JSONB
column:
> SELECT * FROM users where user_profile @> '{"location":"NYC"}';
+--------------------------------------+----------------------------------+--------------------------------------------------------------------------+
| profile_id | last_updated | user_profile |
+--------------------------------------+----------------------------------+--------------------------------------------------------------------------+
| 81330a51-80b2-44aa-b793-1b8d84ba69c9 | 2018-03-13 18:26:24.521541+00:00 | {"breed": "Boston Terrier", "first_name": "Carl", "last_name": |
| | | "Kimball", "location": "NYC"} |
| ec0a4942-b0aa-4a04-80ae-591b3f57721e | 2018-03-13 18:26:24.521541+00:00 | {"first_name": "Lola", "friends": 547, "last_name": "Dog", "location": |
| | | "NYC", "online": true} |
+--------------------------------------+----------------------------------+--------------------------------------------------------------------------+
(2 rows)
Add an inverted index to a table with an array column
In this example, let's create a table with an ARRAY
column first, and add the inverted index later:
> CREATE TABLE students (
student_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
marks INT ARRAY
);
Insert a few rows of data:
> INSERT INTO students (marks) VALUES
(ARRAY[10,20,50]),
(ARRAY[20,40,100]),
(ARRAY[100,20,70]
);
> SELECT * FROM students;
+--------------------------------------+--------------+
| student_id | marks |
+--------------------------------------+--------------+
| 11cdc77c-2f12-48d4-8bb4-ddee7c705e00 | {10,20,50} |
| | |
| 2526c746-0b32-4f6b-a2b4-7ce6d411c1c2 | {20,40,100} |
| | |
| eefdc32e-4485-45ca-9df1-80c0f42d73c0 | {100,20,70} |
| | |
+--------------------------------------+--------------+
(3 rows)
Now, let’s add an inverted index to the table and run a query that filters on the ARRAY
:
> CREATE INVERTED INDEX student_marks ON students (marks);
> SELECT * FROM students where marks @> ARRAY[100];
+--------------------------------------+--------------+
| student_id | marks |
+--------------------------------------+--------------+
| 2526c746-0b32-4f6b-a2b4-7ce6d411c1c2 | {20,40,100} |
| | |
| eefdc32e-4485-45ca-9df1-80c0f42d73c0 | {100,20,70} |
| | |
+--------------------------------------+--------------+
(2 rows)
Create a table with a partial inverted index on a JSONB column
In the same users
table in the previous example create a partial inverted index for online users.
CREATE INVERTED INDEX idx_online_users ON users(user_profile) WHERE user_profile -> 'online' = 'true';
SELECT * FROM users WHERE user_profile -> 'online' = 'true';
profile_id | last_updated | user_profile
---------------------------------------+-------------------------------------+------------------------------------------------------------------------------------------------
b6df0cae-d619-4a08-ab4f-2815da7b981f | 2021-04-13 20:54:35.660734+00:00:00 | {"first_name": "Lola", "friends": 547, "last_name": "Dog", "location": "NYC", "online": true}
(1 row)
Time: 2ms total (execution 2ms / network 0ms)
Now, use index hinting with the partial inverted index.
SELECT * FROM users@idx_online_users WHERE user_profile->'online' = 'true' AND user_profile->'location' = '"NYC"';
profile_id | last_updated | user_profile
---------------------------------------+-------------------------------------+------------------------------------------------------------------------------------------------
ea1db57e-51c3-449d-b928-adab11191085 | 2021-04-14 20:45:39.960443+00:00:00 | {"first_name": "Lola", "friends": 547, "last_name": "Dog", "location": "NYC", "online": true}
(1 row)
Time: 2ms total (execution 2ms / network 0ms)
Inverted join examples
To run these examples, initialize a demo cluster with the MovR workload.
Start the MovR database on a 3-node CockroachDB demo cluster with a larger data set.
cockroach demo movr --num-histories 250000 --num-promo-codes 250000 --num-rides 125000 --num-users 12500 --num-vehicles 3750 --nodes 3
Create an inverted index on the vehicles
table's ext
column.
CREATE INVERTED INDEX idx_vehicle_details ON vehicles(ext);
Check the statement plan for a SELECT
statement that uses an inner inverted join.
EXPLAIN SELECT * FROM vehicles@primary AS v2 INNER INVERTED JOIN vehicles@idx_vehicle_details AS v1 ON v1.ext @> v2.ext;
info
-------------------------------------------------------------------------------------------
distribution: full
vectorized: true
• lookup join
│ table: vehicles@primary
│ equality: (city, id) = (city,id)
│ equality cols are key
│ pred: ext @> ext
│
└── • inverted join
│ table: vehicles@idx_vehicle_details
│
└── • scan
estimated row count: 3,750 (100% of the table; stats collected 3 minutes ago)
table: vehicles@primary
spans: FULL SCAN
(16 rows)
Time: 1ms total (execution 1ms / network 0ms)
You can omit the INNER INVERTED JOIN
statement by putting v1.ext
on the left side of a @>
join condition in a WHERE
clause and using an index hint for the inverted index.
EXPLAIN SELECT * FROM vehicles@idx_vehicle_details AS v1, vehicles AS v2 WHERE v1.ext @> v2.ext;
info
--------------------------------------------------------------------------------------------
distribution: full
vectorized: true
• lookup join
│ table: vehicles@primary
│ equality: (city, id) = (city,id)
│ equality cols are key
│ pred: ext @> ext
│
└── • inverted join
│ table: vehicles@idx_vehicle_details
│
└── • scan
estimated row count: 3,750 (100% of the table; stats collected 12 minutes ago)
table: vehicles@primary
spans: FULL SCAN
(16 rows)
Time: 1ms total (execution 1ms / network 0ms)
Use the LEFT INVERTED JOIN
hint to perform a left inverted join.
EXPLAIN SELECT * FROM vehicles AS v2 LEFT INVERTED JOIN vehicles AS v1 ON v1.ext @> v2.ext;
info
--------------------------------------------------------------------------------------------
distribution: full
vectorized: true
• lookup join (left outer)
│ table: vehicles@primary
│ equality: (city, id) = (city,id)
│ equality cols are key
│ pred: ext @> ext
│
└── • inverted join (left outer)
│ table: vehicles@idx_vehicle_details
│
└── • scan
estimated row count: 3,750 (100% of the table; stats collected 16 minutes ago)
table: vehicles@primary
spans: FULL SCAN
(16 rows)
Time: 2ms total (execution 2ms / network 0ms)