SELECT name
FROM random
ORDER BY RAND()
LIMIT 1;
This example works fine and is fast if you only when let's say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don't forget: we only sort to throw nearly all the rows away.
I never liked it. And there are better ways to do it. Without a sorting. As long as we have a numeric primary key.
For the first examples we assume the be ID is starting at 1 and we have no holes between 1 and the maximum value of the ID.
move the work into the application
First idea: We can simplify the whole job if we calculate the ID beforehand in the application.
SELECT MAX(id) FROM random;
## generate random id in application
SELECT name FROM random WHERE id = <random-id>
As MAX(id) == COUNT(id) we just generate random number between 1 and MAX(id) and pass it into the database to retrieve the random row.
The first SELECT is a NO-OP and is optimized away. The second is a eq_ref
against a constant value and also very fast.
move the job into the database
But is it really necessary to do it in the application ? Can't we do it in the database ?
# generating a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
| 689.37582507297 |
+------------------+
# oops, this is a double, we need an int
> SELECT CEIL(RAND() * MAX(id)) FROM random;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
| 1000000 |
+-------------------------+
# better. But how is the performance:
> EXPLAIN
SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+-------+-------+------+-------------+
| 1 | SIMPLE | random | index | 1000000 | Using index |
+----+-------------+-------+-------+------+-------------+
## a index scan ? we lost our optimization for the MAX()
> EXPLAIN
SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+-------+------+------+------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | No tables used |
| 2 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## a simple Sub-Query is bringing us the performance back.
Ok, now we know how to generate the random ID, but how to get the row ?
> EXPLAIN
SELECT name
FROM random
WHERE id = (SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| 1 | PRIMARY | random | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message |
+-------+------+------------------------------------------+
| Note | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+
NO, NO, NO. Don't go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching. This leads to 0 to 4091 rows, depending on your luck.
We need a way to make sure that the random-id is only generated once:
SELECT name
FROM random JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM random)) AS id
) AS r2
USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+------------+--------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | 1 | |
| 1 | PRIMARY | random | const | 1 | |
| 2 | DERIVED | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just on single row. Perfect.
No Sorting, No Application, Most parts of the query optimized away.
adding holes to the numbers
To generalize the last solution we add the possibility of holes, like when you DELETE
rows.
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+------------+--------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | 1 | |
| 1 | PRIMARY | r1 | range | 689 | Using where |
| 2 | DERIVED | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
The JOIN now adds all the IDs which are greater or equal than our random value and selects only the direct neighboor if a direct match is not possible. BUT as soon as one row is found, we stop (the LIMIT 1
). And we read the rows according to the index (ORDER BY id ASC
). As we are using >=
instead of a =
we can get rid of a the CEIL
and get the same result with a little less work.
Equal Distribution
As soon as the distribution of the IDs is not equal anymore our selection of rows isn't really random either.
> select * from holes;
+----+----------------------------------+----------+
| id | name | accesses |
+----+----------------------------------+----------+
| 1 | d12b2551c6cb7d7a64e40221569a8571 | 107 |
| 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 50 |
| 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 132 |
| 8 | 677a196206d93cdf18c3744905b94f73 | 230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa | 481 |
+----+----------------------------------+----------+
The RAND
function is generating IDs like 9 to 15 which all lead to the id 16 to be selected as the next higher number.
There is no real solution for this problem, but your data is mostly constant you can add a mapping table which maps the row-number to the id:
> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
| 1 | 1 |
| 2 | 2 |
| 3 | 4 |
| 4 | 8 |
| 5 | 16 |
+--------+-----------+
The row_id
is now free of holes again and we can run our random query again:
SELECT name FROM holes
JOIN (SELECT r1.random_id
FROM holes_map AS r1
JOIN (SELECT (RAND() *
(SELECT MAX(row_id)
FROM holes_map)) AS row_id)
AS r2
WHERE r1.row_id >= r2.row_id
ORDER BY r1.row_id ASC
LIMIT 1) as rows ON (id = random_id);
After 1000 fetches we see a equal distribution again:
> select * from holes;
+----+----------------------------------+----------+
| id | name | accesses |
+----+----------------------------------+----------+
| 1 | d12b2551c6cb7d7a64e40221569a8571 | 222 |
| 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 187 |
| 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 195 |
| 8 | 677a196206d93cdf18c3744905b94f73 | 207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa | 189 |
+----+----------------------------------+----------+
Multiple Rows at once
If you want to get more than one row returned, you can:
- execute the Query several times
- write a stored procedure which is executing the query and stores the result in a temp-table
- (make a UNION)[http://jan.kneschke.de/2007/2/22/analyzing-complex-queries]
Performance
Now let's see what happends to our performance. We have 3 different queries for solving our problems.
- Q1. ORDER BY RAND()
- Q2. RAND() * MAX(ID)
- Q3. RAND() * MAX(ID) + ORDER BY ID
Q1 is expected to cost N * log2(N), Q2 and Q3 are nearly constant.
The get real values we filled the table with N rows ( one thousand to one million) and executed each query 1000 times.
100 1.000 10.000 100.000 1.000.000
Q1 0:00.718s 0:02.092s 0:18.684s 2:59.081s 58:20.000s
Q2 0:00.519s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
Q3 0:00.570s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
As you can see the plain ORDER BY RAND() is already behind the optimized query at only 100 rows in the table.
A more detailed analysis of those queries is at analyzing-complex-queries.