move the work into the application

move the job into the database

adding holes to the numbers

Multiple Rows at once

Performance

易栈网-膘叔

move the work into the application

move the job into the database

adding holes to the numbers

Multiple Rows at once

Performance

Equal Distribution

ORDER BY RAND()

Equal Distribution

昨天所说的power shell的PDF

顶想招聘

日志分类

热门标签

日志归档

搜索文章

最新评论

博客信息

友情链接

本文作者好象是写mysql proxy的，这是他写的关于order by rand的效率方面的文章。事实上我们在使用中是尽量避免采用order by rand这种的，但应该如何写，他给了我们一些其他的解决方法，并将其效率进行了分析，我觉得很有用，就留下来了。

原文网址：http://jan.kneschke.de/2007/2/15/order-by-rand

内容如下：
If you read the MySQL manual you might have seen the ORDER BY RAND() to randomize the the rows and using the LIMIT 1 to just take one of the rows.

SELECT name
  FROM random
 ORDER BY RAND()
 LIMIT 1;

This example works fine and is fast if you only when let's say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don't forget: we only sort to throw nearly all the rows away.

I never liked it. And there are better ways to do it. Without a sorting. As long as we have a numeric primary key.

For the first examples we assume the be ID is starting at 1 and we have no holes between 1 and the maximum value of the ID.

First idea: We can simplify the whole job if we calculate the ID beforehand in the application.

SELECT MAX(id) FROM random;
## generate random id in application
SELECT name FROM random WHERE id = <random-id>

As MAX(id) == COUNT(id) we just generate random number between 1 and MAX(id) and pass it into the database to retrieve the random row.

The first SELECT is a NO-OP and is optimized away. The second is a eq_ref against a constant value and also very fast.

But is it really necessary to do it in the application ? Can't we do it in the database ?

# generating a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# oops, this is a double, we need an int

> SELECT CEIL(RAND() * MAX(id)) FROM random;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
|                    1000000  |
+-------------------------+
# better. But how is the performance:

> EXPLAIN 
   SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type  | rows | Extra       |
+----+-------------+-------+-------+------+-------------+
|  1 | SIMPLE      | random  | index | 1000000  | Using index |
+----+-------------+-------+-------+------+-------------+
## a index scan ? we lost our optimization for the MAX()

> EXPLAIN 
   SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## a simple Sub-Query is bringing us the performance back.

Ok, now we know how to generate the random ID, but how to get the row ?

> EXPLAIN
SELECT name
  FROM random
 WHERE id = (SELECT CEIL(RAND() *
                         (SELECT MAX(id)
                            FROM random));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+

NO, NO, NO. Don't go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching. This leads to 0 to 4091 rows, depending on your luck.

We need a way to make sure that the random-id is only generated once:

SELECT name
  FROM random JOIN
       (SELECT CEIL(RAND() *
                    (SELECT MAX(id)
                       FROM random)) AS id
        ) AS r2
       USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | random     | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just on single row. Perfect.

No Sorting, No Application, Most parts of the query optimized away.

To generalize the last solution we add the possibility of holes, like when you DELETE rows.

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

The JOIN now adds all the IDs which are greater or equal than our random value and selects only the direct neighboor if a direct match is not possible. BUT as soon as one row is found, we stop (the LIMIT 1). And we read the rows according to the index (ORDER BY id ASC). As we are using >= instead of a = we can get rid of a the CEIL and get the same result with a little less work.

As soon as the distribution of the IDs is not equal anymore our selection of rows isn't really random either.

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+

The RAND function is generating IDs like 9 to 15 which all lead to the id 16 to be selected as the next higher number.

There is no real solution for this problem, but your data is mostly constant you can add a mapping table which maps the row-number to the id:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+

The row_id is now free of holes again and we can run our random query again:

SELECT name FROM holes
  JOIN (SELECT r1.random_id
         FROM holes_map AS r1
         JOIN (SELECT (RAND() *
                      (SELECT MAX(row_id)
                         FROM holes_map)) AS row_id)
               AS r2
        WHERE r1.row_id >= r2.row_id
        ORDER BY r1.row_id ASC
        LIMIT 1) as rows ON (id = random_id);

After 1000 fetches we see a equal distribution again:

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+

If you want to get more than one row returned, you can:

execute the Query several times
write a stored procedure which is executing the query and stores the result in a temp-table
(make a UNION)[http://jan.kneschke.de/2007/2/22/analyzing-complex-queries]

Now let's see what happends to our performance. We have 3 different queries for solving our problems.

Q1. ORDER BY RAND()
Q2. RAND() * MAX(ID)
Q3. RAND() * MAX(ID) + ORDER BY ID

Q1 is expected to cost N * log2(N), Q2 and Q3 are nearly constant.

The get real values we filled the table with N rows ( one thousand to one million) and executed each query 1000 times.

100        1.000      10.000     100.000    1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s   0:00.637s

As you can see the plain ORDER BY RAND() is already behind the optimized query at only 100 rows in the table.

A more detailed analysis of those queries is at analyzing-complex-queries.

顶想，流年，这些关键字是不是让你有什么熟悉的地方？topthink，流年的想法终于付诸实施了，虽然他并没有要求我发这个贴子和链接，不过，作为THINKPHP的用户，也希望更多的人走进来，试想fleaphp(QeePHP)也是以公司形式在运作，如果没有大量的人来做这些事，光靠个人，可能很难坚持下去。所以，我支持一下流年。。。

以下是招聘启示（顺便BS一下，居然用FLASH，害得我差点没法复制。。。）：

上海顶想信息科技有限公司致力于ThinkPHP的运作推广、支持和项目以及产品开发。
为了ThinkPHP的长足发展和项目需要，顶想信息科技公开招聘官方技术团队和PHP开发精英。
无论你是PHP菜鸟还是高手，只要你对ThinkPHP充满热情，都欢迎加入我们的团队。
我们追求的是快乐，做你真正喜欢做的事。
我们追求的是效率，做正确的事而不总是加班。
我们追求的是尊重，人是我们最大的财富。
我们追求的是成长，发现，培养，和保留最优秀的人。
如果你想追随自己的梦想，如果你想超越不可能，如果你想达到人生的理想，请考虑一下这个机会！有超过10w的用户在使用ThinkPHP，做为ThinkPHP官方的核心团队成员，你将会伴随着TP一起成长。我们希望你在这个过程中能成长为一名经验丰富的项目负责人，能够独当一面，开发大型创新性项目。我们也希望你能够同时培养起较强的商业直觉和沟通管理才能。

我们的人才观
1. 以人为本，唯才是用
2. 能者居之，能进能出
3. 德才兼备，共同发展

我们需要什么样的人才？
1、有实际WEB2.0相关项目经验的PHP开发人员；
2、进行TP的文档编写、示例制作、宣传推广和培训支持的技术支持人员；
3、熟悉其他开发框架、对框架开发有强烈兴趣的开发人员

基本要求：
1、热爱ThinkPHP
2、熟悉PHP面向对象和MVC开发
3、良好的编程习惯和思想
4、独立的开发能力
5、有ThinkPHP项目开发经验优先
工资待遇：3k~6k

公司简介：上海顶想信息科技有限公司，主要从事ThinkPHP的运作推广、支持和项目以及产品开发。
工作地点：上海徐汇区钦州路108弄梓树园6号楼203室

有意向请投简历到：hr@topthink.com 或者联系QQ:130770305（请注明应聘）

顺便说一下，离我现在的单位真的不远呀。。。。看来以后有机会跑过去蹭饭了

« 2025年12月 »