move the work into the application

move the job into the database

adding holes to the numbers

Multiple Rows at once

Performance

易栈网-膘叔

move the work into the application

move the job into the database

adding holes to the numbers

Multiple Rows at once

Performance

Equal Distribution

笔记：json_decode

ORDER BY RAND()

Equal Distribution

EditPlus的一些其他技巧

日志分类

热门标签

日志归档

搜索文章

最新评论

博客信息

友情链接

今天同事突然和我说，腾讯微博无法访问了。但新浪微博接口却是正常的。
打开指定的URL，发现报错，仔细看了一下代码，居然是
if(!is_array($jsonDecode)){
throw new Exception("xxxx");
}
当发现json解析出来不是数组的时候，直接抛出异常。
可是我把json字符串在decode前输出的时候是正常的，那是为什么呢？为什么我var_dump出来是null?
看了一下手册，居然说var_dump出来null的原因是因为deepth不够，但看了一下手册，好象也就三个取值范围，128,512,1024，于是我设为最大的1024，但仍然失败。

然后，突然发现，居然json_decode有第四个参数，把BIGINT当成字符串。以为是这个问题导致的，兴冲冲的把这个参数加上去，结果报错，看了一下手册，原来这个参数居然是PHP 5.4才有的。。。

----
再测试，把文本拷成字符串，运行json_decode，居然正常了。这时候我就纳闷了。为什么直接处理不正常。拷成字符串就正常了。
仔细检查。居然发现里面有中文被截断的字符串。晶，肯定是这个原因了。。。
于是写正则，把这玩意替换掉，把&#65\d{3};的字符串都替换成空格。然后再反处理回去。。结果仍然失败，这时候我就在想一个问题，为什么会有这种情况？

最终我确定，绝对是编码，一定是那个不正常的字符串导致了原来的json字符串被错误的识别成了ISO-8900-1，所以我啥也没处理，直接把json字符串mb_convert_encoding($str,"UTF-8","auto")，自动转换成utf-8，果然json_decode，顺利的转换成了数组。

---
纠结了一天的事情，居然就是这样的小，所幸解决了。

本文作者好象是写mysql proxy的，这是他写的关于order by rand的效率方面的文章。事实上我们在使用中是尽量避免采用order by rand这种的，但应该如何写，他给了我们一些其他的解决方法，并将其效率进行了分析，我觉得很有用，就留下来了。

原文网址：http://jan.kneschke.de/2007/2/15/order-by-rand

内容如下：
If you read the MySQL manual you might have seen the ORDER BY RAND() to randomize the the rows and using the LIMIT 1 to just take one of the rows.

SELECT name
  FROM random
 ORDER BY RAND()
 LIMIT 1;

This example works fine and is fast if you only when let's say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don't forget: we only sort to throw nearly all the rows away.

I never liked it. And there are better ways to do it. Without a sorting. As long as we have a numeric primary key.

For the first examples we assume the be ID is starting at 1 and we have no holes between 1 and the maximum value of the ID.

First idea: We can simplify the whole job if we calculate the ID beforehand in the application.

SELECT MAX(id) FROM random;
## generate random id in application
SELECT name FROM random WHERE id = <random-id>

As MAX(id) == COUNT(id) we just generate random number between 1 and MAX(id) and pass it into the database to retrieve the random row.

The first SELECT is a NO-OP and is optimized away. The second is a eq_ref against a constant value and also very fast.

But is it really necessary to do it in the application ? Can't we do it in the database ?

# generating a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# oops, this is a double, we need an int

> SELECT CEIL(RAND() * MAX(id)) FROM random;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
|                    1000000  |
+-------------------------+
# better. But how is the performance:

> EXPLAIN 
   SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type  | rows | Extra       |
+----+-------------+-------+-------+------+-------------+
|  1 | SIMPLE      | random  | index | 1000000  | Using index |
+----+-------------+-------+-------+------+-------------+
## a index scan ? we lost our optimization for the MAX()

> EXPLAIN 
   SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## a simple Sub-Query is bringing us the performance back.

Ok, now we know how to generate the random ID, but how to get the row ?

> EXPLAIN
SELECT name
  FROM random
 WHERE id = (SELECT CEIL(RAND() *
                         (SELECT MAX(id)
                            FROM random));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+

NO, NO, NO. Don't go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching. This leads to 0 to 4091 rows, depending on your luck.

We need a way to make sure that the random-id is only generated once:

SELECT name
  FROM random JOIN
       (SELECT CEIL(RAND() *
                    (SELECT MAX(id)
                       FROM random)) AS id
        ) AS r2
       USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | random     | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just on single row. Perfect.

No Sorting, No Application, Most parts of the query optimized away.

To generalize the last solution we add the possibility of holes, like when you DELETE rows.

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

The JOIN now adds all the IDs which are greater or equal than our random value and selects only the direct neighboor if a direct match is not possible. BUT as soon as one row is found, we stop (the LIMIT 1). And we read the rows according to the index (ORDER BY id ASC). As we are using >= instead of a = we can get rid of a the CEIL and get the same result with a little less work.

As soon as the distribution of the IDs is not equal anymore our selection of rows isn't really random either.

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+

The RAND function is generating IDs like 9 to 15 which all lead to the id 16 to be selected as the next higher number.

There is no real solution for this problem, but your data is mostly constant you can add a mapping table which maps the row-number to the id:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+

The row_id is now free of holes again and we can run our random query again:

SELECT name FROM holes
  JOIN (SELECT r1.random_id
         FROM holes_map AS r1
         JOIN (SELECT (RAND() *
                      (SELECT MAX(row_id)
                         FROM holes_map)) AS row_id)
               AS r2
        WHERE r1.row_id >= r2.row_id
        ORDER BY r1.row_id ASC
        LIMIT 1) as rows ON (id = random_id);

After 1000 fetches we see a equal distribution again:

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+

If you want to get more than one row returned, you can:

execute the Query several times
write a stored procedure which is executing the query and stores the result in a temp-table
(make a UNION)[http://jan.kneschke.de/2007/2/22/analyzing-complex-queries]

Now let's see what happends to our performance. We have 3 different queries for solving our problems.

Q1. ORDER BY RAND()
Q2. RAND() * MAX(ID)
Q3. RAND() * MAX(ID) + ORDER BY ID

Q1 is expected to cost N * log2(N), Q2 and Q3 are nearly constant.

The get real values we filled the table with N rows ( one thousand to one million) and executed each query 1000 times.

100        1.000      10.000     100.000    1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s   0:00.637s

As you can see the plain ORDER BY RAND() is already behind the optimized query at only 100 rows in the table.

A more detailed analysis of those queries is at analyzing-complex-queries.

昨天我转载了别人的一篇editpllus的技巧，现在我也将我常用的一些技巧放出来，算是和大家共享吧。
1、选中一列内容

选中一列内容在从excel等其他文件格式导成Txt的时候特别有用，比如我们不要当中某列数据的时候，确实有点痛苦，如果有规律，我们还能写正则进行替换，如果没有规则怎么办？
1.先关闭自动换行（菜单栏上有一个W，或者按快捷键ctrl+shift+w）
2.按住 alt键，用鼠标进行选择，你就会发现你选择的内容是屏幕中的某一块区域，而不是平时那种按照行来进行选择的了

2、autocomplete
对于我们写PHP代码的人来说，自动完成这个功能，很重要，有的变量经常用，但是打起来很繁，比如：$_POST;$_GET;$_REQUEST等，象这些变量，在写的时候要按着shift键才有用，如果经常写，我当然是希望有快捷的方式了。
1.点击菜单Tools->Preferences然后选中Files->setting&syntax，选择PHP
2.下面有两个选择的地方哦：setting&syntax和auto completion，这两个，一个是语法文件，一个是自动完成文件，有了这两个文件，你的代码就会被自动着色，函数和class等的着色就会和平常不一样，syntax文件官方有下载，但autocomplteion官方就没有了，不过你可以自己写，我就是自己写的。

格式如下：
#TITLE=PHP
#CASE=y

#T=h_401
header( 'HTTP/1.0 401 Unauthorized' );
exit();
#T=h_404
header( 'HTTP/1.0 404 Not Found' );
exit();

#T代表了你要输入的字符，下面的内容就是你输完后，在屏幕上输出的内容
比如，我输入h_401，在我按空格后，屏幕上就会立即跳出：

header( 'HTTP/1.0 401 Unauthorized' );
exit();

多方便呀。

顺便透露一下，htmlbar.acp其实就是一些快捷键的自定义哦，你也可以通过修改这个文件，达到对菜单的自定义。

3、页面函数显示
editplus太轻量级了，没有办法象IDE那样在打开某个文件后还能显示文件中的class和function，但也有一个简单的替代方法，那就是快捷键：ctrl+F11，你就可以快速查看页面中的function了，好象。。。。对public function aaa()这样的无效。这样的话也就是说PHP5文件中的function是看不到的。唉。没办法了。。。

先介绍这一点。功能有太多。慢慢来喽

« 2025年07月 »