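A while back I posted an article by Zhang Yan (张宴) about Google's architecture. His article quoted a PDF shared through Google Docs, so I copied the whole thing over as it was.
As a result, a number of people complained to me that the page froze IE as soon as it opened.
After reading the article I had tried Google Docs myself: it shares PDF and PPT files in a slideshow-like viewer, and the embed is done with an iframe.
So when friends complained, I guessed that the iframe might be the cause.
I then moved the content into the article page itself, and everything was back to normal. oh yeah...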
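This is a fairly old word segmentation program. Some of the links in the original post now either point to the wrong address or no longer open, which shows just how old it is.
On top of that, doing word segmentation directly in PHP performs poorly to begin with, so I suggest using it only for small jobs, such as automatically adding tags.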
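For the tagging use case mentioned above, a rough sketch of how a segmentation result could be turned into tag candidates is shown here. It is not part of the original program: `suggest_tags` is a hypothetical helper, it assumes the words are already split and encoded as UTF-8, and it simply ranks multi-character words by how often they occur.

```php
<?php
// Hypothetical helper (not from the original post): pick the most frequent
// multi-character words from a segmentation result as tag candidates.
function suggest_tags(array $words, int $limit = 5): array {
    $counts = array();
    foreach ($words as $w) {
        // Keep only tokens of two or more letters/digits (assumes UTF-8 input);
        // single characters and punctuation are skipped.
        if (!preg_match('/^[\p{L}\p{N}]{2,}$/u', $w)) continue;
        $counts[$w] = isset($counts[$w]) ? $counts[$w] + 1 : 1;
    }
    arsort($counts); // highest count first
    // Numeric-looking words become integer keys, so cast back to strings.
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}
```

Calling `suggest_tags(array("分词", "程序", "分词", "的", "PHP", "分词", "程序"), 2)` should return `array("分词", "程序")`. A real tagger would probably also want a stop-word list, which is left out here.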
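The original post follows: http://blog.sina.com.cn/s/blog_5677bc54010000i5.html
Doing Chinese word segmentation in PHP is not a particularly wise move. :p
Below is a simple segmentation program I put together around a dictionary file I found online.
(Note: the dictionary file is in gdbm format; each key is a word and its value is the word's frequency, with about 40,000 common words.)
For the code, see http://www.shi8.com/out/support/art_316.txt
PHP code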
- <?php
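- // Simple Chinese word segmentation demo.
- // Assumes GBK/GB2312 text (two bytes per Chinese character, lead byte >= 0xa1)
- // and a gdbm dictionary opened through PHP's dba extension.
-
- // Timer helper: current time in seconds as a float.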
- function getmicrotime(){
- list($usec, $sec) = explode(" ",microtime());
- return ((float)$usec + (float)$sec);
- }
- $time_start = getmicrotime();
-
-
-
- class ch_dictionary {
- var $_id;
-
- // PHP 8 no longer treats a method named after the class as a constructor.
- function __construct($fname = "") {
- if ($fname != "") {
- $this->load($fname);
- }
- }
-
-
- function load($fname) {
- $this->_id = dba_popen($fname, "r", "gdbm");
- if (!$this->_id) {
- echo "failed to open the dictionary.($fname)<br>\n";
- exit;
- }
- }
-
- // Return the word's frequency, or -1 if it is not in the dictionary.
- function find($word) {
- $freq = dba_fetch($word, $this->_id);
- if (is_bool($freq)) $freq = -1;
- return $freq;
- }
- }
-
-
-
- class ch_word_split {
- var $_mb_mark_list;
- var $_word_maxlen;
- var $_dic;
- var $_ignore_mark;
-
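- function __construct() {
- // Full-width punctuation; these literals only match when this file is saved as GBK/GB2312.
- $this->_mb_mark_list = array(",", " ", "。", "!", "?", ":", "……", "、", "“", "”", "《", "》", "(", ")");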
- $this->_word_maxlen = 12;
- $this->_dic = NULL;
- $this->_ignore_mark = true;
- }
-
-
- function set_dic($fname) {
- $this->_dic = new ch_dictionary($fname);
- }
-
- function set_ignore_mark($set) {
- if (is_bool($set)) $this->_ignore_mark = $set;
- }
-
-
- function string_split($str, $func = "") {
- $ret = array();
-
- if ($func == "" || !function_exists($func)) $func = "";
-
- $len = strlen($str);
- $qtr = "";
-
- for ($i = 0; $i < $len; $i++) {
- $char = $str[$i];
-
- if (ord($char) < 0xa1) {
-
- if (!empty($qtr)) {
- $tmp = $this->_sen_split($qtr);
- $qtr = "";
-
- if ($func != "") call_user_func($func, $tmp);
- else $ret = array_merge($ret, $tmp);
- }
-
-
- if ($this->_is_alnum($char)) {
- do {
- if (($i+1) >= $len) break;
- $char2 = substr($str, $i + 1, 1);
- if (!$this->_is_alnum($char2)) break;
-
- $char .= $char2;
- $i++;
- } while (1);
-
- if ($func != "") call_user_func($func, array($char));
- else $ret[] = $char;
- }
- elseif ($char == ' ' || $char == "\t") {
-
- continue;
- }
- elseif (!$this->_ignore_mark) {
- if ($func != "") call_user_func($func, array($char));
- else $ret[] = $char;
- }
- }
- else {
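- // Multi-byte (GBK) character: consume its trailing byte as well.
- $i++;
- if ($i >= $len) break; // dangling lead byte at the end of the input
- $char .= $str[$i];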
-
- if (in_array($char, $this->_mb_mark_list)) {
- if (!empty($qtr)) {
- $tmp = $this->_sen_split($qtr);
- $qtr = "";
-
- if ($func != "") call_user_func($func, $tmp);
- else $ret = array_merge($ret, $tmp);
- }
-
- if (!$this->_ignore_mark) {
- if ($func != "") call_user_func($func, array($char));
- else $ret[] = $char;
- }
- }
- else {
- $qtr .= $char;
- }
- }
- }
-
- if (strlen($qtr) > 0) {
- $tmp = $this->_sen_split($qtr);
-
- if ($func != "") call_user_func($func, $tmp);
- else $ret = array_merge($ret, $tmp);
- }
-
-
- if ($func == "") {
- return $ret;
- }
- else {
- return true;
- }
- }
-
-
- function _sen_split($sen) {
- $len = strlen($sen) / 2;
- $ret = array();
-
- for ($i = $len - 1; $i >= 0; $i--) {
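- // Default: treat the single character at position $i as a word.
- $w = substr($sen, $i * 2, 2);
- $wlen = 1;
-
- // Extend leftwards and keep the candidate with the highest dictionary frequency.
- $lf = 0;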
- for ($j = 1; $j <= $this->_word_maxlen; $j++) {
- $o = $i - $j;
- if ($o < 0) break;
- $w2 = substr($sen, $o * 2, ($j + 1) * 2);
-
- $tmp_f = $this->_dic->find($w2);
-
- if ($tmp_f > $lf) {
- $lf = $tmp_f;
- $wlen = $j + 1;
- $w = $w2;
- }
- }
-
- $i = $i - $wlen + 1;
- array_push($ret, $w);
- }
-
- $ret = array_reverse($ret);
- return $ret;
- }
-
-
- function _is_alnum($char) {
- $ord = ord($char);
- if ($ord == 45 || $ord == 95 || ($ord >= 48 && $ord <= 57))
- return true;
- if (($ord >= 97 && $ord <= 122) || ($ord >= 65 && $ord <= 90))
- return true;
- return false;
- }
- }
-
-
-
- function call_back($ar) {
- foreach ($ar as $tmp) {
- echo $tmp . " ";
-
- }
- }
-
-
- $wp = new ch_word_split();
- $wp->set_dic("dic.db");
-
- if (!isset($_REQUEST['testdat']) || empty($_REQUEST['testdat'])) {
- $data = file_get_contents("sample.txt");
- }
- else {
- $data = & $_REQUEST['testdat'];
- }
-
-
- echo "<h3>Simple word segmentation demo</h3>\n";
- echo "<hr>\n";
- echo "Segmentation result (" . strlen($data) . " bytes): <br>\n<textarea cols=100 rows=10>\n";
-
-
- $wp->set_ignore_mark(false);
-
-
- $wp->string_split($data, "call_back");
-
- $time_end = getmicrotime();
- $time = $time_end - $time_start;
-
- echo "</textarea><br>\nTime taken for this run: $time seconds <br>\n";
- ?>
- <hr>
- <form method=post>
- You can also enter text in the box below and submit it to try out the segmentation:<br>
- <textarea name=testdat cols=100 rows=10></textarea><br>
- <input type=submit>
- </form>
- <hr>
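The matching step in `_sen_split` can be hard to follow because of the byte arithmetic, so here is a minimal sketch of the same idea under different assumptions: UTF-8 input instead of GBK, and a plain PHP array standing in for the gdbm dictionary. The function name `split_sentence` and the frequencies used with it are made up for illustration. Like the listing above, it scans from the end of the sentence and, for each position, keeps the candidate word with the highest frequency.

```php
<?php
// Sketch only: $dic maps word => frequency and replaces the gdbm lookup.
function split_sentence(string $sen, array $dic, int $maxlen = 12): array {
    // Split the UTF-8 string into an array of characters.
    $chars = preg_split('//u', $sen, -1, PREG_SPLIT_NO_EMPTY);
    $ret = array();

    for ($i = count($chars) - 1; $i >= 0; $i--) {
        // Default: the single character at position $i is a word on its own.
        $w = $chars[$i];
        $wlen = 1;
        $best = 0;

        // Try longer candidates that end at $i and start $j characters earlier.
        for ($j = 1; $j <= $maxlen; $j++) {
            $o = $i - $j;
            if ($o < 0) break;
            $cand = implode('', array_slice($chars, $o, $j + 1));
            $freq = isset($dic[$cand]) ? $dic[$cand] : -1; // -1: not in dictionary
            if ($freq > $best) {
                $best = $freq;
                $wlen = $j + 1;
                $w = $cand;
            }
        }

        $i -= $wlen - 1;  // skip the characters consumed by the chosen word
        $ret[] = $w;
    }

    // Words were collected right to left, so restore reading order.
    return array_reverse($ret);
}
```

With a toy dictionary such as `array("中文" => 100, "分词" => 80, "程序" => 60)`, `split_sentence("中文分词程序", $dic)` should yield `array("中文", "分词", "程序")`, and characters that match no dictionary word come back as single-character entries.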
Article quoted from: http://www.im286.net/viewthread.php?tid=1157015