一些建议： -holdon(again) @佛州华人论坛:佛州枫下论坛 The Rolia Forum of Florida

工作学习 / 学科技术讨论 / 感谢大家的回答. 接受批评, 再重写一次. SQL问题, 谢谢! -mynewproject222(mynewproject222); 2008-6-21 {1312} (#4515695@0)
本文发表在 rolia.net 枫下论坛Here is a question on how to combine keywords quickly.

Assume that following is a small part of a large static text file (we call it "key_file") that contains nearly one million lines.

30234804 highestpaysurveys.com
30234805 highestpaysurveys.org
30234806 highpaysurveys
30234807 highpaysurveys.com
30234808 hoem based
30234809 home automation market research
30234810 home based accounting jobs
30234811 home based employment
30234812 home based extra income
30234813 home based free jobs
30234814 home based internet
30234816 home based internet work
30234817 home based jobs

Now assume that I have keywords "home", "based", "free", "jobs" (this is a "keyword set"), how can I quickly combine them to get

30234813 home based free jobs
30234817 home based jobs

The principle is that the keyword records from the "key_file" are retrieved if and only if they occur in the keyword_set. Case and order of keywords are all ignored. For example, assume that the keyword_file aslo contains following
9999999 based home free jobs
then this keyword record should also be retrieved given the "keyword_set".
This is not database and there is no tables here. All I need to write C code to finish this project. So a good algorithm is really important.

Thanks so much!更多精彩文章及讨论，请光临枫下论坛 rolia.net

实际这是个算法问题,谢谢! -mynewproject222(mynewproject222); 2008-6-22 (#4515734@0)
这要看你是经常查同一个文件，还是只是查一次。也就是说，查100次，平均最快和查一次最快的算法是不一样的。还有你查“home","based",“job"那么包不包括"homebased job", "home-based job","home based jobs". -niu1986(只吃草的牛); 2008-6-22 (#4515745@0)

This file will be checked many times. It is not just one time. For your question, given keywords "home", "based" and "job", then "homebased job", "home-based job" and "home based jobs" should be all retrieved. -mynewproject222(mynewproject222); 2008-6-22 (#4516275@0)

First, this is not a SQL question. second, you still didn't make it clear. say you have 4 keyword: home based free jobs, should the matched record has all the keywords? or 3 of 4? or 2 of 4? or only one keyword is enough? -holdon(again); 2008-6-22 {120} (#4515796@0)
I already told you to use awk to parse static txt file. you can google 'awk manual' to check how to write awk script.

Retrieve Rule and About AWK -mynewproject222(mynewproject222); 2008-6-22 {718} (#4516278@0)
A recored is matched if and only if each words in the record can be found in the keyword set.
Take my old example. Given 4 keywords: home, based, free, jobs, then a record is said to "match" if and only if this record contains the 4 keywords. Order is ignored. We assume that all keywords and records are in lower case.

For awk: what do you mean exactly by using awk to parse static .txt file? awk is just a tool. We have many ways to parse this file. What's the PURPOSE of this? Do you mean to parse and load the txt file? Or do you mean to search directly with awk scripts? After parsing the file, what do you do next? How do you do the combining step?

Thanks so much! I really appreciate your replies!

if all 4 keywords are required, then "30234817 home based jobs" is not a match. awk is perfect tool to do this kind of job. for 1M lines text file, awk should be able to easily find all the records in 30 seconds. Is this fast enough for you? -holdon(again); 2008-6-22 (#4516386@0)

Thanks so much. Here it is. -mynewproject222(mynewproject222); 2008-6-22 {355} (#4516801@0)
Thanks for your correction. The retrieve Rule is "A recored is matched if and only if each words in the record can be found in the keyword set." So 30234817 home based jobs" is still a match.

For the running time, I am looking for a much shorter time than 30 seconds as this is a Web-based system. It really has a high requirement for process time.

一些建议： -holdon(again); 2008-6-23 {1199} (#4517289@0)
本文发表在 rolia.net 枫下论坛1.我觉得你目前的需求还不明确。你应该先明确定义需求，什么样的记录满足条件，什么不满足。目标不要太高，别想做成“智能”搜索。单复数（JOBS, JOB),动名形容词副词，时态，复合词断词(homebased -> home based ), 这些考虑进去就没法做了，需要语言学家。你目前的定义 "A recored is matched if and only if each words in the record can be found in the keyword set. 应该是可行的。 homebased jobs 不满足你目前的定义。

2. 找个支持全文检索的数据库，把文本导入数据库。用数据库功能帮你匹配。

3.没有全文检索的数据库，大概也可以这么做：
预处理文本文件，找出所有关键字，按字母序存入keywords table:
id keywords
3 home
1 based
2 free
4 jobs
。。。

建另一张表 records:
id keywords_number first_keywords keywords_set text
22222 4 1 [1 2 3 4 ] "home based free jobs"

注意home-based 可以在预处理时断词成 home based:
22223 4 1 [1 2 3 4 ] "home-based free jobs"

在keywords_number 和first_keywords上减索引。

用户输入keywords,先找对应keywords.id, 找不到就肯定没有匹配记录。假如都有，比如 "home based free jobs" -> [1,2,3,4]

如果都有， select * from records where keywords_number<=4 and first_keywords in ( 1,2,3,4)

然后去比较keywords_set . 这样做的话，正常情况几秒钟应该够了。更多精彩文章及讨论，请光临枫下论坛 rolia.net
http://recr

Thanks so much. I think this is a good approach. -mynewproject222(mynewproject222); 2008-6-23 {226} (#4517744@0)
Unfortunately I do not have a database to use. All I have to do is to use code to finish this task. Therefore I am looking for a good algorithm that can do the match quickly.

Thanks again! I really appreciate your replies!

有什么工具就用什么方法。目前看来，你没有数据库，所以问题与SQL无关。既然你寻求纯C的解决方法，regular expression应该是最方便的方法？下面链接供你参考。 -bdbs(不得百失); 2008-6-23 (#4518163@0)

Thanks a lot! -mynewproject222(mynewproject222); 2008-6-24 (#4519317@0)

It is quite easy. Details inside -cerboros_redux(Cerboros Reborn); 2008-6-23 {528} (#4518246@0)
First sort the keyword set and assign bits to each keyword. For example,

0 bit -> home
1 bit -> based
2 bit-> free
3 bit -> jobs/job
...

Now tokenize each line. For every token, compare it with keywords in the keyword set to create an integer value. For example:

"30234810 home based accounting jobs"

matches 0, 1 and 3 bits. So the integer is 0x0b.

Each query creates an integer. Calculate the bit-operator 'and' using the query integer and the line result. If the result is 1, it matches, otherwise not.

this is a interesting solution and could work if the keywords set is small. -holdon(again); 2008-6-23 {1381} (#4518387@0)
本文发表在 rolia.net 枫下论坛since there are 1M records, my guess is the keywords could have maybe 1000 words? so it will need about 120 bytes to save this big integer.

no on each records, you will need to do "and" on 120 bytes, probably equal to compare blank string to a 120 bytes string?

if the keywords set has 2000 words, then its 240 bytes, 1M times.

so this solution only work for small keywords set records.

for big set keywords, I think what you need to do is like a small full text index engine.

1. save all keywords to keywords array.
k[1]=air
k[2]=home ...

2. create a tree to save all records:

level 1:root->

level2 :2keywords records,3keywords records, 4 keywords records...

level3: first_keyword is 1, first_keyword is 2,.....

level4: all the records as a keywords array.
2333927 [1 3 4 5 ]
2333433 [1 3 4 7]

3. when start q search, convert keywords to number, "home based free job whatever" become 2 4 7 9, whatever is not in keywords list so automatically removed.

4. search tree on level2 2,3,4 branch only (this might limit records to 0.5M) , lever 3 with first_keyword is 2 4 7 9. if there are about 100 different first_keyword, then the result record is about 0.5M *4/100 = 20,000 records.

5. compare the 20,000 arrays with [2 4 7 9] to see if it's a sub set, since it's all number comparing with order, should be fast than string.更多精彩文章及讨论，请光临枫下论坛 rolia.net

that is interesting吗？我看简直就是脱了什么放什么。 -bdbs(不得百失); 2008-6-23 (#4519066@0)
Thanks so much! -mynewproject222(mynewproject222); 2008-6-24 (#4519327@0)

Thanks! this lets me think about bit operation. -mynewproject222(mynewproject222); 2008-6-24 (#4519328@0)

Build the key words in the key_file into a tree: -ruex(xeur); 2008-6-23 {515} (#4519191@0)
home --> based => 30234808 (the leaf)
|----> employment
|----> extra
|----> free ---> jobs => 30234813
|----> internet
... ...

highpaysurveys => 30234806
|---> .com => 30234807

具体根据你的FILE内容, 确定怎样分NODE: 空格不一定是划分下一级NODE的唯一 / 最有效字符. 极端情况, 甚至可以每个字母分一级, 最多不过26个ROOT, 但是要确保每行都不太长, 也就是说树不会太深.

Thanks so much! The key_file has one million lines of records. -mynewproject222(mynewproject222); 2008-6-24 {268} (#4519338@0)
Creating such trees should be possible, however, traversing these trees may take long time even I create Binary Search Tree.

As this is a Webbased system, I need to retrieve related information within 40 milliseconds. Therefore a fast algorithm is really important.

I really doubt you can do it in 40 ms. 1M records probably need at least 10M RAM after converted to integer array.40ms isn't even enough to read 10M RAM. -holdon(again); 2008-6-24 {353} (#4519472@0)
also the algorithm is decided by the data, so before you decide the algorithm, if you already got the 1M records, you really should collect some statistic data first. how many keywords it has？ how many first_keyword and how many last_keyword? if you really want to make it as fast as possible, you might even need to care the frequency of each keyword.

For the 40ms, what I meant is the faster is the better... thanks for your reminding words. -mynewproject222(mynewproject222); 2008-6-24 (#4520167@0)

#4518246的方法是可以考虑改良一下的 -newkid(newkid); 2008-6-24 {834} (#4520099@0)
原贴说到了tokenize each line. 如果这是动态算的，那么比较结束了答案也出来了，没有必要再化成bit再与运算。如果是静态先算好的，那么如果key words很多，很多位的bit运算也是不现实的。

改良的办法是考虑一个26位的bitmap。也可把不常用的字母映射到其他的做成24位。
取各关键词的第N个字母算出一个数。比如home,based,free,jobs. 取h,b,f,j算出第一个数；o,a,r 第二个数(出现多次的字母按一次算)

这个静态的key_file先按上述办法预处理，每行记录映射成N个数, 事先排序。

当得到一个keyword set时，对这个keyword set进行处理得到若干个数，再去和key_file生成的数进行与操作。

为了减少与操作，和利用排序，可以把某些操作等价为比大小: 假如keyword set有某个数是 01101000, 那么比 01000000 大的，比1000小的肯定都不是，可以先过滤掉。在keyword set生成的若干个数中取范围最小的，再去扫描key_file生成的数组。

这个思路是为了尽量减少字符串比较，离最终目的还有距离。

因此我上面说这个方法是脱裤子放X多此一举。你这改良一下，就是早早的脱好了等着。：） -bdbs(不得百失); 2008-6-24 (#4520180@0)

你有点堕落了哈，以前不这么粗俗的。 -newkid(newkid); 2008-6-24 (#4520202@0)

呵呵，现在不是流行通俗么？楼下用厕所演示OO很形象生动嘛。 -bdbs(不得百失); 2008-6-24 (#4520275@0)

罚抄“纪念白求恩”100遍 -newkid(newkid); 2008-6-24 (#4520305@0)

这个，有关联么？ -bdbs(不得百失); 2008-6-24 (#4521030@0)

哈哈，一看就是没抄过课文的80后。这是为了帮你成为一个高尚的人，一个纯粹的人，一个有道德的人，一个脱离了低级趣味的人！ -newkid(newkid); 2008-6-25 (#4521745@0)

俺有那么年轻就好了。：）不过确实没抄过课文，因为从小都是好学生。 -bdbs(不得百失); 2008-6-25 (#4521850@0)

如果key set单词数很少，连与操作也可以转换为若干次比较。比如5个词就是2^5-1=31次寻址。这个方法不能告诉你哪些记录“是”它只能告诉你哪些“可能是”并快速排除那些“不是”的记录 -newkid(newkid); 2008-6-24 (#4520308@0)

Thanks for all your replies! Let me think over your replies and see what I can do next :) -mynewproject222(mynewproject222); 2008-6-24 (#4520178@0)
不是太明白你干嘛呢，不过无论是用perl，awk，还是导入到数据库里面都比用C重新造轮子强些。因为不管你怎么写，估计性能上绝对超不过这几个现成的tool. -canadiantire(轮胎 - Bona fide Crm); 2008-6-24 (#4520321@0)

@Florida

一些建议：

Replies, comments and Discussions:

More Topics