Topic: Thanks for all the answers. Criticism accepted; rewriting it once more. SQL question, thanks! @佛州华人论坛:佛州枫下论坛 The Rolia Forum of Florida

Work and Study / Academic and Technical Discussion / Thanks for all the answers. Criticism accepted; rewriting it once more. SQL question, thanks! -mynewproject222(mynewproject222); 2008-6-21 {1312} (#4515695@0)

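This is actually an algorithm question, thanks! -mynewproject222(mynewproject222); 2008-6-22 (#4515734@0)
It depends on whether you will query the same file many times or only once. In other words, the algorithm with the best average time over 100 queries is not the same as the one that is fastest for a single query. Also, when you search for "home", "based", "job", should the results include "homebased job", "home-based job", "home based jobs"? -niu1986(只吃草的牛); 2008-6-22 (#4515745@0)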

This file will be checked many times, not just once. As for your question: given the keywords "home", "based" and "job", then "homebased job", "home-based job" and "home based jobs" should all be retrieved. -mynewproject222(mynewproject222); 2008-6-22 (#4516275@0)

First, this is not a SQL question. Second, you still haven't made it clear. Say you have 4 keywords: home based free jobs. Should a matched record have all the keywords? Or 3 of 4? Or 2 of 4? Or is a single keyword enough? I already told you to use awk to parse a static text file. You can google 'awk manual' to see how to write an awk script. -holdon(again); 2008-6-22 {120} (#4515796@0)

Retrieve Rule and About AWK -mynewproject222(mynewproject222); 2008-6-22 {718} (#4516278@0)

If all 4 keywords are required, then "30234817 home based jobs" is not a match. awk is a perfect tool for this kind of job. For a 1M-line text file, awk should easily find all the records within 30 seconds. Is this fast enough for you? -holdon(again); 2008-6-22 (#4516386@0)

Thanks so much. Here it is. Thanks for your correction. The retrieval rule is: "A record is matched if and only if each word in the record can be found in the keyword set." So "30234817 home based jobs" is still a match.

As for the running time, I am looking for something much shorter than 30 seconds, since this is a Web-based system with strict requirements on processing time. -mynewproject222(mynewproject222); 2008-6-22 {355} (#4516801@0)
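The retrieval rule stated above is a subset test: the words of a record must all appear in the keyword set. A minimal sketch of that rule as a plain linear scan, assuming each line has the form "ID word word ..." as in the example record from the thread. The function names and the second and third sample records are hypothetical, added only for illustration.

```python
def parse_record(line):
    """Split 'ID word word ...' into (id, set of words); format assumed from the thread."""
    parts = line.split()
    return parts[0], set(parts[1:])


def matching_ids(lines, keywords):
    """Return IDs of records whose words are all contained in the keyword set."""
    keyword_set = set(keywords)
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record_id, words = parse_record(line)
        # Rule from the post: every word in the record must be in the keyword set.
        if words and words <= keyword_set:
            result.append(record_id)
    return result


# Hypothetical sample data; only the first record appears in the thread.
records = [
    "30234817 home based jobs",
    "30234818 home based free jobs",
    "30234819 office jobs",
]
print(matching_ids(records, ["home", "based", "free", "jobs"]))
# prints ['30234817', '30234818']
```

This scan touches every record on every query, so it is a baseline for correctness rather than an answer to the speed requirement.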

Thanks so much. I think this is a good approach. Unfortunately, I do not have a database to use. I have to do this task entirely in code, so I am looking for a good algorithm that can do the matching quickly.

Thanks again! I really appreciate your replies! -mynewproject222(mynewproject222); 2008-6-23 {226} (#4517744@0)

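Use whatever method fits the tools you have. As things stand, you have no database, so the problem has nothing to do with SQL. Since you are looking for a pure C solution, regular expressions are probably the most convenient approach? The link below is for your reference. -bdbs(不得百失); 2008-6-23 (#4518163@0)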
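The linked material is not reproduced in this listing, so the following is only a rough sketch of how a regular expression could express the retrieval rule from the thread. It uses Python's `re` module rather than C, and assumes records of the form "ID word word ..."; `build_pattern` and the sample lines are hypothetical.

```python
import re


def build_pattern(keywords):
    """Compile a pattern matching 'ID' followed only by words from the keyword set."""
    # Escape each keyword so it is matched literally, then join into an alternation.
    alternation = "|".join(re.escape(word) for word in keywords)
    # Group 1 captures the numeric ID; every following word must be a keyword.
    return re.compile(rf"(\d+)(?:\s+(?:{alternation}))+")


pattern = build_pattern(["home", "based", "free", "jobs"])

for line in ["30234817 home based jobs", "30234819 office jobs"]:
    match = pattern.fullmatch(line)  # the whole line must match, not just a prefix
    print(line, "->", match.group(1) if match else None)
```

Because `fullmatch` requires whitespace between words, a line such as "30234820 homebased jobs" would not match this pattern; handling such variants would need a normalization step that the thread does not specify.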

It is quite easy. Details inside -cerboros_redux(Cerboros Reborn); 2008-6-23 {528} (#4518246@0)

This is an interesting solution and could work if the keyword set is small. -holdon(again); 2008-6-23 {1381} (#4518387@0)

Thanks! This makes me think about bit operations. -mynewproject222(mynewproject222); 2008-6-24 (#4519328@0)
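The details of the proposed solution are not shown in this listing, so this is only one way a bit-operation approach might look, not necessarily what was posted. Each distinct word gets one bit, each record is precomputed into an integer mask, and a record matches when it has no bit outside the query mask. All names and the sample records other than the first are hypothetical.

```python
def build_index(lines):
    """Assign one bit per distinct word and precompute a bitmask for each record."""
    bit_of = {}  # word -> bit position
    index = []   # list of (record_id, mask)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and records without words
        mask = 0
        for word in parts[1:]:
            bit = bit_of.setdefault(word, len(bit_of))
            mask |= 1 << bit
        index.append((parts[0], mask))
    return bit_of, index


def query(bit_of, index, keywords):
    """Return IDs whose word mask has no bit outside the query mask."""
    query_mask = 0
    for word in keywords:
        if word in bit_of:  # a keyword seen in no record cannot affect the result
            query_mask |= 1 << bit_of[word]
    return [rid for rid, mask in index if (mask & ~query_mask) == 0]


records = [
    "30234817 home based jobs",
    "30234818 home based free jobs",
    "30234819 office jobs",
]
bit_of, index = build_index(records)
print(query(bit_of, index, ["home", "based", "jobs"]))
```

Python integers grow as needed; in C the mask would have to fit a fixed-width integer or a bit array, which is presumably why the approach is said to suit a small keyword set.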

Build the key words in the key_file into a tree: -ruex(xeur); 2008-6-23 {515} (#4519191@0)
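The body of this post is not included in the listing, so the sketch below is an assumption about what such a tree could look like rather than the poster's design. It stores each record's words in sorted order along a path of a trie, then answers a query by walking only the branches labelled with query keywords. The names and extra sample records are hypothetical.

```python
def new_node():
    return {"ids": [], "children": {}}


def build_trie(lines):
    """Insert each record's sorted, de-duplicated words as a path in a trie."""
    root = new_node()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and records without words
        node = root
        for word in sorted(set(parts[1:])):
            node = node["children"].setdefault(word, new_node())
        node["ids"].append(parts[0])
    return root


def find_matches(root, keywords):
    """Collect IDs of records whose word set is a subset of the keywords."""
    keys = sorted(set(keywords))
    result = []

    def walk(node, start):
        result.extend(node["ids"])
        # Paths are sorted, so only keywords after the current one can follow.
        for i in range(start, len(keys)):
            child = node["children"].get(keys[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return result


records = [
    "30234817 home based jobs",
    "30234818 home based free jobs",
    "30234819 office jobs",
]
trie = build_trie(records)
print(sorted(find_matches(trie, ["home", "based", "free", "jobs"])))
```

The walk only descends into branches whose labels are query keywords, so its cost depends on the number of keywords in the query rather than on the total number of records.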

Thanks so much! The key_file has one million lines of records. Creating such trees should be possible; however, traversing them may take a long time even if I build a binary search tree.

As this is a Web-based system, I need to retrieve the related information within 40 milliseconds, so a fast algorithm is really important. -mynewproject222(mynewproject222); 2008-6-24 {268} (#4519338@0)

I really doubt you can do it in 40 ms. 1M records will probably need at least 10 MB of RAM after being converted to an integer array, and 40 ms isn't even enough to read 10 MB of RAM. Also, the algorithm is determined by the data, so before you settle on an algorithm, if you already have the 1M records, you really should collect some statistics first. How many keywords does the file have? How many distinct first_keyword and last_keyword values? If you really want to make it as fast as possible, you might even need to take the frequency of each keyword into account. -holdon(again); 2008-6-24 {353} (#4519472@0)
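The statistics suggested above can be gathered in a single pass over the file. A minimal sketch, assuming records of the form "ID word word ..." and reading "first_keyword" and "last_keyword" as the first and last word of each record, which is an interpretation of the post. The function name and sample data are hypothetical.

```python
from collections import Counter


def keyword_stats(lines):
    """Count overall word frequency plus first-word and last-word frequency."""
    frequency = Counter()  # how often each word occurs across all records
    first = Counter()      # how often each word is the first word of a record
    last = Counter()       # how often each word is the last word of a record
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and records without words
        words = parts[1:]
        frequency.update(words)
        first[words[0]] += 1
        last[words[-1]] += 1
    return frequency, first, last


records = [
    "30234817 home based jobs",
    "30234818 home based free jobs",
    "30234819 office jobs",
]
frequency, first, last = keyword_stats(records)
print("distinct keywords:", len(frequency))
print("distinct first words:", len(first))
print("distinct last words:", len(last))
print("most common:", frequency.most_common(3))
```

For a real file, the list could be replaced by an open file object, since the function only iterates over lines.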

As for the 40 ms, what I meant is that the faster, the better... Thanks for the reminder. -mynewproject222(mynewproject222); 2008-6-24 (#4520167@0)

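Haha, clearly a post-80s kid who never had to copy out textbook passages by hand. This is to help you become a noble person, a pure person, a moral person, a person who has risen above vulgar interests! -newkid(newkid); 2008-6-25 (#4521745@0)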

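If the key set has very few words, even the AND operation can be turned into a handful of comparisons. For example, 5 words means 2^5-1=31 lookups. This method cannot tell you which records "are" matches; it can only tell you which ones "may be" and quickly rule out the ones that "are not". -newkid(newkid); 2008-6-24 (#4520308@0)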
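One reading of the 2^5-1=31 figure is that every non-empty subset of the query keywords is looked up in a table keyed by a record's word set. The sketch below follows that reading, which is an assumption about the post, and uses hypothetical names and sample data. Here the key is the exact word set, so hits are exact; if the key were a compact hash or signature instead, hits would only be candidates that still need verification, which would fit the "may be" remark.

```python
from itertools import combinations


def build_table(lines):
    """Map each record's word set to the IDs of the records that have exactly it."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and records without words
        table.setdefault(frozenset(parts[1:]), []).append(parts[0])
    return table


def lookup(table, keywords):
    """Probe the table once per non-empty subset of the keywords (2^n - 1 probes)."""
    keys = sorted(set(keywords))
    hits = []
    probes = 0
    for size in range(1, len(keys) + 1):
        for subset in combinations(keys, size):
            probes += 1
            hits.extend(table.get(frozenset(subset), []))
    return hits, probes


records = [
    "30234817 home based jobs",
    "30234818 home based free jobs",
    "30234819 office jobs",
]
table = build_table(records)
hits, probes = lookup(table, ["home", "based", "free", "jobs"])
print(sorted(hits), probes)
```

The number of probes doubles with each extra keyword, so this only pays off when queries are short, as the post says.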

Thanks for all your replies! Let me think over your replies and see what I can do next :) -mynewproject222(mynewproject222); 2008-6-24 (#4520178@0)
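I don't quite understand what you are trying to do, but whether you use perl, awk, or import the data into a database, any of them beats reinventing the wheel in C. No matter how you write it, the performance probably won't beat these ready-made tools. -canadiantire(轮胎 - Bona fide Crm); 2008-6-24 (#4520321@0)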
