Thanks for everyone's answers. Criticism accepted; rewriting the question once more. SQL question, thanks!
-mynewproject222(mynewproject222);
2008-6-21
{1312}
(#4515695@0)
-
Actually this is an algorithm question, thanks!
-mynewproject222(mynewproject222);
2008-6-22
(#4515734@0)
-
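It depends on whether you query the same file often or only once. In other words, the algorithm that is fastest on average over 100 queries is not the same as the one that is fastest for a single query. Also, when you search for "home", "based", "job", does that include "homebased job", "home-based job", and "home based jobs"?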
-niu1986(只吃草的牛);
2008-6-22
(#4515745@0)
-
This file will be checked many times, not just once. As for your question: given the keywords "home", "based" and "job", then "homebased job", "home-based job" and "home based jobs" should all be retrieved.
-mynewproject222(mynewproject222);
2008-6-22
(#4516275@0)
-
First, this is not a SQL question. Second, you still haven't made it clear. Say you have 4 keywords: home based free jobs. Should a matched record have all the keywords? Or 3 of 4? Or 2 of 4? Or is a single keyword enough? I already told you to use awk to parse a static text file. You can google 'awk manual' to see how to write an awk script.
-holdon(again);
2008-6-22
{120}
(#4515796@0)
-
Retrieval rule, and about awk
-mynewproject222(mynewproject222);
2008-6-22
{718}
(#4516278@0)
-
If all 4 keywords are required, then "30234817 home based jobs" is not a match. awk is a perfect tool for this kind of job. For a 1M-line text file, awk should easily find all the records within 30 seconds. Is this fast enough for you?
-holdon(again);
2008-6-22
(#4516386@0)
-
Thanks so much. Here it is. Thanks for your correction. The retrieval rule is "A record is matched if and only if each word in the record can be found in the keyword set." So "30234817 home based jobs" is still a match.
As for the running time, I am looking for something much shorter than 30 seconds, as this is a Web-based system. It has a really strict requirement on processing time.
-mynewproject222(mynewproject222);
2008-6-22
{355}
(#4516801@0)
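
The retrieval rule quoted above is a subset test: a record matches when its set of words is contained in the keyword set. A minimal sketch in Python, assuming each line of the file is a numeric ID followed by space-separated words and that words are compared exactly after lowercasing (the helper names `parse_record` and `matches` are made up for illustration, and variants such as "homebased" or "home-based" are not handled here):

```python
def parse_record(line):
    """Split 'ID word word ...' into (record_id, set_of_words)."""
    record_id, *words = line.split()
    return record_id, {w.lower() for w in words}


def matches(record_words, keywords):
    # The rule: every word in the record must appear in the keyword set.
    return record_words <= keywords


keywords = {"home", "based", "free", "jobs"}
record_id, words = parse_record("30234817 home based jobs")
print(record_id, matches(words, keywords))
```

Under this rule the record does not need to use every keyword; it only must not contain a word outside the keyword set.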
-
Some suggestions:
-holdon(again);
2008-6-23
{1199}
(#4517289@0)
-
Thanks so much. I think this is a good approach. Unfortunately I do not have a database to use. All I can do is write code to finish this task. Therefore I am looking for a good algorithm that can do the match quickly.
Thanks again! I really appreciate your replies!
-mynewproject222(mynewproject222);
2008-6-23
{226}
(#4517744@0)
-
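Use whatever method fits the tools you have. As things stand, you have no database, so the problem has nothing to do with SQL. Since you are looking for a pure C solution, wouldn't regular expressions be the most convenient approach? The link below is for your reference.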
-bdbs(不得百失);
2008-6-23
(#4518163@0)
-
Thanks a lot!
-mynewproject222(mynewproject222);
2008-6-24
(#4519317@0)
-
It is quite easy. Details inside
-cerboros_redux(Cerboros Reborn);
2008-6-23
{528}
(#4518246@0)
-
This is an interesting solution and could work if the keyword set is small.
-holdon(again);
2008-6-23
{1381}
(#4518387@0)
-
Is that really interesting? To me it is simply taking something off just to let something out.
-bdbs(不得百失);
2008-6-23
(#4519066@0)
-
Thanks so much!
-mynewproject222(mynewproject222);
2008-6-24
(#4519327@0)
-
Thanks! This makes me think about bit operations.
-mynewproject222(mynewproject222);
2008-6-24
(#4519328@0)
-
Build the key words in the key_file into a tree:
-ruex(xeur);
2008-6-23
{515}
(#4519191@0)
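
The body of this post is not shown, so the following is only one possible reading of "build the keywords into a tree", not necessarily what the poster had in mind. The sketch assumes each record's words are sorted and inserted into a trie; a query then follows only the branches labelled with one of the keywords, so it never touches records containing a foreign word. The names `build_trie` and `search` and the nested-dict layout are assumptions made for illustration.

```python
def build_trie(records):
    """records: iterable of (record_id, words). Returns a nested-dict trie."""
    root = {"ids": [], "next": {}}
    for record_id, words in records:
        node = root
        for word in sorted(set(words)):
            node = node["next"].setdefault(word, {"ids": [], "next": {}})
        node["ids"].append(record_id)
    return root


def search(root, keywords):
    """Return ids of records whose words all belong to `keywords`."""
    keywords = sorted(set(keywords))
    found = []
    stack = [(root, 0)]
    while stack:
        node, start = stack.pop()
        found.extend(node["ids"])
        # Record words were inserted in sorted order, so a child can only be
        # labelled with a keyword that sorts after the one that led here.
        for i in range(start, len(keywords)):
            child = node["next"].get(keywords[i])
            if child is not None:
                stack.append((child, i + 1))
    return found


trie = build_trie([
    ("1", ["home", "based", "jobs"]),
    ("2", ["free", "jobs"]),
    ("3", ["home", "office"]),
])
print(sorted(search(trie, ["home", "based", "free", "jobs"])))
```

With k keywords the search visits at most 2^k trie nodes regardless of how many records the file holds, which is why this kind of structure only pays off when the keyword set is small.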
-
Thanks so much! The key_file has one million lines of records. Creating such trees should be possible; however, traversing them may take a long time even if I build a binary search tree.
As this is a Web-based system, I need to retrieve the related information within 40 milliseconds. Therefore a fast algorithm is really important.
-mynewproject222(mynewproject222);
2008-6-24
{268}
(#4519338@0)
-
I really doubt you can do it in 40 ms. 1M records probably need at least 10 MB of RAM after being converted to an integer array, and 40 ms isn't even enough to read 10 MB of RAM. Also, the algorithm is determined by the data, so before you decide on the algorithm, if you already have the 1M records, you really should collect some statistics first. How many keywords are there? How many first_keyword and how many last_keyword? If you really want to make it as fast as possible, you might even need to care about the frequency of each keyword.
-holdon(again);
2008-6-24
{353}
(#4519472@0)
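
The statistics suggested above can be gathered in one pass over the file. A minimal sketch, assuming each line is an ID followed by space-separated words and reading "first_keyword" and "last_keyword" as the first and last word of each record (that reading, the function name `keyword_stats`, and the sample lines are assumptions, not something the thread specifies):

```python
from collections import Counter


def keyword_stats(lines):
    """Count word frequencies overall and in first/last position."""
    all_words, first_words, last_words = Counter(), Counter(), Counter()
    for line in lines:
        _record_id, *words = line.split()
        if not words:
            continue  # a record with an ID but no words
        all_words.update(words)
        first_words[words[0]] += 1
        last_words[words[-1]] += 1
    return all_words, first_words, last_words


sample = [
    "30234817 home based jobs",
    "30234818 free jobs",
    "30234819 home office",
]
all_words, first_words, last_words = keyword_stats(sample)
print(len(all_words), "distinct words")
print(all_words.most_common(2))
```

The number of distinct words bounds the size of any word-to-integer table, and the frequency counts show which words are too common to be useful for narrowing a search.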
-
As for the 40 ms, what I meant is that the faster, the better... Thanks for the reminder.
-mynewproject222(mynewproject222);
2008-6-24
(#4520167@0)
-
The method in #4518246 could be improved a bit
-newkid(newkid);
2008-6-24
{834}
(#4520099@0)
-
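That is why I said above that this method is like taking off your trousers to f*rt, a superfluous step. Your improvement just means taking them off well in advance and waiting. :)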
-bdbs(不得百失);
2008-6-24
(#4520180@0)
-
You have gone a bit downhill, haven't you? You did not use to be this crude.
-newkid(newkid);
2008-6-24
(#4520202@0)
-
Heh, isn't plain talk in fashion these days? The thread below that uses a toilet to demonstrate OO is quite vivid.
-bdbs(不得百失);
2008-6-24
(#4520275@0)
-
Your punishment: copy out "In Memory of Norman Bethune" 100 times
-newkid(newkid);
2008-6-24
(#4520305@0)
-
Is that even related?
-bdbs(不得百失);
2008-6-24
(#4521030@0)
-
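Haha, clearly someone born in the 80s who never had to copy out texts. This is to help you become a noble person, a pure person, a person of moral integrity, a person free from vulgar interests!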
-newkid(newkid);
2008-6-25
(#4521745@0)
-
I wish I were that young. :) But it is true that I never had to copy out texts, because I was always a good student.
-bdbs(不得百失);
2008-6-25
(#4521850@0)
-
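If the key set has very few words, even the AND operation can be replaced by a handful of lookups. For example, 5 words means 2^5-1=31 lookups. This method cannot tell you which records "are" matches; it can only tell you which "might be", and quickly rule out those that are "not".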
-newkid(newkid);
2008-6-24
(#4520308@0)
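
The body of #4518246 is not shown, so the following is a guess at the bit-operation idea, written as a sketch under stated assumptions. Each word is hashed to one bit of a 64-bit signature (the width and the use of `zlib.crc32` are arbitrary choices here), records are grouped by signature, and a query looks up every non-empty sub-mask of its own signature, at most 2^k - 1 lookups for k keywords. Because different words can hash to the same bit, a signature hit only means "might match", so each candidate is verified with an exact set comparison.

```python
import zlib
from collections import defaultdict

BITS = 64  # signature width; an arbitrary choice for this sketch


def word_bit(word):
    # Deterministic hash of a word onto one of BITS bit positions.
    return 1 << (zlib.crc32(word.encode("utf-8")) % BITS)


def signature(words):
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


def build_index(records):
    """Map signature -> list of (record_id, word_set)."""
    index = defaultdict(list)
    for record_id, words in records:
        index[signature(words)].append((record_id, set(words)))
    return index


def search(index, keywords):
    keywords = set(keywords)
    mask = signature(keywords)
    hits = []
    sub = mask
    while sub:  # visits every non-empty sub-mask of `mask`
        for record_id, words in index.get(sub, ()):
            if words <= keywords:  # exact check removes false positives
                hits.append(record_id)
        sub = (sub - 1) & mask
    return hits


index = build_index([
    ("1", ["home", "based", "jobs"]),
    ("2", ["free", "jobs"]),
    ("3", ["home", "office"]),
])
print(sorted(search(index, ["home", "based", "free", "jobs"])))
```

The cost per query depends on the number of keywords rather than on the number of records, which matches the remark that the approach suits small key sets: 5 keywords give 31 lookups, while 20 keywords would already give about a million.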
-
Thanks for all your replies! Let me think over your replies and see what I can do next :)
-mynewproject222(mynewproject222);
2008-6-24
(#4520178@0)
-
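I don't quite understand what you are trying to do, but whether you use perl, awk, or import the data into a database, any of them beats reinventing the wheel in C. However you write it, I doubt your performance will beat these ready-made tools.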
-canadiantire(轮胎 - Bona fide Crm);
2008-6-24
(#4520321@0)