-
Work & Study / Technical Discussion / Anyone here building search engines? Let's compare notes
-integ(空谷清音);
2005-11-20
(#2618539@0)
-
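I've never built a search engine; it's too deep for me. But I have played around with the idea: I wrote a program that downloads all of a site's HTML as one giant string and picks out the email addresses, the various links, and so on. If the site is XHTML it gets much easier, because XSLT can pull out the nodes of interest directly. I'd guess Google and MSN Search work on a similar principle, periodically sending this kind of web robot to follow every link on a site. What does the original poster think? I'd be glad to hear it.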
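A minimal sketch of the "treat the page as one big string" idea, assuming two deliberately simple regular expressions (they are illustrative, not a complete email or URL grammar). The page is an inline sample here; in a real run it would be fetched over the network.

```python
import re

# Deliberately simple patterns: good enough for a toy, not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def extract_emails_and_links(html: str) -> tuple[list[str], list[str]]:
    """Return (emails, links) found in one HTML string, de-duplicated in order."""
    emails = list(dict.fromkeys(EMAIL_RE.findall(html)))
    links = list(dict.fromkeys(HREF_RE.findall(html)))
    return emails, links


# Sample input standing in for a downloaded page.
page = (
    '<a href="/about.html">About</a> '
    "<A HREF='mailto:bob@example.com'>bob@example.com</A>"
)
emails, links = extract_emails_and_links(page)
print(emails)
print(links)
```

Regular expressions miss unquoted attributes and links built by scripts, so this only covers the simple case described above.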
-binghongcha76(一只大猫);
2005-11-21
{112}
(#2619652@0)
-
I am new to XML/XSLT. As far as I know, XSLT is for translating XML to HTML. Can it also manipulate HTML/XHTML? Is XHTML an XML-compatible HTML? Thanks.
-647i(-);
2005-11-21
(#2620864@0)
-
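That's right. To be more precise, I should say XPath/XSLT. An XHTML site is HTML written as standard XML, so each page is itself an XML document that conforms to the W3C standard. That means I can use XPath to find the nodes I care about, and then use XSLT to turn those nodes into whatever other XML format I need. These steps all happen automatically during the XSLT transformation, because the XPath expressions that locate the XHTML nodes are evaluated as part of the transformation.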
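A rough illustration of the find-and-reshape step, written in Python rather than XSLT because the Python standard library has no XSLT processor. `xml.etree.ElementTree` supports only a subset of XPath, and the sample page and the `links`/`link` output format are made up for the example.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so XPath steps need a prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>See <a href="http://example.com/a">A</a> and <a href="/b">B</a>.</p>
    <div class="contact"><a href="mailto:me@example.com">mail</a></div>
  </body>
</html>"""

root = ET.fromstring(page)

# ".//x:a" selects every <a> element anywhere below the root.
hrefs = [a.get("href") for a in root.findall(".//x:a", XHTML_NS)]

# A predicate narrows the match to links inside <div class="contact">.
contact = [
    a.get("href")
    for a in root.findall(".//x:div[@class='contact']/x:a", XHTML_NS)
]

# Reshape the selected nodes into a different (hypothetical) XML format.
out = ET.Element("links")
for href in hrefs:
    ET.SubElement(out, "link").text = href
result = ET.tostring(out, encoding="unicode")
print(result)
```

An XSLT stylesheet would express the same selection with `match`/`select` attributes and emit the new format from templates.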
-binghongcha76(一只大猫);
2005-11-22
{90}
(#2621310@0)
-
how about DOM?
-647i(-);
2005-11-22
(#2621317@0)
-
You mean XHTML can be transformed to HTML by XSLT/XPath. How do I translate HTML to XML/XHTML? Is there any translation tool or validation tool?
-647i(-);
2005-11-22
(#2621319@0)
-
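My understanding of DOM is that it's a container: reading an XML file into a DOM produces a tree-structured object that you can then operate on, including searching with XPath and adding, modifying, or deleting nodes. All of this is easy to do in JavaScript or .NET.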
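The same find/add/modify/delete operations can be sketched with Python's `xml.dom.minidom`, used here only as a stand-in for the JavaScript and .NET DOM APIs mentioned above. minidom has no XPath support, so the lookup uses `getElementsByTagName`; the sample document is invented.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<books><book id='1'>XML</book><book id='2'>XSLT</book></books>"
)
root = doc.documentElement

# Find: collect elements by tag name.
books = root.getElementsByTagName("book")

# Add: create a new element with a text child and append it.
new_book = doc.createElement("book")
new_book.setAttribute("id", "3")
new_book.appendChild(doc.createTextNode("XPath"))
root.appendChild(new_book)

# Modify: change the text of the first book.
books[0].firstChild.data = "XML 1.0"

# Delete: remove the second book from its parent.
root.removeChild(books[1])

result = root.toxml()
print(result)
```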
-binghongcha76(一只大猫);
2005-11-22
(#2621327@0)
-
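http://www.toronto1.biz/xml The address above has an XML DOM manipulation SDK that I wrote. It supports DTD validation, DOM operations, and Chinese GB2312 encoding. It's written in C and comes with a reference manual.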
-walacato(walacato);
2005-11-22
(#2622454@0)
-
I took a look, nice work. Do you work at that NetSoft company?
-binghongcha76(一只大猫);
2005-11-22
(#2622968@0)
-
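In theory you can of course turn XHTML into plain HTML, but I don't see the point. As for converting HTML to XHTML with XSLT, I don't know how to do that either, because HTML files contain markup that isn't well-formed XML, such as <br>, and the XSLT processor's parser reports an error when it tries to read such a file. Conversion tools certainly exist, but I haven't used any. I only played with XHTML out of curiosity back then and never dug into it. You could try typing "how to convert HTML to XHTML" into Google.
In the latest VS2005, all web sites already use XHTML by default.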
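The <br> problem is easy to reproduce with any strict XML parser. A small check using Python's `xml.etree.ElementTree` (the helper name `is_well_formed` is made up for this example):

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """True if an XML parser accepts the markup, False if it raises ParseError."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


html_style = "<p>line one<br>line two</p>"     # <br> is never closed
xhtml_style = "<p>line one<br />line two</p>"  # self-closing, valid XML

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```

The first string fails because the parser reaches </p> while <br> is still open; the second parses because <br /> closes itself.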
-binghongcha76(一只大猫);
2005-11-22
{191}
(#2621328@0)
-
Use TIDY; it can convert any HTML to XHTML.
-schen(睹往睹来.非赌徒也!);
2005-11-22
(#2621816@0)
-
XSLT isn't only for turning XML into HTML. It is a markup language for transforming one XML format into any other format (text, XML, HTML, etc.).
-binghongcha76(一只大猫);
2005-11-22
(#2621314@0)
-
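This is called a CRAWLER or SPIDER. Fault tolerance is the key when building one: whether the input is HTML/XHTML or utter garbage (many sites' HTML is a mess), you have to parse it correctly, or at the very least not CRASH. Some sites also generate their links dynamically with JS, so you have to INVOKE the JS to get the result.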
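One way to get the "at least don't crash" behaviour is a lenient tag-soup parser instead of an XML parser. A sketch using Python's `html.parser` (the `LinkCollector` and `safe_links` names and the sample markup are hypothetical; links generated by JS are not handled here):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href=...> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def safe_links(base_url: str, html: str) -> list[str]:
    """Return whatever links could be extracted; never raise on bad markup."""
    collector = LinkCollector(base_url)
    try:
        collector.feed(html)
        collector.close()
    except Exception:
        # A crawler should skip a bad page, not die on it.
        pass
    return collector.links


# Unclosed tags, an unquoted attribute, and stray end tags.
broken = '<html><body><p>unclosed <b>tags <a href="/x">x</a><A HREF=y.html>y</td></div>'
links = safe_links("http://example.com/dir/", broken)
print(links)
```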
-google2002(Google);
2005-11-23
(#2624512@0)
-
PM sent.
-xfile(猪博士◎Joobs);
2005-11-21
(#2620190@0)
-
Thank everyone for your replies!
-integ(空谷清音);
2005-11-21
(#2620561@0)
-
MARK
-647i(-);
2005-11-21
(#2620704@0)
-
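It mainly depends on scale. I played with Lucene and Nutch for a while. Getting the features working isn't a big problem, but handling a large volume of pages and a large volume of queries takes a lot of care.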
-benlin(默默向上游);
2005-11-22
{106}
(#2620962@0)
-
Bandwidth is a matter of money. Internally I use Apache HTTP (acting only as a load dispatcher) +
clustered application servers with session affinity & failover enabled +
(parallel) database server.
Scalability shouldn't be a problem: I can add many more app servers and DB servers.
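To make "session affinity & failover" concrete, here is a toy dispatcher in Python. It is not how Apache implements it; the `Dispatcher` class, the server names, and the CRC32 hashing choice are all assumptions for illustration, and replicating session state between servers is left out.

```python
import zlib


class Dispatcher:
    """Toy load dispatcher: sticky sessions, failover to the next healthy server."""

    def __init__(self, servers: list[str]):
        self.servers = list(servers)
        self.healthy = set(servers)

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def route(self, session_id: str) -> str:
        # Affinity: a stable hash sends the same session to the same server.
        start = zlib.crc32(session_id.encode()) % len(self.servers)
        # Failover: walk forward from the preferred server until one is healthy.
        for offset in range(len(self.servers)):
            candidate = self.servers[(start + offset) % len(self.servers)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy application servers")


dispatcher = Dispatcher(["app1", "app2", "app3"])
preferred = dispatcher.route("session-42")
dispatcher.mark_down(preferred)
fallback = dispatcher.route("session-42")
print(preferred, fallback)
```

Adding an app server here just means a longer `servers` list, although with plain modulo hashing most sessions would be remapped when the list changes.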
-integ(空谷清音);
2005-11-22
{213}
(#2622440@0)
-
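That isn't a search engine at all; at most it's an ordinary web site running RDBMS TEXT SEARCH in the back end, with nothing technically demanding about it. In a good search engine, everything from the front-end HTTP SERVER to the WEB SPIDER to the text PARSER and DISPATCHER should be written in-house, and for now that can only be done in C/C++.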
-google2002(Google);
2005-11-23
(#2624503@0)
-
Not necessarily. Java has Lucene, though it's aimed at the small-to-medium enterprise level...
-bjrenzx1(机器卡);
2005-11-23
(#2624571@0)
-
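Heh, in my mind only engines that serve INTERNET users, like GOOGLE and BAIDU, count as search engines. With a lightweight HTTP SERVER you write yourself in C, one PC can sustain 10,000 concurrent connections; I doubt JAVA can.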
-google2002(Google);
2005-11-23
(#2624627@0)
-
True, but their patented algorithms, architecture, and so on won't be disclosed... unlike Lucene, where things are public.
-bjrenzx1(机器卡);
2005-11-24
(#2624719@0)
-
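If I designed a search engine now, my word segmentation algorithm and RANK wouldn't match Google's, but my distributed design wouldn't be much worse than theirs.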
-google_abcd(-1);
2005-11-24
(#2625882@0)
-
You mean rewriting distributed transaction services, failover, and load balancing across nodes in C++ from scratch, including the communication protocols on top of TCP/IP? If so, you are really somebody.
-flipper_duckball(忘带枪的战士);
2005-11-25
(#2627718@0)
-
I do this stuff every day. Actually, once you get involved in one such project you won't think it is very hard (however, I am not able to implement distributed transactions; too hard). In fact, I was doing the same thing in China before. That is why I got a good job at one of the best IT companies in Canada just 3 months after I landed here last year.
-googleabcd(古狗);
2005-11-25
{173}
(#2627843@0)
-
Can you tell me about any commercial products you are involved in? Actually, I'm interested in the core implementation of failover and load balancing. Do you use a third-party product to implement distributed replicas, some group technology with distributed objects like JGroup, or do you write it yourself based on sockets and multicast?
-flipper_duckball(忘带枪的战士);
2005-11-25
{269}
(#2627874@0)
-
We don't use any third-party library in our load-balance/failover design. All the code is written by ourselves using C/C++, sockets, and broadcast on Unix/Linux platforms.
-googleabcd(古狗);
2005-11-26
(#2628666@0)
-
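That's not right. I just learned this at work today: if you apply for a patent, the algorithm gets published; others just aren't allowed to use it.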
-naug(xiaoxiao);
2005-12-2
(#2641012@0)
-
You're the impressive one! If you had named yourself Google1995, the bosses of Google would be you rather than Sergey Brin and Larry Page. Just kidding.
-integ(空谷清音);
2005-11-24
{104}
(#2626046@0)
-
I am developing a small search engine in my spare time. But it is just for fun :)
-googleabcd(古狗);
2005-11-25
(#2627845@0)
-
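The most impressive thing about Google isn't the search engine but their internal file system.
http://labs.google.com/papers/gfs-sosp2003.pdf
They cluster cheap PCs together for distributed parallel computing, and whenever a node fails, its tasks can be switched over to another node.
Search for "google file system" to find more material.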
-benlin(默默向上游);
2005-11-25
{231}
(#2626968@0)
-
what are u going to do?
-647i(-);
2005-11-22
(#2622687@0)
-
Mining and search
-integ(空谷清音);
2005-11-24
(#2626038@0)
-
I am doing some work on it. What's your scale: GB, 100 GB, or TB level?
-647i(-);
2005-11-24
(#2626664@0)
-
An introduction to Lucene, a Java-based full-text indexing engine: http://www.chedong.com/tech/lucene.html
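For anyone new to full-text indexing, the core data structure is an inverted index: a map from each term to the documents that contain it. The toy version below is only a sketch of that idea in Python. It is not Lucene's API or file format, and the tokenizer handles plain ASCII words only (no Chinese word segmentation).

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase ASCII word tokens; real engines do much more than this."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """AND query: ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "Lucene is a full-text search library",
    2: "Nutch is a web crawler built on Lucene",
    3: "A crawler follows links",
}
index = build_index(docs)
print(search(index, "lucene crawler"))
```

Queries only touch the postings for their own terms, which is why lookups stay fast as the collection grows; ranking and index compression are separate problems.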
-bjrenzx1(机器卡);
2005-11-23
(#2624578@0)
-
Weakness of Lucene
-integ(空谷清音);
2005-11-24
{954}
(#2625948@0)
-
It is said that Lucene supports Chinese full-text search much better than MySQL. As for other databases like Oracle, I have not done any research yet.
-647i(-);
2005-11-24
(#2626674@0)
-
I am doing clustering algorithm for text categorization.
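The post doesn't say which algorithm is being used. As a generic illustration only, here is single-pass "leader" clustering over bag-of-words vectors with cosine similarity, written in Python; the 0.3 threshold and the sample texts are arbitrary assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_cluster(texts: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Each text joins the first cluster whose leader is similar enough,
    otherwise it starts a new cluster and becomes that cluster's leader."""
    leaders: list[Counter] = []
    clusters: list[list[int]] = []
    for i, text in enumerate(texts):
        vec = vectorize(text)
        for leader, members in zip(leaders, clusters):
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            leaders.append(vec)
            clusters.append([i])
    return clusters


texts = [
    "google search engine ranking",
    "search engine index ranking",
    "hockey game tonight",
    "hockey game score",
]
clusters = leader_cluster(texts)
print(clusters)
```

The result depends on input order and on the threshold, which is the usual trade-off for a one-pass method compared with iterative ones such as k-means.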
-liyaobin(BigBen);
2005-11-24
(#2626209@0)
-
another vivisimo-like system? it's too late, I think.
-benlin(默默向上游);
2005-11-25
(#2626978@0)
-
Hoho, so what is not too late in searching/mining field? Please?
-liyaobin(BigBen);
2005-11-25
(#2627835@0)
-
No offence. My company had a very good team and implemented a "vivisimo-like search engine" in July 2004. The timing was pretty good, because Google had just gone public on Nasdaq. But we never got enough VC funding. So basically I don't see a bright future for another such website.
I said the team was good because we had a Harvard professor, a professional manager in Silicon Valley, a technical manager from IBM China, and a senior programmer.
-benlin(默默向上游);
2005-11-25
{433}
(#2628567@0)
-
Thx for your info. My "hoho" actually means "xixi", nothing else. I am working on a small-scale automatic "google news", mostly from an academic perspective. Any comments? Thx again for your gracious reply.
-liyaobin(BigBen);
2005-11-26
(#2629326@0)
-
I'm also going to help a friend build one for fun. I'll be asking you all for advice when I run into problems.
-rabbitbug(兔八哥);
2005-11-26
(#2628750@0)
-
I built Sohu's first-generation Search Engine (part of it). hehe//
-mondaycat(catt);
2005-11-26
(#2629329@0)
-
So?
-liyaobin(BigBen);
2005-11-26
(#2629342@0)