Creating a Search Engine
I’ve been playing around with the idea of creating a vertical search engine in my spare time (not sure when that even is!) and came across a few really interesting documents that have been guiding me through the anatomy of the search field. Prior to this, I had no formal training in search beyond understanding how search marketing works and following Bruce Clay during my days in marketing at Conducive Corporation and in product management at FortuneCity.
The monster “thesis” that I have been reading thoroughly is The Anatomy of a Large-Scale Hypertextual Web Search Engine, written by the famous Sergey Brin and Larry Page during their time at Stanford University. I guess you cannot argue with Sergey and Larry as they’ve done it right so far… I mean, a market cap of $150.64B more than proves their concept.
I have also spent a fair amount of time reading up on the ht://Dig movement for Internet search engines. Their site breaks the search methodology down into three separate steps: digging, merging, and searching. It’s a solid primer that shows how important each of the three steps is.
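To make those three steps a little more concrete for myself, here is a toy sketch in Python. It is only my own loose illustration of the dig / merge / search idea, not how ht://Dig actually implements it. The function names, the sample pages, and the "match every query word" rule are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample "pages" standing in for crawled documents.
PAGES = {
    "example.com/a": "Vertical search engines index one topic",
    "example.com/b": "General search engines index the whole web",
    "example.com/c": "A recipe for sourdough bread",
}

def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def dig(pages):
    """Digging: walk every page and emit raw (word, url) pairs."""
    postings = []
    for url, text in pages.items():
        for word in tokenize(text):
            postings.append((word, url))
    return postings

def merge(postings):
    """Merging: fold the raw pairs into one index of word -> sorted urls."""
    index = defaultdict(set)
    for word, url in postings:
        index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

def search(index, query):
    """Searching: return the urls that contain every word in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    results = set(index.get(terms[0], []))
    for term in terms[1:]:
        results &= set(index.get(term, []))
    return sorted(results)

if __name__ == "__main__":
    index = merge(dig(PAGES))
    print(search(index, "search index"))
```

A real engine would add ranking, stemming, and an index that lives on disk, but even this tiny version shows why each step depends on the one before it.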
After a bunch of research not limited to the above, I went and read up on the Google Mini and the Google Search Appliance. I came to the conclusion that these hardware/software platforms are perfect, as they are essentially what I need: a rack-mounted server that plugs into a data center and, after a quick setup, lets me get on with customizing the front-end. I could purchase a unit for under $10,000 and be well on my way to indexing for my vertical search project… but not so fast.
I traded emails with Google’s Enterprise Team early this morning (Monday) and they told me that their products could do what I wanted (within reason) but their EULA does not allow for this. Here is their exact response:
Technically this is doable but the Google Mini EULA won’t allow you to do so. In fact, you may crawl any content that is located on servers that are owned and operated by you or servers that are operated for your benefit. For example, you may crawl your content that resides on servers operated by your hosting company.
Bummer. I thought I had found the shortcut to creating a search engine, but unfortunately it’s probably going to be a bit harder now. I’m not 100% sure what I’m going to use yet, or whether I’m going to go forward with this project, but I’m learning a ton from it and having a great time doing it. If you’ve got any technical search expertise, I’d love to chat with you to hear what you’ve deployed and what tradeoffs you’ve made in terms of hardware and software.
Also, to round out all of Google’s search products, I have been playing with Co-op, but I need something wholly owned by me, as well as the ability to customize the front-end and back-end, since the project is more than a search engine (though search plays a large role).