Haystack: 07/01/2007

July 28, 2007

Sanyo VPC-E1 Underwater Camcorder

I've been looking for the right video camera. Quite a challenge, since I'd like professional quality at the size and price of a pin.
For many people, the Flip may be the bee's knees, but I'd like to study some underwater issues -- duck dives, kayak rolls, trout, hatches.
I just ordered this Sanyo VPC-E1 in blue. As far as I can see, its only competition is the same model in white.
Other approaches to inexpensive YouTube-quality video plus some underwater capacity all depend upon an underwater housing. Not for me. We shall see.

next generation reading

Next generation of humans, that is.
Andrew Savikas of O'Reilly gave some of his blog space to two summer interns. Read all about their approach to teenage reading.

July 24, 2007

Flip Video

Here is a sample of Flip Video snatched off the web. Flip Video camera :: 30-60 minutes of decent video and almost decent sound. Ask me: "Is it easy to use?" I'll tell you no lie: "Yes".

July 23, 2007

Adobe AIR and Rich Internet Applications

Rich Internet Applications (RIAs) have been a long time coming, but tools are emerging that will make them affordable for a wide range of developers.

Adobe's AIR is one foundation; a complex suite of tools from Google is another.

With proper use, the two are compatible.

open source portals

Manageability maintains a list of open source portals written in Java. I'm looking for similar lists for:

Ruby that extends beyond Ruby on Rails
Adobe's platform

July 21, 2007

Web Trends 2007 v 2.0

Japan's Information Architects has published a subway-style map of the 200 most successful websites on the web, ordered by category, proximity, success, popularity and perspective.

The map is dense; it requires attention; it is likely to be useful to anyone who is "...exploring, defining and explaining the Internet strategy and positioning."

It's worthwhile to examine the IA home page as well as the map, since the home page includes discussion, feedback, etc.

July 11, 2007

Nutch

Nutch is an open source Java implementation of a search engine. It provides all of the tools you need to run your own search engine. But why would anyone want to run their own search engine? After all, there's always Google. There are at least three reasons.

Transparency. Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
Understanding. We don't have the source code to Google, so Nutch is probably the best we have. It's interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
Extensibility. Don't like the way other search engines display their results? Write your own search engine--using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.

Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In "Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:

... a complete system might cost anywhere between $800 per month for two-search-per-second performance over 100 million pages, to $30,000 per month for 50-page-per-second performance over 1 billion pages.
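Two of the crawling questions raised above, how to partition work between crawlers and how to cope with duplicate content, can be illustrated with a small Python sketch. This is not how Nutch does it; the function names (assign_crawler, content_fingerprint, drop_duplicates) are made up for illustration, and hashing the host name and the normalized page text are simply common, assumed choices.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Pick a crawler index for a URL by hashing its host name.

    Hypothetical helper: every URL on one host goes to the same
    crawler, which keeps per-site politeness rules in one place.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def content_fingerprint(text):
    """Fingerprint page text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first URL seen for each distinct content fingerprint.

    `pages` maps URL -> page text; only exact duplicates (after the
    normalization above) are caught, not near-duplicates.
    """
    seen = set()
    unique = {}
    for url, text in pages.items():
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique[url] = text
    return unique
```

Hashing on the host rather than the full URL is a design choice: it trades even load for the guarantee that one site is never hit by several crawlers at once.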

This series of two articles shows you how to use Nutch at the more modest intranet scale (note that you may see this term being used to cover sites that are actually on the public internet--the point is the size of the crawl being undertaken, which ranges from a single site to tens, or possibly hundreds, of sites). This first article concentrates on crawling: the architecture of the Nutch crawler, how to run a crawl, and understanding what it generates. The second looks at searching, and shows you how to run the Nutch search application, ways to customize it, and considerations for running a real-world system.
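To make "intranet scale" concrete, here is a minimal Python sketch of a crawl that stays on an allowed set of hosts and stops at a fixed link depth. It is an assumption-laden toy, not Nutch's crawler: the crawl function, the LINKS table and the example URLs are all hypothetical, and the link graph is held in memory so that nothing touches the network.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory link graph standing in for fetched pages.
LINKS = {
    "http://intranet.example/": [
        "http://intranet.example/docs",
        "http://elsewhere.example/",
    ],
    "http://intranet.example/docs": [
        "http://intranet.example/docs/faq",
        "http://intranet.example/",
    ],
    "http://intranet.example/docs/faq": [
        "http://intranet.example/docs/deep",
    ],
}


def crawl(seeds, links, allowed_hosts, max_depth):
    """Breadth-first crawl that stays on allowed hosts.

    Returns the visited URLs in visit order. Pages at `max_depth`
    are visited, but their outgoing links are not followed.
    """
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        if urlsplit(url).hostname not in allowed_hosts:
            continue  # off-site link: outside the scope of the crawl
        visited.append(url)
        if depth == max_depth:
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


pages = crawl(["http://intranet.example/"], LINKS, {"intranet.example"}, 2)
```

With the table above, the crawl visits the home page, /docs and /docs/faq; it skips elsewhere.example and never reaches /docs/deep, which sits beyond the depth limit.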

--thanks to java.net & Tom White for the summary
