If Amazon ran the arXiv
In a previous post I discussed the application of information-aggregation sites to the arXiv. (See also Daniel’s excellent links in the comments.) This post highlights a different underpinning of the Web 2.0: tracking and tailoring to user behaviour.
Amazon.com doesn’t administer the arXiv. (Cornell does.) But as one of the most successful Web 2.0 companies, it offers some important lessons.
Despite Amazon being a bazaar of internet commerce and the arXiv being a scientific resource, the two have a lot in common. Both are essentially large databases of information. Users of each service spend a lot of time perusing their options, looking for the [subjectively] right items to purchase/download.
But then there’s the big difference, and I’m not talking about requiring a credit card to complete one’s transaction. Amazon works with its data. It correlates user data with viewing/purchasing data to find ways to better match users with items that they would be interested in. It provides user feedback so one can read reviews and suggestions from millions of other buyers. It even provides customer-generated rankings of other customer reviewers, so you know which comments are `reputable.’ In short, it gathers every scrap of information that it can, analyses the hell out of it, and then uses it to make it a little bit easier for you to find the book you want and buy it from them.
That sounds like an admirable benchmark for the arXiv to aspire to, doesn’t it? But as I mentioned in my first post, the arXiv’s nature as a science tool makes it a delicate system to modify. Objectivity is absolutely critical, especially in a system that makes suggestions to users. So, how does one incorporate the features that have made Amazon so successful while maintaining the scientific integrity of the database?
Tell me what I want, then offer it to me
Let’s ignore pragmatism for now and think big. What one wants from an augmented arXiv (or SPIRES) is for it to identify which papers are of interest to you. At the end of the day it’s about you being able to efficiently gather the papers that you want to be reading.
The first lesson from Amazon is that the key to serving arXiv’s users is to personalise the experience. In an ideal world, we would have a system that configures itself to us over time.
The moment you log on to Amazon, you are offered an array of items which Amazon thinks you might be interested in. Maybe your favourite author published a new book. Or maybe you need new batteries for that gizmo you bought last week. Or maybe your child’s birthday is coming up and you’re looking for the `must-have-toy’ of the year. Amazon knows these things because Amazon knows us.
This is what we would want from an `intelligent’ arXiv. It should learn about us without us doing anything other than our usual paper-browsing. It should keep track of which papers we eventually downloaded, which we skipped, and perhaps even what we publish. After accumulating years of data, such a system’s neural networks (or whatever fancy machine-learning algorithm) would be able to use key words, meta-data, and trends from every other user to predict which papers would be of particular interest to me every morning.
Objectivity from Automation
What about the scientific mission? In my previous post, I raised a red flag that allowing user comments on the arXiv would open it up to all sorts of biased politicking.
Amazon is able to automatically offer me books that I find interesting by recording and analysing every item I ever look at, what I look at afterwards, and what I end up buying. It then correlates this information with the purchasing patterns of every other Amazon user and deduces which items I would be likely to purchase. And it does this all without looking at ratings/comments, just purchasing behaviour.
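To make the idea concrete, here is a minimal sketch of behaviour-only recommendation: papers are scored by how often they were downloaded by the same people who downloaded the papers you already have. This is a toy co-occurrence count, not Amazon's actual algorithm (which I don't know), and the user names, paper labels, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(downloads):
    """Count how often each pair of papers was downloaded by the same user."""
    co = defaultdict(Counter)
    for papers in downloads.values():
        for a, b in combinations(sorted(set(papers)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(user, downloads, co, top_n=3):
    """Rank unseen papers by co-download counts with the user's own papers."""
    seen = set(downloads.get(user, ()))
    scores = Counter()
    for paper in seen:
        for other, count in co[paper].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically so the output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [paper for paper, _ in ranked[:top_n]]


# Hypothetical download logs: no ratings, no comments, just behaviour.
downloads = {
    "alice": ["paper-A", "paper-B", "paper-C"],
    "bob": ["paper-A", "paper-B"],
    "carol": ["paper-B", "paper-C", "paper-D"],
    "dave": ["paper-A"],
}
co = build_cooccurrence(downloads)
print(recommend("dave", downloads, co))
```

Nobody in this example ever states an opinion. Dave gets pointed towards paper-B simply because the other people who downloaded paper-A most often went on to download it too.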
Amazon’s second lesson, then, is to use automated data processing to preserve objectivity.
It’s not perfect. But at least it allows users to voice their opinions through their actions (which are hard to bias) rather than their words. One would still have crackpots and opportunists trying to beat the system. But the arXiv is already very good at rejecting bots, and outliers would stick out statistically against `honest’ arXiv users. One could further restrict to tracking only `authenticated’ users with a university e-mail account, or implement similar measures to restrict to `good’ user data.
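As a rough illustration of what `sticking out statistically’ could mean, here is a sketch that flags users whose daily download counts sit far from the median, using the median absolute deviation as the yardstick. The choice of test, the threshold of 3.5, and the sample numbers are my own assumptions; a real system would need something far more careful.

```python
from statistics import median


def flag_outliers(counts, threshold=3.5):
    """Return users whose count is far from the median (modified z-score)."""
    values = list(counts.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the data, so nothing can be judged unusual
    return sorted(
        user
        for user, v in counts.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical downloads per day for five users; "u5" behaves like a bot.
counts = {"u1": 12, "u2": 9, "u3": 15, "u4": 11, "u5": 400}
print(flag_outliers(counts))
```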
Is it possible?
Ok, so maybe this all sounds a little Isaac Asimov. But if we suspend reservations about whether this is practical, it’s worth emphasizing that this is all possible. Amazon itself is the proof of principle. The software and hardware required are already in the commercial sector.
Even experimental particle physicists have been using neural networks for some time. I know of undergraduate computer scientists in the Bay Area (feeder to Silicon Valley) who could probably rewrite Amazon’s software to improve efficiency. [They assure me that they’re not building an army of AI Terminator robots to destroy humanity, and I assure them that we’re not building a black-hole/strangelet machine that will destroy the Earth.]
But the answer is yes, an Amazon-style arXiv is possible. That’s not what’s holding us back. So let’s look into the pitfalls of an Amazon-arXiv.
The economies of scale
If you ignore pre-existing booksellers like Barnes and Noble or Borders Books, Amazon has virtually no competition in the on-line book peddling business. This is because of the economies of scale. Amazon thrives on user data: the more people sign up and use their service, the better their personalisation algorithms become. (Even if people don’t end up purchasing from Amazon!)
In the heuristic graph above, Amazon’s effectiveness in offering targeted suggestions increases with the number of people who offer their viewing/purchasing data for analysis.
This, by the way, is the same reason why Wikipedia stands alone: the wiki-philosophy depends on having as large a user-base as possible offering revisions. Any competitor would just split the primary resource (users) so that neither site would be as strong as a single site with a monopoly.
There are two implications here.
First of all, pretenders won’t cut it. If you have sites like eprintweb and scirate trying to offer the same service, then they’ll suffocate each other. The services are only as good as the number of users, so competition hurts everyone.
Secondly, while it’s great when a service (like Amazon or Wikipedia) already has lots of users, it’s really hard to convince people to buy into such a service when there aren’t many users, i.e. when there’s a smaller corpus of data to feed into our software.
A good analogy (for readers of this blog at least) is that of a metastable vacuum: the current arXiv sits in a metastable vacuum of `efficiency in finding papers,’ point A in the diagram below.
There is a potential barrier (B) that makes it difficult for the community to go from A to C. If only a small fraction of the community buys into an Amazon-arXiv project, this ends up being more inconvenient since the personalisation software wouldn’t have enough data to be effective. At some critical participation rate, however, there is a phase transition and the system becomes more efficient, leading to more people signing up, which in turn improves the system’s efficiency, and so forth.
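The barrier picture can be played with in a few lines. The sketch below is a toy model of my own, not something fitted to real data: the participation fraction p changes each step by rate * p * (1 - p) * (p - p_crit), so it drifts back to zero below the critical rate and runs away towards full participation above it. The values of p_crit and rate are arbitrary assumptions.

```python
def simulate(p0, p_crit=0.3, rate=0.5, steps=200):
    """Evolve the participation fraction p under a toy bistable rule."""
    p = p0
    for _ in range(steps):
        # Below p_crit the service is more hassle than help, so users drift away;
        # above p_crit it is useful enough that sign-ups feed on themselves.
        p += rate * p * (1.0 - p) * (p - p_crit)
        p = min(1.0, max(0.0, p))
    return p


for p0 in (0.1, 0.25, 0.35, 0.6):
    print(p0, round(simulate(p0), 3))
```

Starting points just either side of the critical rate end up at opposite extremes, which is the whole problem: the early adopters are asked to sit on the wrong side of the barrier.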
The problem is reaching the critical `buy in’ rate: why would anyone sign up for a service that isn’t useful now, but may be useful if all your friends sign up? (Sounds like a pyramid scheme, eh?)
Okay, I signed up for Amazon.com because I wouldn’t be able to buy stuff from it otherwise. However, it would take promises of gold, frankincense, and myrrh for me to give up my e-mail to sign up for yet another service leeching user data.
Currently, the arXiv is open to anyone. Requiring registration would be an inconvenience that would be very difficult to justify. It’s a small inconvenience, but one multiplied by a large number of users. If registration is optional, then there’s no immediate incentive for people to register or, having previously registered, to remember to sign in before searching.
This defeats the purpose of having an Amazon-arXiv: convenience. The ideal system would have users doing nothing beyond what they currently do to search the arXiv, and then having the system automatically cater to the user over time. If users have to remember to log in regularly, this becomes a regular annoyance long before the system is able to provide any tangible benefit.
This is the main hurdle (potential barrier, if you will) preventing any such system from being implemented.
One possible work-around would be to avoid tracking users directly, but instead through their IP addresses. The system loses a bit of precision because users might use a different computer at home versus in the office, but in principle this would provide 100% participation. This becomes nontrivial when one expects the arXiv to be able to tell apart individual users behind university networks; this is the same `problem’ the RIAA faces when trying to identify college students to threaten with lawsuits. A less elegant solution would be an arXiv cookie, but this becomes browser-dependent rather than user-dependent (e.g. searches made on public computers would be lost data).
During the preparation of this post, Amazon released its e-reader, the Kindle. I’ve long been a believer that digital paper would be the `killer app’ for the arXiv. It would allow researchers to easily carry around large numbers of papers, perhaps hyperlinked to one another and with the researcher’s digital notes scribbled in the digital margins. I even have a wishlist of features that I wrote over a year ago.
Based on what little I know about it, the Kindle is not the killer app I’ve envisioned. But it’s a step in the right direction. For now it’s a bit limited in format and distribution method. Its software is meant for passive reading, with only limited functions for active scribbling. (One doesn’t type in comments when reading a paper; one needs to circle, make arrows, do calculations… preferably in different colors.) Who knows. Maybe the Kindle is the first step towards a viable pdf e-book/e-print reader for scientists. The issue, of course, is demand: is there a large enough number of consumers who want such a product enough to motivate the market to develop it?