If Amazon ran the arXiv
In a previous post I discussed the application of information-aggregation sites to the arXiv. (See also Daniel’s excellent links in the comments.) This post highlights a different underpinning of Web 2.0: tracking user behaviour and tailoring the service to it.
Amazon.com doesn’t administer the arXiv. (Cornell does.) But as one of the most successful Web 2.0 companies, it offers some important lessons.
Despite Amazon being a bazaar of internet commerce and the arXiv being a scientific resource, the two have a lot in common. Both are essentially large databases of information. Users of each service spend a lot of time perusing their options, looking for the [subjectively] right items to purchase/download.
But then there’s the big difference, and I’m not talking about requiring a credit card to complete one’s transaction. Amazon works with its data. It correlates user data with viewing/purchasing data to find ways to better match users with items that they would be interested in. It provides user feedback so one can read reviews and suggestions from millions of other buyers. It even provides customer-generated rankings of other customer reviewers, so you know which comments are `reputable.’ In short, it gathers every scrap of information that it can, analyses the hell out of it, and then uses it to make it a little bit easier for you to find the book you want and buy it from them.
That sounds like an admirable benchmark for the arXiv to aspire to, doesn’t it? But as I mentioned in my first post, the arXiv’s nature as a science tool makes it a delicate system to modify. Objectivity is absolutely critical, especially in a system that makes suggestions to users. So, how does one incorporate the features that have made Amazon so successful while maintaining the scientific integrity of the database?
Tell me what I want, then offer it to me
Let’s ignore pragmatism for now and think big. What one wants from an augmented arXiv (or SPIRES) is for it to identify which papers are of interest to you. At the end of the day it’s about you being able to efficiently gather the papers that you want to be reading.
The first lesson from Amazon is that the key to serving arXiv’s users is to personalise the experience. In an ideal world, we would have a system that configures itself to us over time.
The moment you log on to Amazon, you are offered an array of items which Amazon thinks you might be interested in. Maybe your favourite author published a new book. Or maybe you need new batteries for that gizmo you bought last week. Or maybe your child’s birthday is coming up and you’re looking for the `must-have-toy’ of the year. Amazon knows these things because Amazon knows us.
This is what we would want from an `intelligent’ arXiv. It should learn about us without us doing anything other than our usual paper-browsing. It should keep track of which papers we eventually downloaded, which we skipped, and perhaps even what we publish. After accumulating years of data, such a system’s neural networks (or whatever fancy machine-learning algorithm) would be able to use key words, meta-data, and trends from every other user to predict which papers would be of particular interest to me every morning.
Objectivity from Automation
What about the scientific mission? In my previous post, I raised a red flag that allowing user comments on the arXiv would open it up to all sorts of biased politicking.
Amazon is able to automatically offer me books that I find interesting by recording and analysing every item I ever look at, what I look at afterwards, and what I end up buying. It then correlates this information with the purchasing patterns of every other Amazon user and deduces which items I would be likely to purchase. And it does this all without looking at ratings/comments, just purchasing behaviour.
Amazon’s second lesson, then, is to use automated data processing to preserve objectivity.
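To make the behaviour-only idea concrete, here is a minimal sketch of a co-download recommender. It is an illustration under my own assumptions, not a description of Amazon's actual (and presumably far more sophisticated) algorithm: the function names, the paper identifiers, and the toy download histories are all hypothetical. It scores an unseen paper by how often it was downloaded by the same users as papers you already downloaded, and never looks at ratings or comments.

```python
from collections import defaultdict
from itertools import combinations


def co_download_counts(histories):
    """Count how often each pair of papers was downloaded by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for papers in histories.values():
        # Each unordered pair in one user's history counts once.
        for a, b in combinations(sorted(set(papers)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, histories, top_n=3):
    """Rank papers the user has not downloaded by co-download score."""
    seen = set(histories[user])
    counts = co_download_counts(histories)
    scores = defaultdict(int)
    for paper in seen:
        for other, n in counts[paper].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken by identifier so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [paper for paper, _ in ranked[:top_n]]


# Hypothetical download histories (user -> papers downloaded).
histories = {
    "alice": ["paper-A", "paper-B", "paper-C"],
    "bob": ["paper-A", "paper-B", "paper-D"],
    "carol": ["paper-B", "paper-D"],
    "dave": ["paper-A"],
}

print(recommend("dave", histories))
```

Since the only input is what people actually downloaded, gaming the ranking means generating a lot of fake behaviour rather than writing one glowing comment.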
It’s not perfect. But at least it allows users to voice their opinions through their actions (which are hard to bias) rather than their words. One would still have crackpots and opportunists trying to beat the system. But the arXiv is already very good at rejecting bots, and outliers would stick out statistically against `honest’ arXiv users. One could further restrict to tracking only `authenticated’ users with a university e-mail account, or implement similar measures to restrict to `good’ user data.
Is it possible?
Ok, so maybe this all sounds a little Isaac Asimov. But if we suspend reservations about whether this is practical, it’s worth emphasizing that this is all possible. Amazon itself is the proof of principle. The software and hardware required are already in use in the commercial sector.
Even experimental particle physicists have been using neural networks for some time. I know of undergraduate computer scientists in the Bay Area (feeder to Silicon Valley) who could probably rewrite Amazon’s software to improve efficiency. [They assure me that they’re not building an army of AI Terminator robots to destroy humanity, and I assure them that we’re not building a black-hole/strangelet machine that will destroy the Earth.]
But the answer is yes, an Amazon-style arXiv is possible. That’s not what’s holding us back. So let’s look into the pitfalls of an Amazon-arXiv.
The economies of scale
If you ignore pre-existing booksellers like Barnes and Noble or Borders Books, Amazon has virtually no competition in the on-line book peddling business. This is because of the economies of scale. Amazon thrives on user data: the more people sign up and use their service, the better their personalisation algorithms become. (Even if people don’t end up purchasing from Amazon!)
In the heuristic graph above, Amazon’s effectiveness in offering targeted suggestions increases with the number of people who offer their viewing/purchasing data for analysis.
This, by the way, is the same reason why Wikipedia stands alone: the wiki-philosophy depends on having as large a user-base as possible offering revisions. Any competitor would just split the primary resource (users) so that neither site would be as strong as a single site with a monopoly.
There are two implications here.
First of all, pretenders won’t cut it. If you have sites like eprintweb and scirate trying to offer the same service, then they’ll suffocate each other. The services are only as good as the number of users, so competition hurts everyone.
Secondly, while it’s great when a service (like Amazon or Wikipedia) already has lots of users, it’s really hard to convince people to buy into such a service when there aren’t many users, i.e. when there’s a smaller corpus of data to feed into our software.
A good analogy (for readers of this blog at least) is that of a metastable vacuum: the current arXiv is a metastable vacuum of `efficiency in finding papers,’ point A in the diagram below.
There is a potential barrier (B) that makes it difficult for the community to go from A to C. If only a small fraction of the community buys into an Amazon-arXiv project, this ends up being more inconvenient since the personalisation software wouldn’t have enough data to be effective. At some critical participation rate, however, there is a phase transition and the system becomes more efficient, leading to more people signing up, which in turn improves the system’s efficiency, and so forth.
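The A-B-C picture can be turned into a toy simulation. The sketch below is purely heuristic: the update rule, the threshold of 0.3, and the rate are assumptions I picked for illustration, not anything measured. Here p is the fraction of the community participating, participation shrinks when p is below the threshold (the service isn't worth the hassle) and grows when p is above it (the service is useful, so more people join).

```python
def simulate_adoption(p0, threshold=0.3, rate=0.5, steps=200):
    """Evolve the participation fraction p from its starting value p0.

    Assumed toy dynamics: dp = rate * p * (1 - p) * (p - threshold).
    p = 0 and p = 1 are stable; p = threshold is the barrier between them.
    """
    p = p0
    for _ in range(steps):
        p += rate * p * (1.0 - p) * (p - threshold)
        p = min(1.0, max(0.0, p))  # keep p a valid fraction
    return p


# Starting just below the barrier, participation dies away;
# starting just above it, participation runs away towards everyone.
below = simulate_adoption(0.2)
above = simulate_adoption(0.4)
print(f"start at 0.2 -> {below:.3f}, start at 0.4 -> {above:.3f}")
```

The whole difficulty is in the initial condition: nothing in these dynamics helps you get from below the threshold to above it.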
The problem is reaching the critical `buy in’ rate: why would anyone sign up for a service that isn’t useful now, but may be useful if all your friends sign up? (Sounds like a pyramid scheme, eh?)
Signing up
Okay, I signed up for Amazon.com because I wouldn’t be able to buy stuff from it otherwise. However, it would take promises of gold, frankincense, and myrrh for me to give up my e-mail to sign up for yet another service leeching user data.
Currently, the arXiv is open to anyone. Requiring registration would be an inconvenience that would be very difficult to justify. It’s a small inconvenience, but multiplied by a large number of users. If registration is optional, then there’s no immediate incentive for people to register or—having previously registered—to remember to sign in before searching.
This defeats the purpose of having an Amazon-arXiv: convenience. The ideal system would have users doing nothing beyond what they currently do to search the arXiv, and then having the system automatically cater to the user over time. If users have to remember to log in regularly, this becomes a regular annoyance long before the system is able to provide any tangible benefit.
This is the main hurdle (potential barrier, if you will) preventing any such system from being implemented.
One possible workaround would be to track users not directly but through their IP addresses. The system loses a bit of precision because users might use a different computer at home versus in the office, but in principle this would provide 100% participation. This becomes nontrivial when one expects the arXiv to be able to identify users behind university networks; this is the same `problem’ the RIAA faces when trying to identify college students to threaten with lawsuits. A less elegant solution would be an arXiv cookie, but this becomes browser-dependent rather than user-dependent (e.g. searches made on public computers would be lost data).
Bonus: e-Readers
During the preparation of this post, Amazon released its e-reader, the Kindle. I’ve long been a believer that digital paper would be the `killer app’ for the arXiv. It would allow researchers to easily carry around large numbers of papers, perhaps hyperlinked to one another and with the researcher’s digital notes scribbled in the digital margins. I even have a wishlist of features that I wrote over a year ago.
Based on what little I know about it, the Kindle is not the killer app I’ve envisioned. But it’s a step in the right direction. For now it’s a bit limited in format and distribution method. Its software is meant for passive reading, with only limited functions for active scribbling. (One doesn’t type in comments when reading a paper, one needs to circle, make arrows, do calculations… preferably in different colors.) Who knows. Maybe the Kindle is the first step in a viable pdf e-book/e-print reader for scientists. The issue, of course, is demand: is there a large enough number of consumers who want such a product enough to motivate the market to develop it?
Filed under: Science 2.0 | 7 Comments
There are quite a few eInk based devices out there…
The iRex iLiad seems to be the best device out there at the moment…
it actually interfaces with a stylus and seems to be really sensibly designed.
All the links are on the wikipedia page http://en.wikipedia.org/wiki/E-book_device
If any of your readers wish to purchase me one… I’d be more than happy to give a complete review!!
Simon
Browsing could be open but recommendations would require login. The initial recommendation engine, without prior knowledge of user tastes, does not have to be random/useless. It could be based on the reference lists of the papers, and even without user information it could be interesting enough to get people to log in.
The network effects you mention have interesting implications. Intuitively, at least, they will lead to a monopoly that tends to hinder innovation. I think it is important to find ways to develop these “social” applications in a way that the underlying social aspects are not tied to any single application but can be used by anyone. This way any application can benefit from the network effect without creating any lock-in monopolies.
Hi Pedro! I’m intrigued by the idea of having a ‘seed’ database, but it seems that this would be human-generated and hence manifestly not objective.
Regarding the idea of a monopoly, I think the point that you’re trying to make is that Amazon.com’s machine-learning algorithms are [probably] proprietary and closed to the public. Some engineers at Amazon wrote code to parse data, and nobody gets to see this code. For a system like an Amazon-style arXiv, this needn’t be the case. The AI engine could be completely open source.
In fact, one could allow users to tweak their own machine-learning parameters, or write their own code to parse through the database of user behaviour. My point with the `economies of scale’, however, is that there should only be ONE such database. It should be accessible to everyone, and they can use their own software, but it becomes ineffective if there are five different arXivs each taking different data about its users.
Regarding the priors for paper recommendation, they do not have to be human-generated. If no user habits are available, the recommendations could be based simply on reference analysis. This is what PubMed does already, if I am not mistaken. It could even have a small tweak if the user inputs a list of keywords when they register. From there, the more the user uses the system, the more the recommendations would shift from the ones that everyone gets based on reference analysis towards user-specific tastes.
I was not referring to a monopoly on the proprietary code. The monopoly comes from closing off the user information. The more users, the more useful a site becomes, and if this site does not share this user data then it can become a monopoly not because the code or site is better than others (technologically) but because the users will not move somewhere else where no-one is. That is why the underlying user information should be open to spur innovation. Like you say, the database of user data should be available for anyone and then different interfaces could compete to provide a better experience. 5 different arXivs with separate user information databases would just be fragmenting the space and hurting themselves.
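The cold-start scheme described above (start from reference analysis, then drift towards user-specific tastes as behaviour accumulates) could be sketched as follows. This is only an illustration under assumed choices: the Jaccard overlap of reference lists as the "reference analysis", the blending formula, and the `half_weight` parameter are all hypothetical and not taken from PubMed or any existing system.

```python
def reference_similarity(refs_a, refs_b):
    """Jaccard overlap between two papers' reference lists (0.0 to 1.0)."""
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def blended_score(reference_score, behaviour_score, n_actions, half_weight=20):
    """Mix the two scores, trusting behaviour more as data accumulates.

    With no recorded actions the result is the pure reference-based score;
    after `half_weight` actions the two scores count equally.
    """
    w = n_actions / (n_actions + half_weight)
    return (1.0 - w) * reference_score + w * behaviour_score


# Two hypothetical papers sharing two of four distinct references.
print(reference_similarity(["r1", "r2", "r3"], ["r2", "r3", "r4"]))

# A brand-new user gets the reference-based score unchanged.
print(blended_score(0.8, 0.2, n_actions=0))
```

A new user therefore sees the same recommendations as everyone else, and nothing human-curated is needed to seed the system.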
Quite funny- while reading the first few lines of your post, I mis-read information aggregation as “information aggression”. I think it is what is happening these days a bit everywhere. Not a new phenomenon, but becoming worse with the advent of automated blogs.
Cheers,
T.