The AI set of functions

I recently read an article by Y. Bengio and Y. LeCun called “Scaling Learning Algorithms towards AI”. You can also find it as a book chapter in “Large-Scale Kernel Machines”, L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds.), MIT Press, 2007.

In some respects it is an “opinion paper”, where the authors advocate for deep learning architectures and their vision of Machine Learning. However, I think the main message is extremely relevant. I was actually surprised to see how much it agrees with my own opinions.
Here is how I would summarize it:

- no learning algorithm can be completely universal, due to the “no free lunch” theorem
- that’s not such a big problem: we don’t care about the set of all possible functions
- we care about the “AI set”, which contains the functions useful for vision, language, reasoning, etc.
- we need to create learning algorithms with an inductive bias towards the AI set
- the models should “efficiently” represent the functions of interest, in terms of having low Kolmogorov complexity
- researchers have exploited the “smoothness” prior extensively with non-parametric methods; however, many manifolds of interest have strong local variations (see the sketch after this list)
- we need to explore other types of priors, more appropriate to the AI set.
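
To make the “smoothness” point concrete, here is a minimal toy sketch (my own illustration, not from the paper): a nearest-neighbor regressor, which relies purely on local smoothness, fails on a rapidly varying target when the training data is too sparse to capture the variations.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# A rapidly varying target: smoothness only holds at a very fine scale.
def target(x):
    return np.sin(200 * x)

# Sparse sample: only ~3 training points per oscillation of the target.
x_train = rng.uniform(0, 1, size=100)[:, None]
y_train = target(x_train.ravel())

x_test = np.linspace(0, 1, 2000)[:, None]
y_test = target(x_test.ravel())

knn = KNeighborsRegressor(n_neighbors=3).fit(x_train, y_train)
mse = ((knn.predict(x_test) - y_test) ** 2).mean()

# The test error stays close to the variance of the target itself,
# i.e. the smoothness prior alone buys us almost nothing here.
print(f"k-NN test MSE: {mse:.3f} (target variance: {y_test.var():.3f})")
```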

The authors then give two examples of such “broad” priors: the sharing of weights in convolutional networks (inspired by translation invariance in vision) and the use of multi-layer architectures (which can be seen as levels of increasing abstraction).
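
As a toy illustration of the first prior (mine, not the authors’), note how weight sharing collapses the number of free parameters: the same small kernel is reused at every image location, hard-coding translation invariance.

```python
import numpy as np

image = np.random.rand(32, 32)
kernel = np.random.rand(3, 3)  # the only weights: shared across all locations

# Naive "valid" 2D convolution: the same 9 weights are applied everywhere.
out = np.zeros((30, 30))
for i in range(30):
    for j in range(30):
        out[i, j] = (image[i:i + 3, j:j + 3] * kernel).sum()

# A fully connected layer with the same input and output sizes would need
# 32*32 * 30*30 = 921,600 weights instead of 9.
print(out.shape, kernel.size)
```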

Of course, this is where many alternatives open up! Many other useful inductive biases could be found. That’s where I think we should focus our research efforts! :)

NYC Machine Learning Symposium 2010

The event took place yesterday at the New York Academy of Sciences, a building right next to the World Trade Center. The views from the 40th floor were breathtaking:

The names of the participants in the room were no less impressive, including (in no particular order): Corinna Cortes (Google), Rob Schapire and David Blei (Princeton University), John Langford and Alex Smola (Yahoo), Yann LeCun (NYU), Sanjoy Dasgupta (University of California), Michael Collins (MIT), and Patrick Haffner (AT&T), among many others.

I particularly liked seeing the latest developments in LeCun’s group, including a demo by Benoit Corda and Clément Farabet on speeding up Convolutional Neural Networks with GPUs and FPGAs.
Alex Kulesza and Ben Taskar presented nice work on “Structured Determinantal Point Processes”, which can be seen as a probabilistic model with a bias towards diversity of the hidden structures.
Matthew Hoffman (with D. Blei and F. Bach) used stochastic gradient descent (widely used in the neural network community) for online training of topic models. Sean Gerrish and D. Blei actually had a funny application of topic models to the prediction of votes by Senators!
I was also happy to see that there is some Machine Learning being applied to the problem of sustainability and the environment. Gregory Moore and Charles Bergeron had a poster on trash detection in lakes, rivers and oceans.
To conclude, the best student paper award went to a more theoretical paper by Kareem Amin, Michael Kearns and Umar Syed (U Penn) called “Parametric Bandits, Query Learning, and the Haystack Dimension”, which defines a measure of complexity for multi-armed bandit problems in which the number of actions can be infinite (there is some analogy to the role of VC-dimension in other learning models).
There were probably many other interesting posters worth being mentioned, but I didn’t have the chance to check them all!
On the personal side: my summer internship at NEC Labs with David Grangier is about to finish. It was an amazing learning experience and I am very grateful for it.
Next step: back to Idiap Research Institute, EPFL and all the Swiss lakes and mountains! :)

Machine Learning recent sites

In the last few months (in which I haven’t posted on this blog), a few interesting web platforms related to Machine Learning popped up, most notably:

MLcomp.org – you can upload your datasets and/or your algorithms, and experiments will run automatically. Then you can see statistics related to classifier performance and computation time. It is intended to help researchers and practitioners compare different methods, and it works as a collaborative platform where code and data can be shared.
MetaOptimize.com – it contains a great Q&A site about Machine Learning and related topics, using the same web platform Stack Overflow has for programming topics.
I find these two websites a great way to improve collaboration among the ML community. Highly recommended!
The last link is more market-oriented, and it comes from Google:
Google Predict – it puts together well-established ML algorithms in an API that developers can use to make predictions on their own datasets.

Open PhD and Postdoc positions

My supervisor is leading a new European project called MASH, which stands for “Massive Sets of Heuristics”. There are open positions here in Switzerland, as well as in France, Germany and the Czech Republic.
The goal is to solve complex vision and goal planning problems in a collaborative way. It will be tested on 3D video games and also on a real robotic arm. Collaborators will submit pieces of code (heuristics) that can help the machine solve the problem at hand. In the background, machine learning algorithms will be running to choose the best heuristics.
If you are interested in probabilities, applied statistics, information theory, signal processing, optimization, algorithms and C++ programming, you might consider applying!

Gmail Machine Learning

I just quickly tried the new Gmail Labs feature “Got the wrong Bob?” and it actually works quite nicely! I put in some email addresses of family members, followed by the address of an old professor who has the same first name as one of my cousins, and… Gmail found it! :) It suggested right away that I change to the correct person, based on context!

The other new feature, called “Don’t forget Bob”, is probably simpler, but quite useful as well. As I typed the names of some close friends, I got suggestions of other friends I often email together with them.
I wonder if the models behind this feature are very complicated. Probably not. I guess one just has to estimate the probability of each email address in our contacts appearing in the “To:” field, given the addresses we have already typed. To estimate these probabilities, a frequentist approach suffices: count how many times each co-occurrence happened in the past. With this in hand, “Got the wrong Bob?” will notice unlikely email addresses and “Don’t forget Bob” will suggest likely ones that are missing.
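
Here is a minimal sketch of that frequentist idea (pure guesswork on my part, surely far simpler than whatever Gmail actually runs): count how often each pair of contacts co-occurred in past emails, then score candidates against the addresses already typed.

```python
from collections import Counter
from itertools import combinations

# Toy history of past emails (each entry is the set of recipients).
history = [
    {"alice", "bob", "carol"},
    {"alice", "bob"},
    {"alice", "bob", "carol"},
    {"dave", "erin"},
]

# Count pairwise co-occurrences in the "To:" field.
pair_counts = Counter()
for recipients in history:
    for a, b in combinations(sorted(recipients), 2):
        pair_counts[(a, b)] += 1

def score(candidate, typed):
    """How often 'candidate' appeared together with the typed addresses."""
    return sum(pair_counts[tuple(sorted((candidate, t)))] for t in typed)

typed = {"alice", "bob"}
others = {"carol", "dave", "erin"}

# "Don't forget Bob": suggest the contact seen most often with the typed ones.
print(max(others, key=lambda c: score(c, typed)))  # -> carol

# "Got the wrong Bob?": flag typed addresses that rarely co-occur with the rest.
for t in typed:
    print(t, score(t, typed - {t}))
```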

I think it’s a really cool idea, in the same spirit as the “Forgotten Attachment Detector”. A bit of machine learning helping daily life!

Machine Learning artwork

Today I tried out a great site to generate tag clouds, called wordle.net. I rendered some images just by copy-pasting the text from the Wikipedia article on machine learning.

The results were pretty cool and I guess one could print awesome t-shirts with them. What do you say?

One could also use them as wallpapers:

Update [2013] : due to an issue migrating images from blogspot, I kept only two images here.

Vapnik’s picture explained

This is an extremely geeky picture! :) Let’s try to explain it:

First of all, as many of you know, the gentleman in the picture is Prof. Vladimir Vapnik. He is famous for his fundamental contributions to the field of Statistical Learning Theory, such as the Empirical Risk Minimization (ERM) principle, VC-dimension and Support Vector Machines.

Then we notice the sentence on the board: it resembles the famous “All your base are belong to us”! This is a piece of geek culture that emerged from a “broken English” translation of a Japanese video game for the Sega Mega Drive.

Wait, but they replaced the word “Base” by “Bayes”!?
Yes, that Bayes, the British mathematician known for Bayes’ theorem.
Okay, seems fair enough, we are dealing with people from statistics…

Just when we think things cannot get any geekier, we realize there is a scary inequality written at the top of the whiteboard:

My goodness, what’s this?! Okay, that’s when things get really technical:
This is a probabilistic bound on the expected risk of a classifier under the ERM framework. In simple terms, it relates the classifier’s expected test error to its training error on a dataset of size ℓ, where N is the cardinality of the set of loss functions.
If I’m not mistaken, the bound holds with probability (1 – η) and applies only to loss functions bounded above by 1.
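
For reference, a standard bound of this type (for a finite set of N loss functions with values in [0, 1], obtained via Hoeffding’s inequality plus a union bound; the expression on the board may differ in its constants) states that, with probability at least 1 – η, simultaneously for all N functions:

$$ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{\ln N + \ln(1/\eta)}{2\ell}} $$

Here R(α) is the expected risk, R_emp(α) the empirical (training) risk, and ℓ the number of training samples.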

Sweet! Now that we got the parts, what’s the big message?

Well, it’s basically a statement about the superiority of Vapnik’s learning theory over the Bayesian alternative. In a nutshell, the Bayesian perspective is that we start with some prior distribution over a set of hypotheses (our beliefs) and update it according to the data that we see. We then look for an optimal decision rule based on the posterior distribution.
In Vapnik’s framework, on the other hand, there are no explicit priors, nor do we try to estimate the probability distribution of the data. This is motivated by the fact that density estimation is an ill-posed problem, so we want to avoid this intermediate step. The goal is to directly minimize the probability of making a bad decision in the future. If implemented through Support Vector Machines, this boils down to finding the decision boundary with maximal margin separating the classes.
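
For concreteness, here is a minimal sketch of that last step (using scikit-learn, a tool that obviously postdates the theory): a linear SVM on toy 2D data finds the maximal-margin hyperplane, with no priors or density estimates anywhere.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two linearly separable 2D point clouds.
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# A large C approximates the hard-margin SVM: maximize the margin
# subject to separating the classes; no density estimation involved.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)  # geometric width of the margin
print(f"margin width: {margin:.3f}, support vectors: {len(clf.support_)}")
```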

And that’s it, folks! I hope you had fun decoding this image! :)

Computer Vision vs Computer Graphics

If I had to explain what computer vision is all about, in just one snapshot, I would show you this:

Computer Graphics algorithms go from the parameter space to the image space (rendering); computer vision algorithms do the opposite (inverse rendering). Because of this, computer vision is basically a (very hard) problem of statistical inference.
The common approach nowadays is to build a classifier for each kind of object and then search over (part of) the parameter space explicitly, normally by scanning the image over all possible locations and scales (see the sketch below). The remaining challenge is still huge: how can a classifier learn and generalize, from a finite set of examples, which characteristics of an object are fundamental (shape, color) and which are irrelevant (changes in illumination, rotations, translations, occlusions, etc.)?
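
That exhaustive search usually takes the form of a sliding-window scan. Here is a minimal sketch (real detectors are far more optimized); the `score_fn` stands in for a trained classifier:

```python
import numpy as np

def sliding_window_detect(image, score_fn, base=24, step=4,
                          scales=(1.0, 1.5, 2.0), threshold=0.52):
    """Scan every location and scale; keep windows whose score passes."""
    detections = []
    H, W = image.shape
    for s in scales:
        win = int(base * s)  # object size searched for at this scale
        for y in range(0, H - win + 1, step):
            for x in range(0, W - win + 1, step):
                patch = image[y:y + win, x:x + win]
                if score_fn(patch) > threshold:
                    detections.append((x, y, win))
    return detections

# Usage with a dummy scorer (a trained classifier would go here):
image = np.random.rand(128, 128)
dummy_score = lambda patch: patch.mean()  # stand-in for a real classifier
print(len(sliding_window_detect(image, dummy_score)))
```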
This is what is keeping us busy! ;)

PS – Note that changes in illumination induce apparent changes in the color of the object and rotations induce apparent changes in shape!