Published on Nov 14, 2015
The idea of a universal anytime intelligence test is introduced here. The meaning of the terms “universal” and “anytime” is manifold: the test should be able to measure the intelligence of any biological or artificial system that exists at this time or in the future.
It should also be able to evaluate both inept and brilliant systems (any intelligence level) as well as very slow to very fast systems (any time scale). Also, the test may be interrupted at any time, producing an approximation to the intelligence score, in such a way that the more time is left for the test, the better the assessment will be.
In order to do this, the test proposal builds on work from the late 1990s on measuring machine intelligence using Kolmogorov complexity and universal distributions (C-tests and compression-enhanced Turing tests). It is also based on the more recent idea of measuring intelligence through dynamic/interactive tests held against a universal distribution of environments. Some of these tests are analysed and their limitations highlighted, so as to construct a test that is both general and practical. Consequently, ideas for a new definition of “universal intelligence” are introduced in order to design new “universal intelligence tests”, where a feasible implementation has been a design requirement. One of these tests is the “anytime intelligence test”, which adapts to the examinee’s level of intelligence in order to obtain an intelligence score within a limited time.
Works on enhancing or substituting the Turing Test with inductive inference tests were developed, using Solomonoff prediction theory and related notions such as the Minimum Message Length (MML) principle. This resulted in the introduction of induction-enhanced and compression-enhanced Turing tests. The basic idea was to construct a test as a set of series whose shortest pattern had no alternative projectible patterns of similar complexity. That means that the “explanation” of the series had to be much more plausible than other plausible hypotheses.
The definition was given as the result of a test, called the C-test, formed by computationally-obtained series of increasing complexity. The sequences were formatted and presented in a way quite similar to psychometric tests and, as a result, the test was administered to humans, showing a high correlation with the results of a classical psychometric (IQ) test on the same individuals. Nonetheless, the main goal was that the test could eventually be administered to other kinds of intelligent beings and systems. This was planned to be done, but later work showed that machine learning programs could be specialised in such a way that they could score reasonably well on some of the typical IQ tests.
This unexpected result confirmed that C-tests had important limitations and could not be considered universal, i.e., embracing the whole notion of intelligence, but perhaps only a part of it. Other intelligence tests using ideas from algorithmic information theory or compression theory have also been developed. Recent works by Legg and Hutter gave a new definition of machine intelligence, dubbed “universal intelligence”, also grounded in Kolmogorov complexity and Solomonoff’s (“inductive inference” or) prediction theory. The key idea is that the intelligence of an agent is evaluated as some kind of sum (or weighted average) of performances in all the possible environments. Taking Legg and Hutter’s definition of universal intelligence as a basis, their work was refined and improved. First, some issues requiring clarification or correction were addressed and, once these were clarified, an anytime universal intelligence test was developed.
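For reference, Legg and Hutter’s definition can be sketched as follows (a reconstruction in their usual notation, since this summary does not reproduce the formula itself): the universal intelligence Υ of an agent π is a complexity-weighted sum, over all environments μ, of the expected total reward V the agent obtains in each:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi},
\qquad
V_{\mu}^{\pi} \;=\; \mathbb{E}\!\left( \sum_{i=1}^{\infty} r_i \right)
```

Here E is the set of all environments, K(μ) is the Kolmogorov complexity of μ, and r_i are the rewards received at each interaction step.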
The above definition captures one of the broadest definitions of intelligence: “the ability to adapt to a wide range of environments”. However, there are three obvious problems in making this definition practical. First, there are two infinite sums in the definition: one is the sum over all environments, and the second is the sum over all possible actions (the agent’s life in each environment is infinite). And, finally, K is not computable.
Thus, making a random finite sample of environments, limiting the number of interactions or cycles of the agent with the environment, and using some computable variant of K is sufficient to make it a practical test.

4.1 SAMPLING ENVIRONMENTS

Among the infinite number of environments, many (either simple or complex) will be completely useless for evaluating intelligence: environments that stop interacting, environments with constant rewards, or environments that are very similar to other previously used environments. Including some, or most, of them in the sample of environments is a waste of testing resources; if we are able to make a more accurate sample, we will be able to make a more efficient test procedure. In an interactive setting, a clear requirement for an environment to be discriminative is that what the agent does must have consequences on rewards.
Without any restriction, many (most) simple environments would be completely insensitive to agents’ actions. So, the set of environments is restricted to those that are sensitive to agents’ actions. That means that a wrong action (e.g., going through a wrong door) might lead the agent to a part of the environment from which it can never return, but at least the actions taken by the agent can still modify the rewards in that subenvironment. More precisely, we want an agent to be able to influence rewards at any point in any subenvironment. Such an environment is known as a reward-sensitive environment.
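The sampling idea above can be sketched in a few lines. This is a minimal illustration, not the paper’s actual procedure: `complexity` stands in for a computable bound on K (the real test uses a time-bounded variant), and `is_discriminative` is a placeholder for the reward-sensitivity check; all names are hypothetical.

```python
import random

def complexity(env_id: int) -> int:
    """Proxy for K: length of the binary description of the environment."""
    return env_id.bit_length()

def is_discriminative(env_id: int) -> bool:
    """Placeholder filter standing in for the reward-sensitivity test
    (here we arbitrarily treat odd ids as discriminative)."""
    return env_id % 2 == 1

def sample_environment(max_id: int, rng: random.Random) -> int:
    """Draw an environment id with probability proportional to 2^-K,
    rejecting non-discriminative environments up front."""
    ids = [e for e in range(1, max_id + 1) if is_discriminative(e)]
    weights = [2.0 ** -complexity(e) for e in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

rng = random.Random(42)
sample = [sample_environment(100, rng) for _ in range(5)]
```

Filtering before weighting matters: testing time is never spent on environments that could not discriminate between agents in the first place.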
The definition given above is now feasible and stable with respect to varying the number of environments (m) and interactions (ni). But there is no reference to physical time. The universal test so far can be seen as generalising the C-test from passive environments to active environments; time should also be considered in the measurement. Therefore, a reference to time is important. The use of physical time may refer either to the environment or to the agent, since both interact and both of them can be either fast or slow. If we consider how physical time may affect an environment, i.e., the environment’s speed, it is unacceptable to have an interactive test where the agent has to wait several hours after each action in order to see the reward and the observation.
On the other hand, when we generally refer to time when measuring intelligence, especially in non-interactive tests, it is assumed that we are talking about the agent’s speed. Slow agents cannot be considered equal to fast agents.

5.1 TIME AND REWARDS

Time can be considered either as a limit on the agent’s actions or as a component of the final score, and there are many options for incorporating it. Given an overall time τ for an environment, one option is to set a time-out τo for each action (with τo <= τ) such that if the agent does not select an action within that time, reward 0 is given (or a random action is performed). The shorter the time-out, the more difficult the test. An alternative solution would be to set a fixed time-slot τs (instead of a time-out) for each interaction (with τs <= τ). But, again, given an overall time τ, we do not know how many slots we need to generate.
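The time-out option can be illustrated with a toy simulation (all names hypothetical; real response times would be measured, not given as a list): any action slower than τo forfeits its reward and earns 0, which is one of the options described above.

```python
def run_with_timeout(response_times, rewards, tau_o):
    """Simulate one session: response_times[i] is how long the agent took
    to choose action i; rewards[i] is the reward that action would earn.
    Actions slower than the time-out tau_o earn reward 0 instead."""
    earned = []
    for t, r in zip(response_times, rewards):
        earned.append(r if t <= tau_o else 0.0)
    return earned

# A fast agent keeps its rewards; a slow one loses them to the time-out.
fast = run_with_timeout([0.1, 0.2, 0.1], [1.0, -1.0, 1.0], tau_o=0.5)
slow = run_with_timeout([0.9, 0.8, 0.7], [1.0, -1.0, 1.0], tau_o=0.5)
```

Shrinking τo makes the same environment harder for the same agent, which is exactly why the time-out cannot be chosen independently of the agents being compared.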
Considering (randomly chosen) different-length time-slots for several interactions, a quick agent would be able to perform appropriate actions for more interactions than a slow agent with the same potential intelligence. However, it is not easy to tune these time-slots independently from the agent and, in any case, it is not very sensible to make the agent wait for some observations and rewards if we want to make a practical and efficient test.
As a result, if we do not assign time-slots, the rewards obtained in an environment during an overall time τ must necessarily be averaged, otherwise very fast but dull (slightly better than random) agents would perform well. The natural idea is to average by the number of interactions that the agent finally performs in time τ. However, a shrewd policy here would be to act as a fast random agent until the average reward becomes larger than a threshold (this can happen with higher or lower probability depending on the threshold) and then stop acting. For instance, consider an agent that performs one action randomly. If the reward is positive, it stops (no other action is performed). If the reward is negative, it acts fast and randomly until the average reward is positive and then stops. Note that this strategy ensures a positive average reward in balanced environments (with probability 1, although possibly after many actions). Consequently, an agent could get a very good result by very fast (and possibly lucky) first interactions and then rest on its laurels, because the average so far was good.
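The shrewd stop-when-ahead policy is easy to simulate. The environment model here is a hypothetical balanced one (each action earns +1 or −1 with equal probability), just to illustrate the loophole; the step cap is an artificial bound absent from the original argument.

```python
import random

def shrewd_policy_run(rng: random.Random, max_steps: int = 10000) -> float:
    """One run of the stop-when-ahead policy in a balanced +/-1 environment:
    act randomly until the running average reward is positive, then stop.
    Returns the final average reward."""
    total, steps = 0, 0
    while steps < max_steps:
        total += rng.choice((1, -1))
        steps += 1
        if total > 0:          # average reward is now positive: stop acting
            break
    return total / steps

rng = random.Random(0)
runs = [shrewd_policy_run(rng) for _ in range(200)]
share_positive = sum(r > 0 for r in runs) / len(runs)
```

In the vast majority of runs the policy ends with a positive average, despite embodying no intelligence at all, which is why a naive per-interaction average can be gamed.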
The following items summarise the main features of the various new intelligence tests we have introduced:
• The distribution of environments is based on Ktmax (a time-bounded and computable variant of K). There are several reasons for this: we cannot wait indefinitely for the environment, and a computable measure allows us to draw the sample.
• The definition now includes a sample of environments, instead of all environments. The most important constraint to make this sample more discriminative is that the environment must be reward-sensitive.
• In the anytime versions of the test, the complexity of the environments is also progressively adjusted in order to make the test more effective and less dependent on the chosen distribution and preference over simple or complex environments.
• Interactions are not infinite. Rewards are averaged by the number of actions instead of accumulated. This makes the score expectation less dependent on the available test time.
• Time is included. The agent can only play with a single environment for a fixed time. This time limit progressively grows to make the test anytime.
• Rewards and penalties are both included (rewards can range from −1 to 1). Environments are required to be balanced, meaning that a random agent would score 0 in the limit in these environments. Otherwise, a very inept but proactive/quick agent would obtain good results.
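The balance requirement in the last item can be checked empirically. A minimal sketch, assuming a toy environment where rewards are drawn symmetrically from [−1, 1] (a hypothetical model, not the paper’s environment class): a purely random agent’s average reward should converge to 0 as the number of interactions grows.

```python
import random

def random_agent_average(n_steps: int, rng: random.Random) -> float:
    """Average reward of a purely random agent in a balanced environment
    modelled here as symmetric rewards drawn uniformly from [-1, 1]."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_steps)) / n_steps

rng = random.Random(1)
avg = random_agent_average(100_000, rng)   # should be close to 0
```

If this long-run average were biased away from 0, a fast random agent could accumulate score without exhibiting any intelligence, which is precisely what the balance condition rules out.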
These concepts pose a very important challenge, with strong and direct implications in many fields (e.g., artificial intelligence, psychometrics, comparative cognition, and philosophy). A set of tests and, especially, an anytime intelligence test that can be applied to any kind of intelligent system (biological or artificial, fast or slow) were developed. The name “anytime” comes from the idea that we can obtain a rough approximation of the intelligence of an agent in a small amount of time, and much better approximations given more time.
The term also originates from the fact that we introduce time in the definition of intelligence and we also adapt the time scale to the agent’s in order to be able to evaluate very slow and very fast intelligent agents, by also incorporating these times into the measurement. The acceptance and use of these tests could allow new research breakthroughs to take place:
• Progress in AI could be boosted because systems could be evaluated.
• New generations of CAPTCHAs building on the ideas of the anytime intelligence test could be developed.
• Certification procedures could be devised to decide whether an unknown agent can be accepted for a service or a project.
• In the long term, these tests will be necessary to determine when we reach the “Technological Singularity”.
The singularity represents the point at which one intelligent system is capable of constructing another intelligent system of the same intelligence. Much remains to be done on the reliability and optimality of the test. Constructs from Computerized Adaptive Testing and Item Response Theory (IRT) can be adapted here. The relation between speed and intelligence is also an area where further research is needed. It may be possible to develop tests that measure intelligence and speed at the same time, without a batch combination of tests. There is also much theoretical work ahead.
Some of the assumptions made in the definitions could presumably be refined or improved. Theoretical results could be obtained for some of the tests (convergence, optimality, etc.), and expected scores could be proven for different kinds of agents and classes of environments.