Friday 25 April 2014

Machine Learning

What is Machine Learning?
Any study of Machine Learning should begin with a formal definition of what is meant
by Learning. A definition due to Simon (1983) is one of the best:
“Learning denotes changes in the system that are adaptive in the sense that they
enable the system to do the same task (or tasks drawn from a population of
similar tasks) more effectively the next time.”
We can easily extend this definition to our AI systems:
“Machine learning denotes automated changes in an AI system that are adaptive
in the sense that they enable the system to do the same task (or tasks drawn from
a population of similar tasks) more effectively the next time.”
The details of Machine Learning depend on the underlying knowledge representations,
e.g. learning in neural networks will be very different to learning in rule based systems.
The strategies for learning can be classified according to the amount of inference the
system has to perform on its training data. In increasing order we have
1. Rote learning – the new knowledge is implanted directly with no inference at
all, e.g. simple memorisation of past events, or a knowledge engineer’s direct
programming of rules elicited from a human expert into an expert system.
2. Supervised learning – the system is supplied with a set of training examples
consisting of inputs and corresponding outputs, and is required to discover the
relation or mapping between them, e.g. as a series of rules, or a neural network.
3. Unsupervised learning – the system is supplied with a set of training examples
consisting only of inputs and is required to discover for itself what appropriate
outputs should be, e.g. a Kohonen Network or Self Organizing Map.
Early expert systems relied on rote learning, but for modern AI systems we are
generally interested in the supervised learning of various levels of rules.
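At the rote learning end of this spectrum, a sketch is easy to give in Lisp: the system simply caches the results of past problem solving, so that the same task is done more effectively (here, instantly) the next time. The names memoise and *slow-solver* below are hypothetical, for illustration only:

;; Rote learning as memoisation: store each solved case in a hash
;; table, and recall it directly when the same task recurs.
(defun memoise (fn)
  (let ((cache (make-hash-table :test #'equal)))
    (lambda (&rest args)
      (multiple-value-bind (value found) (gethash args cache)
        (if found
            value                               ; learned: recall directly
            (setf (gethash args cache)          ; new: solve it, then store
                  (apply fn args)))))))

(defparameter *slow-solver*
  (memoise (lambda (n) (sleep 1) (* n n))))     ; stands in for expensive inference

;; (funcall *slow-solver* 7) => 49   slow the first time
;; (funcall *slow-solver* 7) => 49   instant: the result was rote-learned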
Intelligent Agents
“An agent is anything that can be viewed as perceiving its environment through sensors
and acting upon that environment through actuators.” (Russell & Norvig, page 32)
It is no accident that the agent–environment interaction diagram is of exactly the same
form as the diagram for the interaction of the human nervous system with its environment.
Agents are comprised of an architecture (e.g. a computer) plus a program that runs on
that architecture. In this module we are primarily interested in designing the programs.
In designing intelligent systems there are four main factors to consider:
P Percepts – the inputs to our system
A Actions – the outputs of our system
G Goals – what the agent is expected to achieve
E Environment – what the agent is interacting with
We shall consider four types of agent system of increasing sophistication:
1. Simple reflex agents
2. Reflex agents with an internal state
3. Goal-based agents
4. Utility-based agents
The Agent Environments
We have talked a lot about intelligent agents, but little about the environments with
which they interact. There are five particularly useful dimensions along which one can
categorize the properties of an agent’s environment:
Accessible vs. Inaccessible – An environment is accessible to an agent if the agent’s
sensory apparatus gives it access to the complete state of the environment. The
agent will not then need to maintain an internal state to keep track of the world.
Deterministic vs. Non-deterministic – The environment is deterministic if its next
state is determined completely by its current state and the actions of its agents. A
deterministic but inaccessible environment may appear non-deterministic to the
agent. Most real environments are so complex that, in practice, they have to be
treated as non-deterministic.
Episodic vs. Non-episodic – In an episodic environment the agent’s experience is
divided into independent episodes, and the quality of an action depends only on the
episode itself, so the agent does not need to think ahead.
Static vs. Dynamic – A static environment does not change while the agent is
deliberating, whereas a dynamic environment can, so the agent must keep checking
the world while it decides on an action.
Discrete vs. Continuous – A discrete environment has a limited/finite number of
distinct, clearly defined percepts and actions.
Importance of Search
The order in which the state space is searched is determined by a search strategy. In this
lecture we shall begin by looking at a number of uninformed (blind) search strategies,
i.e. search strategies that do not use any information about the distance to the goal.
Then we will have an overview of some important informed search strategies.
State Space Representations
The state space is simply the space of all possible states, or configurations, that our
system may be in. Generally, of course, we prefer to work with some convenient
representation of that search space.
There are two components to the representation of state spaces:
State Space Graphs
If the number of possible states of the system is small enough, we can represent all of
them, along with the transitions between them, in a state space graph.
Routes Through State Space
Our general aim is to search for a route, or sequence of transitions, through the state
space graph from our initial state to a goal state.
Sometimes there will be more than one possible goal state. We define a goal test to
determine if a goal state has been achieved.
The solution can be represented as a sequence of link labels (or transitions) on the state
space graph. Note that the labels depend on the direction moved along the link.
Sometimes there may be more than one path to a goal state, and we may want to find
the optimal (best possible) path. We can define link costs and path costs for measuring
the cost of going along a particular path, e.g. the path cost may just equal the number of
links, or could be the sum of individual link costs.
For most realistic problems, the state space graph will be too large for us to hold all of it
explicitly in memory at any one time.
Search Trees
It is helpful to think of the search process as building up a search tree of routes through
the state space graph. The root of the search tree is the search node corresponding to
the initial state. The leaf nodes correspond either to states that have not yet been
expanded, or to states that generated no further nodes when expanded.
At each step, the search algorithm chooses a new unexpanded leaf node to expand. The
different search strategies essentially correspond to the different algorithms one can use
to select which is the next node to be expanded at each stage.




Breadth First Search (BFS)
BFS expands the leaf node with the lowest path cost so far, and keeps going until a goal
node is generated. If the path cost simply equals the number of links, we can implement
this as a simple queue (“first in, first out”).
This is guaranteed to find an optimal path to a goal state. It is memory intensive if the
state space is large. If the typical branching factor is b, and the depth of the shallowest
goal state is d – the space complexity is O(b^d), and the time complexity is O(b^d).

Depth First Search (DFS)
DFS expands the leaf node with the highest path cost so far, and keeps going until a goal
node is generated. If the path cost simply equals the number of links, we can implement
this as a simple stack (“last in, first out”).
This is not guaranteed to find any path to a goal state. It is memory efficient even if the
state space is large. If the typical branching factor is b, and the maximum depth of the
tree is m (possibly ∞) – the space complexity is O(bm), and the time complexity is O(b^m).
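Both strategies can be captured in one sketch, differing only in whether newly generated
paths join the back (a queue, BFS) or the front (a stack, DFS) of the list of unexpanded
paths. The example graph and node names here are illustrative assumptions:

;; A small state space graph as an association list: state -> successors.
(defparameter *graph*
  '((a b c) (b d e) (c f) (d) (e g) (f g) (g)))

(defun successors (state)
  (cdr (assoc state *graph*)))

(defun search-graph (start goal &key (strategy :bfs))
  (let ((paths (list (list start))))            ; each path is newest-state-first
    (loop while paths do
      (let* ((path  (pop paths))
             (state (first path)))
        (when (eql state goal)
          (return (reverse path)))              ; route from start to goal
        (let ((extensions
                (loop for s in (successors state)
                      unless (member s path)    ; avoid simple loops
                        collect (cons s path))))
          (setf paths (if (eq strategy :bfs)
                          (append paths extensions)        ; queue: first in, first out
                          (append extensions paths)))))))) ; stack: last in, first out

;; (search-graph 'a 'g :strategy :bfs) => (A B E G)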
What is Knowledge?
The Chambers 20th Century Dictionary provides as good a definition as any:
knowledge, nol′ij, n. assured belief; that which is known; information; …
In order to solve the complex problems encountered in AI, one generally needs a large
amount of knowledge, and suitable mechanisms for representing and manipulating all
that knowledge.
Knowledge can take many forms. Some simple examples are:
John has an umbrella
It is raining
An umbrella stops you getting wet when it’s raining
An umbrella will only stop you getting wet if it is used properly
Umbrellas are not so useful when it is very windy
So, how should an AI agent store and manipulate knowledge like this?
What is a Knowledge Representation?
The object of a knowledge representation is to express knowledge in a computer
tractable form, so that it can be used to enable our AI agents to perform well.
A knowledge representation language is defined by two aspects:
1. Syntax The syntax of a language defines which configurations of the components
of the language constitute valid sentences.
2. Semantics The semantics defines which facts in the world the sentences refer to,
and hence the statement about the world that each sentence makes.
This is a very general idea, and not restricted to natural language.
Suppose the language is arithmetic, then
‘x’, ‘≥’ and ‘y’ are components (or symbols or words) of the language
the syntax says that ‘x ≥ y’ is a valid sentence in the language, but ‘≥ ≥ x y’ is not
the semantics say that ‘x ≥ y’ is false if y is bigger than x, and true otherwise
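The same split shows up in Lisp itself: the reader defines which expressions are
well formed, and evaluation supplies the semantics. A minimal illustration:

(let ((x 3) (y 2))
  (>= x y))
=> T          ; ‘x ≥ y’ is true here, since y is not bigger than x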
Requirements of a Knowledge Representation
A good knowledge representation system for any particular domain should possess the
following properties:
1. Representational Adequacy – the ability to represent all the different kinds of
knowledge that might be needed in that domain.
2. Inferential Adequacy – the ability to manipulate the representational structures to
derive new structures (corresponding to new knowledge) from existing structures.
3. Inferential Efficiency – the ability to incorporate additional information into the
knowledge structure which can be used to focus the attention of the inference
mechanisms in the most promising directions.
4. Acquisitional Efficiency – the ability to acquire new information easily. Ideally
the agent should be able to control its own knowledge acquisition, but direct
insertion of information by a ‘knowledge engineer’ would be acceptable.
Finding a system that optimises these for all possible domains is not going to be feasible.
First Order Logic as a Knowledge Representation
We can combine sentences by the ‘rules of logic’ to produce new sentences, e.g.
¬man(Chris)
¬man(x) ⇒ woman(x)
woman(Chris)
As a knowledge representation, first order logic has pros and cons:
Advantages
1. It is very expressive.
2. It has unambiguous syntax and semantics.
Disadvantage
1. There is no generally efficient procedure for processing knowledge.
Sources of Uncertainty
There are many potential sources of uncertainty that AI systems (such as expert
systems) must be able to cope with, but most can be attributed to one of:
Imperfect Domain Knowledge
The theory of the domain may be vague or incomplete. Incompleteness
necessitates the use of rules of thumb (or heuristics) which may not always give
optimal or correct results. Even if the domain theory is complete, an expert may
use approximations or heuristics to save time or simplify the problem solving.
Imperfect Case Data
Sensors have only finite resolving power and less than 100% reliability. Human
reports may be ambiguous or inaccurate. Evidence from different sources may be
missing or in conflict. Even if exact data were available, it may be too costly in
time or resources to get it.
Whatever the source of uncertainty, we need our AI systems to be able to deal with it.
Comparing Methods of Inexact Reasoning
We can now compare the main alternative approaches for treating uncertainty:
Bayesian Probability Theory / Bayesian Networks : This is the obvious consistent
approach, but it is often extremely computationally intensive unless we can make good
use of conditional independencies among the variables.
Dempster-Shafer Theory : This allows one to state that certain prior and conditional
probabilities cannot be assessed, and provides the notion of a compatibility relation
between beliefs. It appears to be a consistent generalization of probability theory.
Fuzzy Logic : This is another consistent alternative to standard probability theory.
However, the fuzzification of truth values is inconsistent with the basic idea of
conditional probabilities because of the modified definition of conjunction. It also
involves the non-standard and counter-intuitive property f(A ∨ ¬A) ≠ f(True).
Fuzzy Set Theory
Another useful alternative to probability theory for uncertain reasoning in AI systems is
fuzzy logic. It is based on the idea that many concepts are not sharply defined (e.g. ‘fast’,
‘tall’, ‘hot’) and consequently we cannot use standard set theory and if-then rules when
reasoning with them. Fuzzy logic is built upon the underlying idea of fuzzy set theory.
Classical set theory is based on two-valued logic, in which relations such as X ∈ S are
either true or false. Such classical sets are sometimes called crisp sets. We can define a
crisp set of fast cars as those cars which have a top speed greater than 150mph :
FastCars = { X ∈ Cars : TopSpeed(X) > 150mph }
But the concept of fast car is not really precise like that. It is more reasonable to define it
as a fuzzy set with elements that are members to a certain degree. A fuzzy set is thus
defined as a function from the appropriate domain to the interval [0, 1] such that f(X) = 1
denotes X is definitely a member, f(X) = 0 denotes X is definitely not a member, and other
values denote intermediate degrees of membership.
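A membership function for FastCars might then be sketched as below; the particular
100–180mph ramp is an illustrative assumption. The usual fuzzy connectives (min for
conjunction, max for disjunction, 1−x for negation) also make the earlier point about
f(A ∨ ¬A) ≠ f(True) easy to see:

;; Degree of membership of the fuzzy set FastCars: 0 below 100mph,
;; 1 above 180mph, rising linearly in between.
(defun fast-car-membership (top-speed)
  (cond ((<= top-speed 100) 0.0)
        ((>= top-speed 180) 1.0)
        (t (/ (- top-speed 100) 80.0))))

;; The usual fuzzy logic connectives on membership degrees:
(defun fuzzy-not (a)   (- 1.0 a))
(defun fuzzy-and (a b) (min a b))     ; the modified conjunction
(defun fuzzy-or  (a b) (max a b))

;; (fast-car-membership 140)        => 0.5   fast to degree 0.5
;; (fuzzy-or 0.5 (fuzzy-not 0.5))   => 0.5   not 1.0: f(A ∨ ¬A) ≠ f(True)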
What are Neural Networks?
1. Neural Networks (NNs) are networks of neurons, such as found in real (i.e.
biological) brains.
2. Artificial neurons are crude approximations of the neurons found in brains. They
may be physical devices, or purely mathematical constructs.
3. Artificial Neural Networks (ANNs) are networks of artificial neurons, and hence
constitute crude approximations to parts of real brains. They may be physical
devices, or simulated on conventional computers.
4. From a practical point of view, an ANN is just a parallel computational system
consisting of many simple processing elements connected together in a specific
way in order to perform a particular task.
5. One should never lose sight of how crude the approximations are, and how oversimplified
our ANNs are compared to real brains.
What are Artificial Neural Networks used for?
As with the field of AI in general, there are two basic goals for neural network research:
Brain modelling : The scientific goal of building models of how real brains work.
This can potentially help us understand the nature of human intelligence, formulate
better teaching strategies, or better remedial actions for brain damaged patients.
Artificial System Building : The engineering goal of building efficient systems for
real world applications. This may make machines more powerful, relieve humans
of tedious tasks, and may even improve upon human performance.
Basic Components of Biological Neurons
1. The majority of neurons encode their activations or outputs as a series of brief
electrical pulses (i.e. spikes or action potentials).
2. The neuron’s cell body (soma) processes the incoming activations and converts
them into output activations.
3. The neuron’s nucleus contains the genetic material in the form of DNA. This is
the same as in most types of cells, not just neurons.
4. Dendrites are fibres which emanate from the cell body and provide the receptive
zones that receive activation from other neurons.
5. Axons are fibres acting as transmission lines that send activation to other neurons.
6. The junctions that allow signal transmission between the axons and dendrites are
called synapses. The process of transmission is by diffusion of chemicals called
neurotransmitters across the synaptic cleft.
The basic artificial neuron is the following simplified model of a biological neuron:
1. A set of synapses (i.e. connections) brings in activations from other neurons.
2. The processing unit sums the inputs, and then applies a non-linear activation/
squashing/transfer/threshold function f(x).
3. An output line transmits the result to other neurons.
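This simplified model is small enough to state directly; the weights, bias and step
threshold below are illustrative assumptions:

;; A basic artificial neuron: weighted sum of the inputs plus a bias,
;; passed through a non-linear threshold (step) activation function.
(defun step-function (x)
  (if (>= x 0) 1 0))

(defun neuron (inputs weights bias)
  (step-function (+ bias (reduce #'+ (mapcar #'* inputs weights)))))

;; A two-input neuron computing logical AND:
;; (neuron '(1 1) '(1 1) -1.5) => 1
;; (neuron '(1 0) '(1 1) -1.5) => 0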

Real world applications
Financial modelling – predicting stocks, shares, currency exchange rates
Other time series prediction – climate, weather, airline marketing tactician
Computer games – intelligent agents, backgammon, first person shooters
Control systems – autonomous adaptable robotics, microwave controllers
Pattern recognition – speech recognition, hand-writing recognition, sonar signals
Data analysis – data compression, data mining, PCA, GTM
Noise reduction – function approximation, ECG noise reduction
Bioinformatics – protein secondary structure, DNA sequencing
Inference Rules
We are already familiar with the kind of rules our AI systems may use, e.g.
Deductive Inference Rule
Modus Ponens
Given “A” and “A implies B”, we can
conclude “B”:
A
A ⇒ B
B
Example:
It is raining
If it is raining, the street is wet
The street is wet
Abductive Inference Rule
Abduction
Given “B” and “A implies B”, it might
be reasonable to expect “A”:
B
A ⇒ B
A
Example:
The street is wet
If it is raining, the street is wet
It is raining.

Forward Chaining
Forward chaining or data-driven inference works by repeatedly: starting from the current
state, matching the premises of the rules (the IF parts), and performing the corresponding
actions (the THEN parts) that usually update the knowledge base or working memory.
The process continues until no more rules can be applied, or some cycle limit is met.
The Forward Inference Chain
In this example there are no more rules, so we can draw the inference chain:
This seems simple enough, but in this case we only had a few initial facts, and a few
rules. Generally, things will not be so straightforward.
Disadvantages of Forward Chaining
1. Many rules may be applicable at each stage – so how should we choose which one
to apply next at each stage?
2. The whole process is not directed towards a goal, so how do we know when to
stop applying the rules?
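The cycle itself is compact enough to sketch; the rule format (a list of premises paired
with a conclusion) and the umbrella facts are illustrative assumptions:

;; Each rule pairs a list of premises with a conclusion.  Forward
;; chaining fires every rule whose premises are all in working
;; memory, adding its conclusion, until nothing new can be added.
(defparameter *rules*
  '(((raining have-umbrella) . use-umbrella)
    ((use-umbrella)          . stay-dry)))

(defun forward-chain (facts rules)
  (loop
    (let ((new (loop for (premises . conclusion) in rules
                     when (and (subsetp premises facts)
                               (not (member conclusion facts)))
                       collect conclusion)))
      (if new
          (setf facts (append facts new))
          (return facts)))))

;; (forward-chain '(raining have-umbrella) *rules*)
;; => (RAINING HAVE-UMBRELLA USE-UMBRELLA STAY-DRY)

Note how both disadvantages show up even here: the order in which applicable rules
fire is arbitrary, and the loop only stops when no rule adds anything new.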

Backward Chaining
Backward chaining or goal-driven inference works towards a final state by looking at the
working memory to see if the sub-goal states already exist there. If not, the actions (the THEN
parts) of the rules that will establish the sub-goals are identified, and new sub-goals are set up
for achieving the premises of those rules (the IF parts).

What is an Expert System?
Jackson (1999) provides us with the following definition:
An expert system is a computer program that represents and reasons with knowledge
of some specialist subject with a view to solving problems or giving advice.
To solve expert-level problems, expert systems will need efficient access to a
substantial domain knowledge base, and a reasoning mechanism to apply the
knowledge to the problems they are given. Usually they will also need to be able to
explain, to the users who rely on them, how they have reached their decisions.
They will generally build upon the ideas of knowledge representation, production rules,
search, and so on, that we have already covered.
Often we use an expert system shell which is an existing knowledge independent
framework into which domain knowledge can be inserted to produce a working expert
system. We can thus avoid having to program each new system from scratch.
The Architecture of Expert Systems
The process of building expert systems is often called knowledge engineering. The
knowledge engineer is involved with all components of an expert system.

The Backward Inference Chain
The first part of the chain works back from the goal until only the initial facts are
required, at which point we know how to traverse the chain to achieve the goal state.
Advantage of Backward Chaining

1. The search is goal directed, so we only apply the rules that are necessary to
achieve the goal.
Disadvantage of Backward Chaining
1. A goal has to be known. Fortunately, most AI systems we are interested in can be
formulated in a goal based fashion.
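For comparison with the forward chainer sketched earlier, backward chaining can be
written as a recursive goal test over the same illustrative rule format:

;; A goal holds if it is already a known fact, or if some rule
;; concludes it and all of that rule's premises can themselves be
;; established recursively.
(defun backward-chain (goal facts rules)
  (or (member goal facts)
      (loop for (premises . conclusion) in rules
              thereis (and (eql conclusion goal)
                           (every (lambda (p) (backward-chain p facts rules))
                                  premises)))))

;; (backward-chain 'stay-dry '(raining have-umbrella) *rules*)
;; => true (a non-NIL value); only rules relevant to the goal are tried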


What Exactly is AI?
“Artificial Intelligence (AI) is the part of computer science concerned with designing
intelligent computer systems, that is, systems that exhibit characteristics we associate
with intelligence in human behaviour – understanding language, learning, reasoning,
solving problems, and so on.” (Barr & Feigenbaum, 1981)
Engineering Goal To solve real world problems using AI techniques such as
knowledge representation, learning, rule systems, search, and so on.
Scientific Goal To determine which ideas about knowledge representation, learning,
rule systems, search, and so on, explain various sorts of real intelligence.
AI has roots in a number of older fields: Philosophy, Logic/Computation, Psychology/
Cognitive Science, Biology/Neuroscience, Evolution. Note: Strong AI v. Weak AI.
AI has many sub-fields (such as Neural Networks, Evolutionary Computation, Expert
Systems, Natural Language Processing, Planning, Robotics, Vision), but they employ
common techniques (such as Representation, Learning, Rules, Search).

DEFMACRO

DEFMACRO stands for DEFine MACRO.
The basic skeleton of a DEFMACRO is quite similar to the skeleton of a DEFUN:
(defmacro name (parameter*)
  "Optional documentation string."
  body-form*)
Like a function, a macro consists of a name, a parameter list, an optional documentation
string, and a body of Lisp expressions.
However, the job of a macro isn't to do anything directly – its job is to generate code that
will later do what we want.
(defmacro mac1 (a b) `(+ ,a (* ,b 3)))  
 
(mac1 4 5) =>  19 
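Since the job of mac1 is to generate code, we can look at that generated code directly,
rather than evaluating it, with the standard MACROEXPAND-1 function:

(macroexpand-1 '(mac1 4 5)) => (+ 4 (* 5 3))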
 

DEFUN defines named functions

We can define a named function using the DEFUN form:
> (defun secret-number (the-number)
    (let ((the-secret 37))
      (cond ((= the-number the-secret) 'that-is-the-secret-number)
            ((< the-number the-secret) 'too-low)
            ((> the-number the-secret) 'too-high))))
-> SECRET-NUMBER
The DEFUN form has three arguments:
  1. The name of the function: SECRET-NUMBER,
  2. A list of argument names: (THE-NUMBER), which will be bound to the function's parameters when it is called, and
  3. The body of the function: (LET ...).
car and cdr of a list
car returns the first element of a list.
cdr returns the rest of the list.
E.g.: (car '(1 2 3 4))
-> 1
(cdr '(1 2 3 4))
-> (2 3 4)
Similarly we can have the following:
(car x)    ==  (first x)
(cadr x)   ==  (second x) ==  (car (cdr x))
(caddr x)  ==  (third x)  ==  (car (cdr (cdr x)))
(cadddr x) ==  (fourth x) ==  (car (cdr (cdr (cdr x))))
RPLACA, RPLACD
A list is constructed of CONS cells. Each CONS cell has two parts, a CAR and a CDR. The CAR holds the data for one element of the list, and the CDR holds the CONS that makes up the head of the rest of the list.
By using RPLACA and RPLACD to change the two fields of a CONS cell, we can alter the normal structure of a list. For example, we could splice out the second element of a list like this:
? (defparameter *my-list* (list 1 2 3 4))
*MY-LIST*
? (rplacd *my-list* (cdr (cdr *my-list*)))
(1 3 4)
? *my-list*
(1 3 4)
Not All Comparisons are Equal
Lisp has a core set of comparison functions that work on virtually any kind of object. These are:
  • EQ
  • EQL
  • EQUAL
  • EQUALP
The tests with the shorter names support stricter definitions of equality.
The tests with the longer names implement less restrictive, perhaps more intuitive, definitions of equality.
EQ is true for identical symbols
EQ is true for identical symbols. In fact, it's true for any identical object. In other words,
an object is EQ to itself. Even a composite object, such as a list, is EQ to itself. (But two
lists are not EQ just because they look the same when printed; they must truly be the same
list to be EQ.) Under the covers, EQ just compares the memory addresses of objects.
EQ is not guaranteed to be true for identical characters or numbers.
This is because most Lisp systems don't assign a unique memory address to a particular
number or character; numbers and characters are generally created as needed and stored
temporarily in the hardware registers of the processor.
EQL is also true for identical numbers and characters
EQL retains EQ's notion of equality, and extends it to identical numbers and characters.
Numbers must agree in value and type; thus 0.0 is not EQL to 0. Characters must be truly
identical; EQL is case sensitive.
EQUAL is usually true for things that print the same
EQ and EQL are not generally true for lists that print the same.
Lists that are not EQ but have the same structure will be indistinguishable when printed;
they will also be EQUAL.
Strings are also considered EQUAL if they print the same. Like EQL, the comparison of
characters within strings is case-sensitive.
EQUALP ignores number type and character case
EQUALP is the most permissive of the core comparison functions. Everything that is EQUAL
is also EQUALP. But EQUALP ignores case distinctions between characters, and applies the
(typeless) mathematical concept of equality to numbers; thus 0.0 is EQUALP to 0.
Furthermore, EQUALP is true if corresponding elements are EQUALP in the following composite data types:
  • Arrays
  • Structures
  • Hash Tables
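The differences are easiest to see side by side:

(eq 'foo 'foo)                  => T     identical symbols
(eq (list 1 2) (list 1 2))      => NIL   two distinct lists
(eql 0.0 0)                     => NIL   same value, different types
(eql #\a #\a)                   => T     identical characters
(equal (list 1 2) (list 1 2))   => T     same structure
(equal "Foo" "foo")             => NIL   case sensitive, like EQL
(equalp "Foo" "foo")            => T     EQUALP ignores case
(equalp 0.0 0)                  => T     ... and number type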
Longer tests are slower; know what we're comparing
The generality of the above longer-named tests comes with a price.
They must test the types of their arguments to decide what kind of equality is applicable;
this takes time.
EQ is blind to the type of an object; either the objects are the same object, or they're not. This
kind of test typically compiles into one or two machine instructions and is very fast.
We can avoid unnecessary runtime overhead by using the most restrictive (shortest-
named) test that meets our needs.