Playing with Foursquare API with Python

Wednesday, December 21, 2011

Hi all,

I'd like to share a project that I am developing that may be useful for anyone who wants to create datasets from mobile location networks. Specifically, I developed a Python wrapper for accessing the Foursquare API, called PyFoursquare.

For anyone who doesn't know Foursquare: it is a popular location-based mobile social network with more than 10,000,000 users around the world. The idea is that you share your current location with your friends and, as a result, discover new places, find out where your friends are, and even check tips and recommendations about a place and what to do when you arrive there. It is an amazing project with lots of data available for anyone who wants to develop new apps that connect to its data or mine it (data mining)!

Foursquare Mobile Application

This Python API is one of the results of my master's degree project, in which I proposed a new architecture for mobile recommenders that fetches reviews from social networks to improve the explanation and the quality of the given recommendations. I used this library to collect tips (text reviews) about places in my neighborhood in Recife, Brazil. The original code was a little messy, so I decided to clean it up, organize it and document it in order to publish it for the open-source community.

One of the advantages of this API is that you can handle each entity from the Foursquare data as a model object. Instead of dealing with JSON dictionaries, the library encapsulates the results in the respective models (Venue, Tip, User, etc.), so you can access their attributes as you would with any common Python object!

I was inspired by Joshua's work on Tweepy, which is a Python library for Twitter. In this first release (0.0.1) I implemented only a few endpoints, such as venues/search, venue details and venue tips. In future releases I intend to add more models and support for more of the API methods available at Foursquare.

How can you use it in your project?

It is simple! Just download it from the project's GitHub page, extract the source from the tar.gz and, in the project's directory, run the command below:

$ python setup.py install

or, the easier way, install it with easy_install:

$ easy_install pyfoursquare


After that, you can simply test it by running the command below in your Python shell:

>>> import pyfoursquare


Now let's see how you can get started with PyFoursquare:

First you need to create an application at Foursquare. The link is this. There you can also get further information about the API, other libraries and several applications using the Foursquare APIs.

The Foursquare Developer's Settings


After creating your application, you must get your client_id and your client_secret. These keys are important for connecting your app to the users' accounts. Foursquare uses secure authentication based on OAuth2. With PyFoursquare you won't need to handle all the steps required by OAuth2: the library already encapsulates the handshake between your app and the Foursquare servers. \m/

Below is the code you must write to authenticate a user and connect them to your app:
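Something along these lines should do it (a minimal sketch in the Python 2 style of the time; the OAuthHandler, get_authorization_url and get_access_token names follow the tweepy-style interface PyFoursquare adopts, so double-check them against the project README):

# Minimal sketch of the OAuth2 flow with PyFoursquare (Python 2 era)
import pyfoursquare as foursquare

client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
callback = "YOUR_REGISTERED_CALLBACK_URL"

auth = foursquare.OAuthHandler(client_id, client_secret, callback)

# Redirect the user to the authorization URL
auth_url = auth.get_authorization_url()
print('Please authorize: ' + auth_url)

# Foursquare redirects back to your callback with a code;
# exchange it for an access token and keep it for future requests
code = raw_input('The code: ').strip()
access_token = auth.get_access_token(code)
print('Your access token is ' + access_token)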




After the user has authorized your app, you can instantiate the PyFoursquare API. It will give you access to the Foursquare API methods. I implemented several methods, but feel free to add new ones! Don't forget to submit your additions as pull requests to the project's repository on GitHub.

In this example I fetch a venue by giving as input a latitude and longitude and querying for a place named 'Burburinho'. Burburinho is a popular bar near where I work!

Source code
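A sketch of that lookup (the coordinates below are only illustrative, and venues_search is the tweepy-style method name used by the library; confirm it against the README):

# Instantiate the API with the authorized handler and search for a venue
api = foursquare.API(auth)

# Illustrative coordinates near Recife, Brazil; replace with your own
result = api.venues_search(query='Burburinho', ll='-8.06,-34.87')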




Now you can access the result and work with each Venue as a Python object. All elements of the venue are represented as attributes of the Venue model in PyFoursquare. The goal is to make the developer's life easier when accessing the Foursquare API, by parsing the JSON result and placing it in the correct model.
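For instance (the attribute names here are illustrative and mirror the fields of the venues/search response):

venue = result[0]
print(venue.name)   # e.g. 'Burburinho'
print(venue.id)     # the Foursquare venue id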



I hope you enjoy this API. Feel free to use it in your applications or research! I'd like to thank the Foursquare team for exposing their data by providing these APIs! For data mining researchers interested in mobile location data, it is a gold mine!

Further information about PyFoursquare can be found here.

Feel free to send suggestions, improvements and comments,

Regards,

Marcel Caraciolo

Machine Learning with Python: Meeting TF-IDF for Text Mining

Monday, December 19, 2011

Hi all,

This month I have been studying information retrieval and text mining, especially how to convert the textual representation of information into a Vector Space Model (VSM). The VSM is an algebraic model that represents the importance of a term (tf-idf) or even its absence or presence (bag of words) in a document. I'd like to mention the excellent post about machine learning and text mining with TF-IDF by the researcher Christian Perone at his blog Pyevolve; a great post to read.

I decided to keep this post short and give some examples using Python. I hope that by the end of it you feel comfortable using tf-idf in your text mining tasks.

By the way, I strongly recommend that you check out the scikit-learn machine learning toolkit. There is a whole package for working with text classification, including TF-IDF, in Python!


What is TF-IDF ?

Term Frequency - Inverse Document Frequency is a weighting scheme commonly used in information retrieval tasks. The goal is to model each document as a vector in a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word.

It is composed of two terms: the first is the normalized Term Frequency, i.e. the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency, computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term t appears. In symbols:

tf(t, d) = (number of times the term t appears in d) / (total number of terms in d)

and

idf(t, D) = log( |D| / |{d ∈ D : t appears in d}| )


TF-IDF measures how important a word is to a document in a collection, since it takes into consideration not only the isolated term but also the term within the document collection. The intuition is that a term which occurs frequently in many documents is not a good discriminator (why emphasize a term that is present in almost the entire corpus?), so tf-idf scales down frequent terms while scaling up rare terms; for instance, a term that occurs 10 times more often than another isn't 10 times more important than it.

Computing the TF-IDF weights for each document in the corpus requires a series of steps: 1) tokenize the corpus, 2) model the vector space, and 3) compute the TF-IDF weight for each document in the corpus.

Let's go through each step:


Tokenization


First we need to tokenize the text. For this we can use the NLTK library, a collection of natural language processing algorithms written in Python. Tokenizing the documents in the corpus is a two-step process: first the text is split into sentences, and then the sentences are split into individual words. It is important to notice that several words are not relevant; terms like "the", "is", "at", "on", etc. aren't going to help us, so we ignore them during information extraction. These words are commonly called stop words, and since they are present in almost all documents they are not relevant for us. Portuguese also has stop words, such as "a", "os", "as", "um", "umas", "que", etc.

So considering our example below:


We will tokenize this collection of documents and represent them as the rows of a matrix with shape |D| x F, where |D| is the cardinality of the document space (how many documents we have) and F is the number of features; in our example, the vocabulary size.

So the matrix representation of our vectors above is:



As you may have noticed, these term-frequency (tf) matrices tend to be very sparse (lots of zero elements), so you will usually see them represented as sparse matrices. The code below tokenizes each document in the corpus and computes the term frequencies.
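A minimal sketch with NLTK (the two documents below are just illustrative placeholders for the corpus of tips):

# Tokenize a small corpus and build a raw term-frequency matrix.
# Requires the NLTK data packages: nltk.download('punkt') and nltk.download('stopwords')
import nltk

train_set = ("O chocolate quente e otimo",
             "O pao de queijo e muito bom e o chocolate tambem")

stopwords = set(nltk.corpus.stopwords.words('portuguese'))

def tokenize(document):
    # split into sentences, then into lowercased words, dropping stop words and punctuation
    tokens = []
    for sentence in nltk.sent_tokenize(document):
        for word in nltk.word_tokenize(sentence):
            word = word.lower()
            if word.isalpha() and word not in stopwords:
                tokens.append(word)
    return tokens

tokenized = [tokenize(doc) for doc in train_set]

# vocabulary: one feature (column) per distinct term in the corpus
vocabulary = sorted(set(term for doc in tokenized for term in doc))

# |D| x F matrix of raw term counts
tf_matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]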



Model the Vector Space

Now that each document in the corpus has been tokenized, the next step is to compute the document frequency: for each term, how many documents it appears in. Before going to the IDF, it is important to normalize the term frequencies. Why? Imagine that a term is repeated over and over in a document with the purpose of improving its ranking in an information retrieval system, or simply creating a bias towards long documents, making them look more important than they are just because of the high frequency of the term. By normalizing the TF vector we can overcome this problem.
The code.
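A sketch of the L2 (Euclidean) normalization of each term-frequency row, reusing the tf_matrix built above:

import math

def normalize_tf(tf_row):
    # divide each count by the Euclidean norm of the row
    norm = math.sqrt(sum(count ** 2 for count in tf_row))
    if norm == 0.0:
        return tf_row
    return [count / norm for count in tf_row]

tf_normalized = [normalize_tf(row) for row in tf_matrix]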



Compute the TF-IDF

Now that you have seen how the vector normalization is applied, we have to compute the second term of tf-idf: the inverse document frequency. The code is provided below:
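A sketch, reusing the vocabulary and tokenized documents from above:

import math

def inverse_document_frequency(vocabulary, tokenized_docs):
    # idf(t) = log(|D| / number of documents containing t)
    n_docs = len(tokenized_docs)
    idf = []
    for term in vocabulary:
        df = sum(1 for doc in tokenized_docs if term in doc)
        idf.append(math.log(float(n_docs) / df))
    return idf

idf_vector = inverse_document_frequency(vocabulary, tokenized)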




TF-IDF is the product of TF and IDF. A high tf-idf weight is reached when the term has a high frequency (tf) in the given document and a low document frequency in the whole collection. Now let's see the tf-idf computed for each term present in the vector space.

The code.
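A sketch that multiplies the normalized term frequencies by the corresponding idf values:

def tf_idf(tf_normalized, idf_vector):
    # element-wise product of normalized tf and idf for each document
    return [[tf * idf for tf, idf in zip(row, idf_vector)]
            for row in tf_normalized]

tfidf_matrix = tf_idf(tf_normalized, idf_vector)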



Putting everything together, the following code will compute the TF-IDF weights for each document, and the resulting matrix will be:
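A sketch that chains the helpers from the previous snippets into a single function:

def compute_tfidf_matrix(documents):
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted(set(term for doc in tokenized for term in doc))
    tf_matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    tf_normalized = [normalize_tf(row) for row in tf_matrix]
    idf_vector = inverse_document_frequency(vocabulary, tokenized)
    return vocabulary, tf_idf(tf_normalized, idf_vector)

vocabulary, tfidf_matrix = compute_tfidf_matrix(train_set)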




A row of this matrix would be:



I omitted the zero-valued elements of the row.

If we check the most relevant words for this place using tf-idf, we can see that the place has a great hot chocolate (0.420955 - "chocolate quente ótimo"), the nega maluca (a chocolate cake) is also delicious (0.315716 - "nega maluca uma delicia"), and its cheese bun is quite good as well (0.252573 - "pao de queijo muito bom").

And that is how we compute our tf-idf matrix. You can take a look at this link and this one to learn how to do the same with Gensim and scikit-learn, respectively.

That's all! I hope you enjoyed this article and that it helps more people learn how to use the tf-idf weighting to mine their collections of texts. Feel free to comment and make suggestions.

The source code of this example is also available.

Regards,

Marcel Caraciolo

Announcing a Scientific Computing with Python Course!

Wednesday, December 7, 2011

Hi all,

I am announcing the launch of the website PyCursos. PyCursos is an online course and training platform for anyone who wants to learn the Python programming language and its related extensions. The first course, Scientific Computing Programming with Python, has already been announced, with me as the teacher.




The goal of the course is to teach scientific computing, especially how to solve scientific problems in your daily routine by using the packages that Python provides for free: SciPy, NumPy and Matplotlib.

With those tools, students will learn how to turn their problems into simple and legible code and to use the helpful tools for plotting graphs, writing reports, mathematical optimization, matrix manipulation, linear algebra and more.

The only requirements to attend the course are that the student be motivated to learn and have some experience with programming. The course will start in January 2012 in online mode, where students will follow a schedule of online video classes and review exercises regularly.

We also offer the option of in-company training, where students may watch the classes in a classroom with other students. In both modes, students will receive a certificate of completion at the end of the course.

It is important to mention that, for now, the course is entirely in Portuguese! Sorry to everyone from other countries!


For further information please visit our website : http://www.pycursos.com

Anyone who wants to know more about scientific computing with Python can check out these slides from a keynote that I gave at some institutions here in Recife, Pernambuco, Brazil.






Regards,

Marcel Caraciolo

Review of the book Numpy 1.5 - Beginner's Guide

Saturday, November 26, 2011

Hi all,

I'd like to share my review of the book NumPy 1.5 Beginner's Guide by Ivan Idris, one of the latest books in a series of manuals covering scientific computing libraries written in Python. This book covers the NumPy library for manipulating vectors and matrices and its support for mathematical operations.

Numpy 1.5  from Packt Publisher

Quick Review

The book is a great and useful resource for anyone who wants to explore the NumPy scientific library further, since it covers almost all of the modules available in NumPy 1.5. It comes with several examples, especially for financial researchers and developers who work with financial data; the author explores several modules using stocks and historical price data. The author explains each function or operation with code and the expected results, so the reader can follow precisely what's happening as the modules are presented. One of the strengths of the book is how it is organized: a step-by-step guide when presenting complex NumPy functions, for example the add.reduceat, add.accumulate and add.reduce operators.
The part I didn't like was the exercises, which were quite simple. I'd like to see deeper exercises exploring the resources given in the book, and I missed more information about NaN values. I also didn't see information about the functions squeeze and choose, or about more complex structured arrays (arrays with tuples, etc.).

To sum up, I recommend this book for anyone wishing to learn about scientific computing with Python using the mathematical library NumPy, which is a great (and free!) alternative to Matlab, Mathematica and other packages. I expect a book covering the SciPy library quite soon as well! By the way, finance fans will love this book, since almost the entire book uses examples with financial data!

Review


The book starts with a step-by-step installation process of NumPy, as well as a little introduction about what NumPy is, its history, etc. I'd like to mention that, even with all the platforms covered in the book, NumPy is not as easy to install on Mac OS as the book suggests. The problem is that developers generally don't use the built-in Python that comes with the Mac, since it is outdated (my Snow Leopard comes with Python 2.6.1). It is when you install a new Python that the problems come: several compilation errors, messages you can't understand, etc. But if you use MacPorts, you will be free of all these errors! (After all the nightmare of the installation, I discovered MacPorts :P)

Chapters 2-4 present the NumPy fundamentals, covering array manipulation and the most commonly used operations. The book follows a cyclic process, where each function the author presents goes through an introduction to the problem to solve, the actions (how you can solve it with NumPy), auxiliary NumPy functions and operations, and finally "what just happened", that is, an explanation of what was done after showing the solution. Most of the examples in the book use financial data and stock market values; an interesting choice, since the same examples are carried through the chapters in a progressive and logical way. Having each function and NumPy feature described and explained makes the book a good reference guide for someone using the library. There were minor issues related to the imports: the author doesn't mention the imports in some examples, for instance the numpy.loadtxt function when he uses the datetime module. For a beginner studying Python for the first time, it may be harder to follow the examples, since one cannot always tell where the functions or modules come from.

The second part of the book covers matrices, universal functions, some SciPy modules, the use of Matplotlib, and testing. Chapter 5 covers the matrix module and universal functions such as add, divide, prod, sum and so on. I missed some functions that weren't covered in this chapter, such as numpy.choose or numpy.squeeze. I believe the author forgot or didn't have space to mention these specific functions, but it does not hurt the quality of the book at all. The chapter I liked the most was the one about testing. Several developers, especially scientific researchers, are not used to testing their code, so I believe it is a great chapter for anyone who wants to assure quality and avoid future bugs using the NumPy testing modules. The chapter should be bigger and include more examples, even creating test cases and tips for scientific developers.
Finally, the last two chapters focus on plotting and SciPy integration. I think the plotting chapter should be at the beginning of the book, because the author already uses lots of examples with Matplotlib in the previous chapters and only at the end explains the library further. The chapter is well written and gives you sufficient content to get started with Matplotlib. The last chapter covers the use of several SciPy functions, but it does not give deeper explanations of how they work, as he did in the previous chapters with NumPy. However, it gives several useful examples of working with integration, image processing and even optimization. Many developers will enjoy this extra chapter covering the use of SciPy + NumPy.


Conclusions

My overall impression of this book is that it makes a useful reference guide for NumPy. For financial researchers and developers it will be a great book, since it also uses lots of examples with financial data to present the NumPy fundamentals. There were minor issues related to SciPy and Matplotlib that should have been explained further. For anyone who wants to start using NumPy it can be an excellent book to begin with, since it covers all the fundamental steps with a cyclic, progressive introduction to using the scientific packages in Python.

Regards,

Marcel Caraciolo

Machine Learning with Python - Logistic Regression

Sunday, November 6, 2011

 Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets for anyone to use with real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I have learned. The best part is that it will include examples with Python, NumPy and SciPy. I hope you enjoy all these posts!

The series:


In this post I will cover the Logistic Regression and Regularization.


Logistic Regression


Logistic regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logit function (the logistic function). Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person will have a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used in several scenarios, such as predicting a customer's propensity to purchase a product or cancel a subscription in marketing applications, and many others.


Visualizing the Data



Let's explain logistic regression by example. Consider that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on the two exams and the admission decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot, as shown below. The axes are the two exam scores, and the positive and negative examples are shown with different markers.

Sample training visualization

The code
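A sketch of the plotting step, assuming a comma-separated file with the two exam scores and the 0/1 admission label (the file name is illustrative):

import numpy as np
import matplotlib.pyplot as plt

# load the training set: exam 1 score, exam 2 score, admission label (0/1)
data = np.loadtxt('ex2data1.txt', delimiter=',')
X, y = data[:, :2], data[:, 2]

pos, neg = (y == 1), (y == 0)
plt.scatter(X[pos, 0], X[pos, 1], marker='+', label='Admitted')
plt.scatter(X[neg, 0], X[neg, 1], marker='o', label='Not admitted')
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()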


Cost Function and Gradient



The logistic regression hypothesis is defined as:

hθ(x) = g(θᵀx)

where the function g is the sigmoid function, defined as:

g(z) = 1 / (1 + e^(-z))

The sigmoid function has the special property that its values lie in the range [0, 1]: for large positive values of z the sigmoid is close to 1, while for large negative values it is close to 0.

Sigmoid Logistic Function 

The cost function for logistic regression is:

J(θ) = (1/m) Σ [ -y log(hθ(x)) - (1 - y) log(1 - hθ(x)) ]     (summing over the m training examples)

and the gradient of the cost is a vector of the same length as θ, where the j-th element is defined as follows:

∂J(θ)/∂θj = (1/m) Σ (hθ(x) - y) xj


You may note that this gradient is quite similar to the linear regression gradient; the difference is that linear and logistic regression have different definitions of hθ(x).

Let's see the code:
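A minimal NumPy sketch of the sigmoid, the cost and the gradient (it assumes X already has a column of ones for the intercept term):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    # J(theta) for logistic regression
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h)))

def compute_grad(theta, X, y):
    # partial derivative of J(theta) with respect to each theta_j
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * X.T.dot(h - y)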






Now, to find the minimum of this cost function, we will use a SciPy built-in function called fmin_bfgs. It will find the best parameters θ for the logistic regression cost function, given a fixed dataset (of X and y values).
The parameters you pass in are (a sketch of the call follows the list):
  • The initial values of the parameters you are trying to optimize;
  • A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X,y).
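A sketch of the call, reusing the data and functions from the snippets above:

from scipy.optimize import fmin_bfgs

# add the intercept column and start from theta = 0
m = X.shape[0]
X_ext = np.column_stack([np.ones(m), X])
initial_theta = np.zeros(X_ext.shape[1])

theta = fmin_bfgs(compute_cost, initial_theta,
                  fprime=compute_grad, args=(X_ext, y))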

The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the figure below.






Evaluating logistic regression



Now that you have learned the parameters of the model, you can use it to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But you can go further and evaluate the quality of the parameters we have found, by checking how well the learned model predicts on our training set. If we apply a threshold of 0.5 to our sigmoid logistic function, we can consider that we:

predict y = 1 if hθ(x) ≥ 0.5, and y = 0 otherwise,

where 1 represents admitted and 0 not admitted.

Going to the code and calculating the training accuracy of our classifier, we can evaluate the percentage of examples it got correct. Source code.
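A sketch, reusing sigmoid, X_ext, y and the learned theta from above:

# predict with the learned theta and measure the training accuracy
predictions = sigmoid(X_ext.dot(theta)) >= 0.5
accuracy = 100.0 * np.mean(predictions == (y == 1))
print('Train accuracy: %.2f%%' % accuracy)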



89%, not bad, huh?!



Regularized logistic regression



But what about when your data cannot be separated into positive and negative examples by a straight line through the plot? Since our logistic regression will only be able to find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.





Visualizing the data



Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
Microchip training set

You may see that a model built for this task might predict the training data perfectly, and that can sometimes cause trouble: just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.





Feature mapping



One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.


As a result of this mapping, our vector of two features (the scores on the two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2D plot.


Although the feature mapping allows us to build a more expressive classifier, it also makes the classifier more susceptible to overfitting. That is where regularized logistic regression comes in, to fit the data while avoiding the overfitting problem.


Source code.
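A sketch of the polynomial feature mapping (the returned matrix includes the bias column, which gives the 28 features mentioned above):

import numpy as np

def map_feature(x1, x2, degree=6):
    # map two features to all polynomial terms of x1 and x2 up to 'degree'
    x1, x2 = np.atleast_1d(x1), np.atleast_1d(x2)
    columns = [np.ones(x1.size)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append((x1 ** (i - j)) * (x2 ** j))
    return np.column_stack(columns)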


Cost function and gradient


The regularized cost function in logistic regression is:

J(θ) = (1/m) Σ [ -y log(hθ(x)) - (1 - y) log(1 - hθ(x)) ] + (λ/2m) Σ(j=1..n) θj²

Note that you should not regularize the parameter θ0, so the final summation runs over j = 1 to n, not j = 0 to n. The gradient of the cost function is a vector where the j-th element is defined as follows:

∂J(θ)/∂θ0 = (1/m) Σ (hθ(x) - y) x0                       (for j = 0)
∂J(θ)/∂θj = (1/m) Σ (hθ(x) - y) xj + (λ/m) θj            (for j ≥ 1)

Now let's learn the optimal parameters θ. Using these new functions together with the SciPy optimization routine from before, we will be able to learn the parameters θ.

The complete code is provided here (code).
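A sketch of the regularized cost and gradient plus the optimization call; it reuses sigmoid and map_feature from above, the file name is illustrative, and lambda_ = 1.0 is just an example choice:

from scipy.optimize import fmin_bfgs

def cost_reg(theta, X, y, lambda_):
    m = y.size
    h = sigmoid(X.dot(theta))
    reg = (lambda_ / (2.0 * m)) * np.sum(theta[1:] ** 2)   # skip theta_0
    return (1.0 / m) * (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) + reg

def grad_reg(theta, X, y, lambda_):
    m = y.size
    h = sigmoid(X.dot(theta))
    grad = (1.0 / m) * X.T.dot(h - y)
    grad[1:] += (lambda_ / m) * theta[1:]                  # do not regularize theta_0
    return grad

# the two microchip test scores and the accept/reject label
data = np.loadtxt('ex2data2.txt', delimiter=',')
X, y = data[:, :2], data[:, 2]

X_mapped = map_feature(X[:, 0], X[:, 1])
initial_theta = np.zeros(X_mapped.shape[1])
lambda_ = 1.0
theta = fmin_bfgs(cost_reg, initial_theta, fprime=grad_reg,
                  args=(X_mapped, y, lambda_))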







Plotting the decision boundary



Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples. 

Decision Boundary


As you can see, our model successfully classified the training data with an accuracy of 83.05%.

Code
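A sketch of the boundary plot: evaluate θᵀ·map_feature(u, v) over a grid and draw the zero-level contour (it reuses map_feature and the learned theta from above; the grid range is chosen to cover the test scores):

import numpy as np
import matplotlib.pyplot as plt

u = np.linspace(-1, 1.5, 50)
v = np.linspace(-1, 1.5, 50)
z = np.zeros((u.size, v.size))
for i in range(u.size):
    for j in range(v.size):
        z[i, j] = map_feature(u[i], v[j]).dot(theta)[0]

# the decision boundary is where theta' * mapped features equals zero
plt.contour(u, v, z.T, levels=[0])
plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.show()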





Scikit-learn



Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It uses Python, NumPy and SciPy, and it is open source!

If you want to use logistic regression or linear regression, you should consider scikit-learn. It has several examples and several types of regularization strategies to work with. Take a look at this link and see for yourself! I recommend it!




Conclusions



Logistic regression has several advantages over linear regression for this kind of problem; in particular, it is more robust and does not assume a linear relationship, since it can handle nonlinear effects. However, it requires much more data to achieve stable, meaningful results. There are other machine learning techniques for handling non-linear problems, and we will see them in the next posts. I hope you enjoyed this article!


All the source code from this article is here.

Regards,

Marcel Caraciolo


Google AI Challenge this year is open! Ants Battlefield!

Friday, November 4, 2011

Hi all,

Once again, Google and the University of Waterloo's computer science club have launched an Artificial Intelligence challenge. This year the task is to write a program to compete in the Ants multiplayer challenge. The goal is to seek and destroy the most enemy ant hills while defending your own hills. You must create a bot that plays the game of Ants as intelligently as possible. The contest supports Python, Java, C# and C++, among other languages.

The contest is currently open, and you can submit your entry until December 18th. Afterwards, there will be a final tournament between the contestants to decide the ultimate winner!

See the game in action in the video below.





It is a great opportunity to learn Artificial Intelligence and test your skills in programming, machine learning and logical reasoning!

Regards,

Marcel Caraciolo  


Machine Learning with Python - Linear Regression

Thursday, October 27, 2011

Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets for anyone to use with real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I have learned. The best part is that it will include examples with Python, NumPy and SciPy. I hope you enjoy all these posts!


Linear Regression

In this post I will implement linear regression and get to see it work on data. Linear regression is the oldest and most widely used predictive model in the field of machine learning. The goal is to minimize the sum of the squared errors when fitting a straight line to a set of data points. (You can find further information at Wikipedia.)

The linear regression model fits a linear function to a set of data points. The form of the function is:

Y = β0 + β1*X1 + β2*X2 + … + βn*Xn


where Y is the target variable, X1, X2, ..., Xn are the predictor variables, and β1, β2, ..., βn are the coefficients that multiply the predictor variables. β0 is a constant.

For example, suppose you are the CEO of a big shoe franchise company and you are considering different cities for opening a new store. The chain already has stores in various cities, and you have data on profits and populations for those cities. You would like to use this data to help you select which city to expand to next. You could use linear regression to evaluate the parameters of a function that predicts the profit of the new store.

The final function would be:

                                                         Y =   -3.63029144  + 1.16636235 * X1


There are two main settings for linear regression: with one variable and with multiple variables. Let's see both!

Linear regression with one variable

Considering our last example, we have a file that contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of having a store in that city. A negative value for profit indicates a loss.

Before starting, it is useful to understand the data by visualizing it. We will use a scatter plot, since the data has only two properties to plot (profit and population). Many other problems in real life are multi-dimensional and can't be shown on a 2D plot.
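A sketch of loading and plotting the data (the file name is illustrative):

import numpy as np
import matplotlib.pyplot as plt

# first column: city population, second column: profit of a store in that city
data = np.loadtxt('ex1data1.txt', delimiter=',')
population, profit = data[:, 0], data[:, 1]

plt.scatter(population, profit, marker='x')
plt.xlabel('Population of city')
plt.ylabel('Profit')
plt.show()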



If you run the code above (you must have the Matplotlib package installed in order to display the plots), you will see the scatter plot of the data shown in Figure 1.

Figure 1: Scatter plot of the training data
Now you must fit the linear regression parameters to our dataset using gradient descent. The objective of linear regression is to minimize the cost function:

J(θ) = (1/2m) Σ (hθ(x) - y)²     (summing over the m training examples)

where the hypothesis hθ(x) is given by the linear model:

hθ(x) = θᵀx = θ0 + θ1·x1

The parameters of your model are the θ values. These are the values you will adjust to minimize the cost J(θ). One way to do this is to use the batch gradient descent algorithm, where each iteration performs the update:

θj := θj - α (1/m) Σ (hθ(x) - y) xj     (simultaneously for all j)

With each step of gradient descent, your parameters θ come closer to the optimal values that achieve the lowest cost J(θ).

For our initial inputs, we start with the fitting parameters θ initialized to zero, our data, and an extra column added to the data to accommodate the θ0 intercept term. We also set the learning rate α to 0.01.
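A sketch of that setup, reusing the population and profit arrays loaded above (the number of iterations is just an illustrative choice):

m = population.size

# add a column of ones to accommodate the theta_0 intercept term
X = np.column_stack([np.ones(m), population])
y = profit

theta = np.zeros(2)    # initial fitting parameters
alpha = 0.01           # learning rate
iterations = 1500      # illustrative number of gradient descent steps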



As you perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost at each step. The cost function is shown below:
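A minimal NumPy sketch of the cost function and the batch gradient descent loop:

def compute_cost(X, y, theta):
    # J(theta) = (1 / 2m) * sum((X.theta - y)^2)
    m = y.size
    errors = X.dot(theta) - y
    return errors.dot(errors) / (2.0 * m)

def gradient_descent(X, y, theta, alpha, iterations):
    # batch gradient descent, recording the cost at every step
    m = y.size
    cost_history = []
    for _ in range(iterations):
        theta = theta - (alpha / m) * X.T.dot(X.dot(theta) - y)
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history

theta, cost_history = gradient_descent(X, y, theta, alpha, iterations)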


A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. It should converge to a steady value by the end of the algorithm.

Your final values for θ will then be used to make predictions on profits in areas of 35,000 and 70,000 people. For that we will use some matrix algebra functions from SciPy and NumPy, powerful Python packages for scientific computing.

Our final values are shown below:


                                                         Y =   -3.63029144  + 1.16636235 * X1
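A sketch of the prediction step; it assumes, as in the original dataset, that populations are expressed in units of 10,000 people and profits in units of $10,000:

# predict profit for populations of 35,000 and 70,000 people
predict1 = np.array([1, 3.5]).dot(theta) * 10000
predict2 = np.array([1, 7.0]).dot(theta) * 10000
print('Profit for an area of 35,000 people: %.2f' % predict1)
print('Profit for an area of 70,000 people: %.2f' % predict2)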


Now you can use this function to predict your profits! If you plot the fitted line together with our data, you get the following plot:



Another interesting plot is the contour plot; it shows how J(θ) varies with changes in θ0 and θ1. The cost function J(θ) is bowl-shaped and has a global minimum, as you can see in the figure below.



This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

All the code is shown here.



Linear regression with multiple variables



OK, but what about when you have multiple variables? How do we work with them using linear regression? That is where linear regression with multiple variables comes in. Let's see an example:

Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

Our training set of housing prices in Recife, Pernambuco, Brazil has three columns (three variables). The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

But before going directly to linear regression, it is important to analyze our data. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, it is important to perform feature scaling, which can make gradient descent converge much more quickly.



The basic steps are:

  • Subtract the mean value of each feature from the dataset.
  • After subtracting the mean, additionally scale (divide) the feature values by their respective “standard deviations.”

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within ±2 standard deviations of the mean); this is an alternative to taking the range of values (max-min).
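A sketch of those two steps:

import numpy as np

def feature_normalize(X):
    # subtract the mean of each feature and divide by its standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma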

Now that you have your data scaled, you can implement the gradient descent and the cost function.

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.


In the multivariate case, the cost function can also be written in the following vectorized form:

J(θ) = (1/2m) (Xθ - y)ᵀ (Xθ - y)
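A sketch of the vectorized cost and update, reusing feature_normalize from above (the file name and the alpha/iteration settings are illustrative):

# load the housing data: size, number of bedrooms, price
data = np.loadtxt('housing_prices.txt', delimiter=',')
X_raw, y = data[:, :2], data[:, 2]

X_norm, mu, sigma = feature_normalize(X_raw)
X = np.column_stack([np.ones(y.size), X_norm])

def compute_cost_multi(X, y, theta):
    errors = X.dot(theta) - y
    return errors.dot(errors) / (2.0 * y.size)

theta = np.zeros(3)
alpha, iterations = 0.01, 400
for _ in range(iterations):
    theta = theta - (alpha / y.size) * X.T.dot(X.dot(theta) - y)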

After running our code, we end up with the following values of θ:

             215810.61679138,   61446.18781361,   20070.13313796


The gradient descent will run until convergence to find the final values of θ. Next, we will use this value of θ to predict the price of a house with 1650 square feet and 3 bedrooms.

θ := θ - α (1/m) Xᵀ (Xθ - y)
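A sketch of that prediction; the new example has to be normalized with the same mu and sigma used for training:

# predict the price of a 1650 sq-ft, 3-bedroom house
example = np.array([1, (1650 - mu[0]) / sigma[0], (3 - mu[1]) / sigma[1]])
predicted_price = example.dot(theta)
print('Predicted price of a 1650 sq-ft, 3 br house: %f' % predicted_price)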



Predicted price of a 1650 sq-ft, 3 br house: 183865.197988


If you plot the convergence of gradient descent, you will see the cost J(θ) decreasing as the number of iterations grows, until it settles at a steady value.



The code for linear regression with multi variables is available here.

Extra Notes


The SciPy package comes with several tools to help you with this task, including a function with linear regression already implemented for you to use!

The function is scipy.stats.linregress, which fits the least-squares regression line for you directly. Check more about it here.
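A quick sketch with the univariate data from earlier:

from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(population, profit)
print('Y = %f + %f * X1' % (intercept, slope))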

Conclusions


The goal of regression is to determine the values of the β parameters that minimize the sum of the squared residuals (the differences between predicted and observed values) for the set of observations. Since linear regression is restricted to fitting linear (straight line/plane) functions to data, it is not as well suited to complex real-world data as more general techniques such as neural networks, which can model non-linear functions. But linear regression has some interesting advantages:


  • Linear regression is the most widely used method, and it is well understood.
  • Training a linear regression model is usually much faster than methods such as neural networks.
  • Linear regression models are simple and require minimum memory to implement, so they work well on embedded controllers that have limited memory space.
  • By examining the magnitude and sign of the regression coefficients (β) you can infer how predictor variables affect the target outcome.
  • It is one of the simplest algorithms and is available in several packages, even Microsoft Excel!



I hope you enjoyed this simple post, and in the next one I will explore another field of machine learning with Python! You can download the code at this link.


Marcel Caraciolo