Samuel M.H.'s technological blog

Friday, February 6, 2015

PageRank + Sparse Matrices + Python (IPython Notebook)

Notebook

PageRank

AUTHOR: Samuel M.H. (samuel.mh@gmail.com)
DATE: 6-Feb-2015
DESCRIPTION: 
    A small study on how to implement efficiently the PageRank algorithm using Python.

Dataset

The chosen dataset is a directed graph where nodes represent pages from Stanford University (stanford.edu) and directed edges represent hyperlinks between them. This representation is suitable to be transformed into the adjacency matrix required for the PageRank algorithm.

Information:
  • URL: http://snap.stanford.edu/data/web-Stanford.html
  • Year: 2002
  • Nodes: 281,903 (numeration starts at 1)
  • Edges: 2,312,497
  • Format: id of origin (tab) id of destination. Each line represents a hyperlink; data start at line 5 and there is no tuple duplication.
  • Size (uncompressed): 31.4 MB

In [1]:
DATASET_ORIGINAL = "web-Stanford.txt" #Uncompress the .gz file!
NODES = 281903
EDGES = 2312497
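
To double-check this layout, a quick peek at the file header can be done like this (a minimal sketch; it simply prints the first lines of the uncompressed file):

# Print the first 5 lines: the 4 header lines plus the first edge tuple.
with open(DATASET_ORIGINAL, 'r') as f:
    for _ in range(5):
        print f.readline().rstrip()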

First thoughts

We need to load the dataset into memory as an adjacency matrix: a square matrix with a boolean value in each cell. NumPy implements this as a dense matrix where each value takes one byte. Let's see how much memory is needed.
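
The missing calculation is easy to reproduce; here is a minimal sketch (one byte per cell of a dense boolean NumPy matrix):

# A dense boolean adjacency matrix needs NODES x NODES cells, 1 byte each.
print "The size in memory of the dense matrix is {0:.1f} GB".format(
    NODES**2 / (1024.0**3)
)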

More than 74GB to store the adjacency matrix!! It doesn't fit in the RAM of my laptop.


Sparse Matrices

It seems reasonable that, if the dataset size is 31.4 MB, there must be a representation of it that fits into memory. The key is that a single node is, on average, connected to only a few other nodes (2,312,497 edges / 281,903 nodes ≈ 8 links per node). This means most cells hold a false value (no hyperlink). Since false values can be seen as zeroes, we say the matrix is sparse. A nice property of sparse matrices is that their representation is much smaller than their dense equivalent.

There are several libraries to deal with sparse matrices in Python, but the one I've chosen is scipy.sparse.

In [2]:
import numpy as np
from  scipy import sparse

Loading the dataset

The goal is to store the adjacency matrix as a Compressed Sparse Row (CSR) matrix. According to the CSR matrix documentation:

Advantages of the CSR format:
  • Efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
  • Efficient row slicing
  • Fast matrix-vector products

So, the efficient row slicing and the fast matrix-vector products are just what we need to apply the power method and compute the PageRank values.

Disadvantages of the CSR format:
  • Slow column slicing operations (consider CSC)
  • Changes to the sparsity structure are expensive (consider LIL or DOK)

If we build an empty CSR matrix and try to modify its values, we will get a warning like this:
SparseEfficiencyWarning: 
    Changing the sparsity structure of a csr_matrix is expensive. 
    lil_matrix is more efficient. 
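
For instance, a snippet like this (not in the original notebook) reproduces the warning:

# Creating an empty CSR matrix and then setting a cell changes its
# sparsity structure, which triggers the SparseEfficiencyWarning.
empty_csr = sparse.csr_matrix((NODES, NODES), dtype=np.bool)
empty_csr[0, 0] = True  # emits SparseEfficiencyWarning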

The first solution could be using a Dictionary Of Keys (DOK) based sparse matrix and then converting it to a CSR matrix, but it turns out to be slow.

In [3]:
def dataset2dok():
    '''Load the edge list into a DOK sparse matrix, then convert it to CSR.'''
    with open(DATASET_ORIGINAL,'r') as f:
        dokm = sparse.dok_matrix((NODES,NODES),dtype=np.bool)
        for line in f.readlines()[4:]:  # Data start at line 5
            origin, destiny = (int(x)-1 for x in line.split())  # 0-based ids
            dokm[destiny,origin] = True  # Row = destination, column = origin
    return(dokm.tocsr())

%time dok_m = dataset2dok()
CPU times: user 34.6 s, sys: 256 ms, total: 34.9 s
Wall time: 34.8 s

Another solution could be building the Compressed Sparse Row matrix directly from its coordinate arrays instead of modifying the cells one by one.

In [4]:
def dataset2csr():
    '''Build the CSR matrix directly from its (row, col) coordinate lists.'''
    row = []
    col = []
    with open(DATASET_ORIGINAL,'r') as f:
        for line in f.readlines()[4:]:  # Data start at line 5
            origin, destiny = (int(x)-1 for x in line.split())  # 0-based ids
            row.append(destiny)
            col.append(origin)
    return(sparse.csr_matrix(([True]*EDGES,(row,col)),shape=(NODES,NODES)))

%time csr_m = dataset2csr()
CPU times: user 5.7 s, sys: 60 ms, total: 5.76 s
Wall time: 5.74 s

How much memory does it take to store the adjacency (sparse) matrix?

According to the documentation, a CSR matrix is defined by:
  • dtype: data type of the matrix. Not really needed.
  • shape: 2-tuple shape of the matrix.
  • ndim: number of dimensions (this is always 2). Not really needed.
  • data: CSR format data array of the matrix.
  • indices: CSR format index array of the matrix.
  • indptr: CSR format index pointer array of the matrix.
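
To make the data, indices and indptr arrays concrete, here is a tiny illustrative example (not from the original notebook):

# A small 3x3 matrix to illustrate the three CSR arrays.
tiny = sparse.csr_matrix(np.array([[1, 0, 2],
                                   [0, 0, 3],
                                   [4, 5, 6]]))
print tiny.data     # [1 2 3 4 5 6] -> non-zero values, row by row
print tiny.indices  # [0 2 2 0 1 2] -> column index of each value
print tiny.indptr   # [0 2 3 6]     -> where each row starts in data/indices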

In [5]:
import sys

print "The size in memory of the adjacency matrix is {0} MB".format(
    (sys.getsizeof(csr_m.shape)+
    csr_m.data.nbytes+
    csr_m.indices.nbytes+
    csr_m.indptr.nbytes)/(1024.0**2)
)
The size in memory of the adjacency matrix is 12.1022920609 MB
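
The figure can be verified by hand, assuming the default 4-byte (int32) index arrays: the boolean data array takes one byte per edge, indices four bytes per edge, and indptr four bytes per row plus one.

\[ 2312497 \cdot 1 + 2312497 \cdot 4 + (281903 + 1) \cdot 4 = 12690101 \text{ bytes} \approx 12.1 \text{ MB} \]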

Summing up, we have gone from 74 GB to just 12 MB by using a sparse matrix. That is pretty good!

Save/Load a CSR matrix

Although 6 seconds is not a huge loading time, it would be nice to store the CSR matrix in a native format so as to speed up loading and saving.

The trick here is to store the arrays the matrix is defined by: shape, data, indices and indptr. Furthermore, it is possible to omit the data array, since it is just an array of True values whose length is the number of edges.

In [6]:
def csr_save(filename,csr):
    '''Save a boolean CSR matrix: only the shape and the index arrays are stored.'''
    np.savez(filename,
        nodes=csr.shape[0],
        edges=csr.data.size,
        indices=csr.indices,
        indptr =csr.indptr
    )

def csr_load(filename):
    '''Rebuild the CSR matrix; the data array is recreated as all-True values.'''
    loader = np.load(filename)
    edges = int(loader['edges'])
    nodes = int(loader['nodes'])
    return sparse.csr_matrix(
        (np.bool_(np.ones(edges)), loader['indices'], loader['indptr']),
        shape = (nodes,nodes)
    )
In [7]:
DATASET_NATIVE = 'dataset-native.npz'
csr_save(DATASET_NATIVE,csr_m)
%time csr = csr_load(DATASET_NATIVE)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 110 ms


PageRank Algorithm

When the PageRank algorithm is taught, the usual way to compute it consists of calculating the Google matrix \(A\):

\[ A = \beta M + \frac{1-\beta}{n}\mathbf{e} \cdot \mathbf{e}^T \]

Where:
  • \(\beta\) is the probability a user follows a link, so \(1-\beta\) is the teleportation factor.
  • \(n\) is the number of elements (nodes).
  • \(M\) is a matrix built like this (a toy example is sketched right after this list):
    1. The starting point is a graph or adjacency matrix \(G\) (shape \(n \times n\)). If there is a link from \(i\) to \(j\), then the element \(G_{j,i}=1\).
    2. Normalize the matrix by columns, so that every column sums to 1.
    3. If a column sums to 0, set all its elements to \(1/n\).
  • \(\mathbf{e}\) is a vector of ones of length \(n\).
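
As a toy illustration of this construction (not part of the original notebook), consider a 3-node graph where node 0 links to nodes 1 and 2, node 1 links to node 2, and node 2 has no outgoing links:

# Build the Google matrix A for a tiny 3-node graph.
# Columns are origins, rows are destinations: G[j,i]=1 means a link i->j.
n = 3
beta = 0.85
G_toy = np.array([[0., 0., 0.],
                  [1., 0., 0.],
                  [1., 1., 0.]])
M = G_toy.copy()
out_deg = M.sum(axis=0)
for i in range(n):
    if out_deg[i] == 0:
        M[:, i] = 1.0/n        # dead-end column: spread evenly (step 3)
    else:
        M[:, i] /= out_deg[i]  # normalize the column so it sums to 1 (step 2)
A = beta*M + (1-beta)/n * np.ones((n, n))  # Google matrix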

And then apply the Power Method:

\[ \vec{r}_{(t+1)} = A \cdot \vec{r}_{(t)} \]

Until it converges:

\[ \lVert \vec{r}_{(t+1)} - \vec{r}_{(t)} \rVert _1 < \epsilon \]

Then, \(\vec{r}\) is an eigenvector of \(A\) with the PageRank scores. A property of this vector is:

\[ \sum^{n}_{i=1} r_i = 1 \]

This approach has a huge drawback: it has to compute the Google matrix \(A\), which is dense and, as seen before, is not likely to fit in memory.

PageRank with a sparse matrix

However, it is possible to rewrite the algorithm so that it works with the sparse adjacency matrix \(G\) and saves a lot of memory. The trick consists of never building the Google matrix explicitly: its effect is recomputed from \(G\) in each iteration, paying some CPU cycles in order to save memory.

Here is the pseudocode.

  • \(\vec{r} \leftarrow 1/n\)
  • \(flag \leftarrow True\)
  • while \(flag\)
    • \(\vec{r}_{(t+1)} \leftarrow G \cdot \frac{\vec{r}_t \cdot \beta}{\mathbf{d}_{out}} \)
    • \(\vec{r}_{(t+1)} \leftarrow \vec{r}_{(t+1)} + \frac{1-\lVert \vec{r}_{(t+1)} \rVert _1}{n} \quad\) (Re-insert leaked PageRank)
    • \(flag \leftarrow \lVert \vec{r}_{(t+1)} - \vec{r}_{(t)} \rVert _1 > \epsilon \quad\) (Stop condition: iterate while the change exceeds \(\epsilon\))
    • \(\vec{r}_{(t)} \leftarrow \vec{r}_{(t+1)}\)

Where:
  • \(G\) is the adjacency/connectivity matrix.
  • \(\mathbf{d}_{out}\) is a vector with the out-degree values of the nodes.

It is also possible to precompute the value of \(\mathbf{d}_{out} \cdot \frac{1}{\beta}\) outside the loop, since it is invariant across iterations.

Implementation

In [8]:
def compute_PageRank(G, beta=0.85, epsilon=10**-4):
    '''
    Efficient computation of the PageRank values using a sparse adjacency 
    matrix and the iterative power method.
    
    Parameters
    ----------
    G : boolean adjacency matrix. np.bool8
        If the element j,i is True, means that there is a link from i to j.
    beta: 1-teleportation probability.
    epsilon: stop condition. Minimum allowed amount of change in the PageRanks
        between iterations.

    Returns
    -------
    output : tuple
        PageRank array normalized top one.
        Number of iterations.

    '''    
    #Test adjacency matrix is OK
    n,_ = G.shape
    assert(G.shape==(n,n))
    #Constants Speed-UP
    deg_out_beta = G.sum(axis=0).T/beta #vector
    #Initialize
    ranks = np.ones((n,1))/n #vector
    time = 0
    flag = True
    while flag:        
        time +=1
        with np.errstate(divide='ignore'): # Ignore division by 0 on ranks/deg_out_beta
            new_ranks = G.dot((ranks/deg_out_beta)) #vector
        #Leaked PageRank
        new_ranks += (1-new_ranks.sum())/n
        #Stop condition
        if np.linalg.norm(ranks-new_ranks,ord=1)<=epsilon:
            flag = False        
        ranks = new_ranks
    return(ranks, time)

Experiment

In [9]:
print '==> Computing PageRank'
%time pr,iters = compute_PageRank(csr)
print '\nIterations: {0}'.format(iters)
print 'Element with the highest PageRank: {0}'.format(np.argmax(pr)+1)
==> Computing PageRank
CPU times: user 484 ms, sys: 0 ns, total: 484 ms
Wall time: 487 ms

Iterations: 36
Element with the highest PageRank: 89073

Validation and Conclusion

Here is another way to compute the PageRank using NetworkX.

In [10]:
import networkx as nx

print '==> Loading data.'
with open(DATASET_ORIGINAL,'r') as f:
    edgelist = [
        tuple(int(x)-1 for x in line.split())
        for line in f.readlines()[4:]
    ] 
    
print '\n==> Building graph.'
%time g = nx.from_edgelist(edgelist, create_using=nx.DiGraph())

print '\n==> Computing PageRank'
%time pr = nx.pagerank(g)

pr_max = max(pr.items(), key= lambda x: x[1])
print '\nElement with the highest PageRank: {0}'.format(pr_max[0]+1)
==> Loading data.

==> Building graph.
CPU times: user 6.02 s, sys: 220 ms, total: 6.24 s
Wall time: 6.16 s

==> Computing PageRank
CPU times: user 29.7 s, sys: 488 ms, total: 30.2 s
Wall time: 30.1 s

Element with the highest PageRank: 89073

NetworkX and I agree on the element with the highest PageRank. However, its running times are much higher: NetworkX takes nearly 60 times longer than my implementation to calculate the PageRank scores.



