# samucoder

Samuel M.H. 's technological blog

Notebook

# PageRank¶

AUTHOR: Samuelm M.H. (samuel.mh@gmail.com)
DATE: 6-Feb-2015
DESCRIPTION:
A small study on how to implement efficiently the PageRank algorithm using Python.


## Dataset¶

The chosen dataset is a directed graph where nodes represent pages from Stanford University (stanford.edu) and directed edges represent hyperlinks between them. This representation is suitable to be transformed into the adjacency matrix required for the PageRank algorithm.

Information: * URL: http://snap.stanford.edu/data/web-Stanford.html * Year: 2002 * Nodes: 281.903 (numeration starts at 1) * Edges: 2.312.497 * Format: id of origin (tabulator) id of destiny (data start at line 5 and there is no tuple duplication). Represents a hyperlink. * Size (uncompressed): 31,4MB

In [1]:
DATASET_ORIGINAL = "web-Stanford.txt" #Uncompress the .gz file!
NODES = 281903
EDGES = 2312497


### First thoughts¶

We need to load the dataset into memory as an adjacency matrix. It can be a square matrix having a boolean value in each cell. The way NumPy implements this is with a dense matrix where each value is a Byte. Let's see how much memory is needed.

More than 74GB to store the adjacency matrix!! It doesn't fit in the RAM of my laptop.

## Sparse Matrices¶

It seems reasonable that, if the dataset size is 31,4MB, there must be a representation of it that fits into memory. The key is that a single node is on average connected only to a few nodes (lets say 10). This means there are a lot of false values (not hyperlinks). As false values can be seen as a zeroes, we say the matrix is sparse. A good property of a sparse matrices is that their representation is much smaller as their dense equivalent.

There are several libraries to deal with sparse matrices in Python, but the one i've chosen is scipy.sparse

In [2]:
import numpy as np
from  scipy import sparse


The goal is to keep the adjacency matrix into a Compressed Sparse Row matrix. According to the CSR matrix documentation

Advantages of the CSR format: * Efficient arithmetic operations CSR + CSR, CSR * CSR, etc. * Efficient row slicing * Fast matrix vector products

So, the efficient row operations is what we need to apply the power method to compute the PageRank values.

Disadvantages of the CSR format * Slow column slicing operations (consider CSC) * Changes to the sparsity structure are expensive (consider LIL or DOK)

If we build an empty CSR matrix and try to modify its values we will get a warning like.
SparseEfficiencyWarning:
Changing the sparsity structure of a csr_matrix is expensive.
lil_matrix is more efficient.


The first solution could be using a Dictionary Of Keys based sparse matrix and then converting it to a CSR matrix. But it turns out to be slow.

In [3]:
def dataset2dok():
with open(DATASET_ORIGINAL,'r') as f:
dokm = sparse.dok_matrix((NODES,NODES),dtype=np.bool)
origin, destiny = (int(x)-1 for x in line.split())
dokm[destiny,origin]=True
return(dokm.tocsr())

%time dok_m = dataset2dok()

CPU times: user 34.6 s, sys: 256 ms, total: 34.9 s
Wall time: 34.8 s



Another solution could be building the Compressed Sparse Row matrix from its indices instead of modifying the cells.

In [4]:
def dataset2csr():
row = []
col = []
with open(DATASET_ORIGINAL,'r') as f:
origin, destiny = (int(x)-1 for x in line.split())
row.append(destiny)
col.append(origin)
return(sparse.csr_matrix(([True]*EDGES,(row,col)),shape=(NODES,NODES)))

%time csr_m = dataset2csr()

CPU times: user 5.7 s, sys: 60 ms, total: 5.76 s
Wall time: 5.74 s



How much memory does it take to store the adjacency (sparse) matrix?

According to the documentation, a CSR matrix is defined by: * dtype: data type of the matrix. Not really needed. * shape: 2-tuple shape of the matrix. * ndim: Number of dimensions (this is always 2). Not really needed. * data: CSR format data array of the matrix. * indices: CSR format index array of the matrix. * indptr: CSR format index pointer array of the matrix.

In [5]:
import sys

print "The size in memory of the adjacency matrix is {0} MB".format(
(sys.getsizeof(csr_m.shape)+
csr_m.data.nbytes+
csr_m.indices.nbytes+
csr_m.indptr.nbytes)/(1024.0**2)
)

The size in memory of the adjacency matrix is 12.1022920609 MB



Suming up. We have gone from 74GB to just 12MB by using a sparse matrix. That is pretty good!

Despite the fact that 6 seconds is not a huge loading time, it will be nice to store the CSR matrix in a native format so to speed up the loading/saving times.

Here the trick is to store the arrays the matrix is defined by: shape, data, indices, indptr. Furthermore, it is possible to omit the data array taking into consideration that is an array of "True" values of length the number of edges.

In [6]:
def csr_save(filename,csr):
np.savez(filename,
nodes=csr.shape[0],
edges=csr.data.size,
indices=csr.indices,
indptr =csr.indptr
)

return sparse.csr_matrix(
shape = (nodes,nodes)
)

In [7]:
DATASET_NATIVE = 'dataset-native.npz'
csr_save(DATASET_NATIVE,csr_m)

CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 110 ms



## PageRank Algorithm¶

When the PageRank algorithm is taught, the usual way to compute it consists on calculating the Google matrix $A$.

$A = \beta M + \frac{1-\beta}{n}\mathbf{e} \cdot \mathbf{e}^T$

Where: * $\beta$ is the probability a user follows a link, so $1-\beta$ is the teleportation factor. * $n$ is the number of elements (nodes). * $M$ is a matrix built like this: 1. The starting point is a graph or adjacency matrix $G$ (shape $n \times n$). If there is a link from $i$ to $j$ then the element $G_{j,i}=1$ 2. Normalize the matrix by columns, so the sum of every column is 1. 3. If there are columns that sum 0, then set their elements to $1/n$. * $\mathbf{e}$ is a vector of 1 of length $n$.

And then apply the Power Method:

$\vec{r}_{(t+1)} = A \cdot \vec{r}_{(t)}$

Until it converges:

$\lVert \vec{r}_{(t+1)} - \vec{r}_{(t)} \rVert _1 < \epsilon$

Then, $\vec{r}$ is an eigenvector of $A$ with the PageRank scores. A property of this vector is:

$\sum^{n}_{i=1} r_i = 1$

This approach has a huge drawback since it has to compute the Google matrix $A$, and, as seen before it is not likely to fit in memory.

### PageRank with a sparse matrix¶

However, it is possible to rewrite the algorithm in order to work with the sparse adjacency matrix $G$ and save a lot of memory. The trick consist on computing the Google matrix per row in each iteration in order to save memory and pay CPU cycles.

Here is the pseudocode.

• $\vec{r} \leftarrow 1/n$
• $flag \leftarrow True$
• while $flag$
• $\vec{r}_{(t+1)} \leftarrow G \cdot \frac{\vec{r}_t \cdot \beta}{\mathbf{d}_{out}}$
• $\vec{r}_{(t+1)} \leftarrow \vec{r}_{(t+1)} + \frac{1-\vec{r}_{(t+1)}}{n} \quad$ (Re-insert leaked PageRank)
• $flag \leftarrow \lVert \vec{r}_{(t+1)} - \vec{r}_{(t)} \rVert _1 < \epsilon \quad$ (Stop condition)
• $\vec{r}_{(t)} \leftarrow \vec{r}_{(t+1)}$

Where: - $G$ is the adjacency/connectivity matrix. - $\mathbf{d}_{out}$ is a vector with the out-degree values of the nodes.

It is also possible to compute the value of $\mathbf{d}_{out} \cdot \frac{1}{\beta}$ since it is an invariant inside the loop.

### Implementation¶

In [8]:
def compute_PageRank(G, beta=0.85, epsilon=10**-4):
'''
Efficient computation of the PageRank values using a sparse adjacency
matrix and the iterative power method.

Parameters
----------
G : boolean adjacency matrix. np.bool8
If the element j,i is True, means that there is a link from i to j.
beta: 1-teleportation probability.
epsilon: stop condition. Minimum allowed amount of change in the PageRanks
between iterations.

Returns
-------
output : tuple
PageRank array normalized top one.
Number of iterations.

'''
n,_ = G.shape
assert(G.shape==(n,n))
#Constants Speed-UP
deg_out_beta = G.sum(axis=0).T/beta #vector
#Initialize
ranks = np.ones((n,1))/n #vector
time = 0
flag = True
while flag:
time +=1
with np.errstate(divide='ignore'): # Ignore division by 0 on ranks/deg_out_beta
new_ranks = G.dot((ranks/deg_out_beta)) #vector
#Leaked PageRank
new_ranks += (1-new_ranks.sum())/n
#Stop condition
if np.linalg.norm(ranks-new_ranks,ord=1)<=epsilon:
flag = False
ranks = new_ranks
return(ranks, time)


### Experiment¶

In [9]:
print '==> Computing PageRank'
%time pr,iters = compute_PageRank(csr)
print '\nIterations: {0}'.format(iters)
print 'Element with the highest PageRank: {0}'.format(np.argmax(pr)+1)

==> Computing PageRank
CPU times: user 484 ms, sys: 0 ns, total: 484 ms
Wall time: 487 ms

Iterations: 36
Element with the highest PageRank: 89073



## Validation and Conclusion¶

Here is another way to compute the PageRank using NetworkX.

In [10]:
import networkx as nx

with open(DATASET_ORIGINAL,'r') as f:
edgelist = [
tuple(int(x)-1 for x in line.split())
]

print '\n==> Building graph.'
%time g = nx.from_edgelist(edgelist, create_using=nx.DiGraph())

print '\n==> Computing PageRank'
%time pr = nx.pagerank(g)

pr_max = max(pr.items(), key= lambda x: x[1])
print '\nElement with the highest PageRank: {0}'.format(pr_max[0]+1)

==> Loading data.

==> Building graph.
CPU times: user 6.02 s, sys: 220 ms, total: 6.24 s
Wall time: 6.16 s

==> Computing PageRank
CPU times: user 29.7 s, sys: 488 ms, total: 30.2 s
Wall time: 30.1 s

Element with the highest PageRank: 89073



NetworkX and I agree on the element with the highest PageRank. It is possible to see the times are are higher. NetworkX last nearly 60 times more than my implementation to calculate the PageRank scores.