ben f7724dad2f data from conspricay theorists and world news		2017-06-06 12:32:38 +02:00
data	data from conspricay theorists and world news	2017-06-06 12:32:38 +02:00
scripts	Major cleanup. Fixed crawler.py and crawled reddit from /r/Iceland/. Renamed files to more understandable names and fixed up the readme	2014-03-18 10:18:38 +00:00
.gitignore	gitignroe finally	2014-06-08 18:50:58 +00:00
README.md	derp	2014-06-08 20:39:45 +00:00
comments-tools.py	check for comment links	2014-06-08 20:38:41 +00:00
comments.py	converting data from comments.py to the same format as from crawler.py	2014-06-08 18:53:44 +00:00
complexnetworks.pdf	adding tthe paper that i keep loosing to the git repo so i can find it later	2014-06-22 19:07:50 +00:00
crawler.py	Cleanup and adding BFS depth, better argument handling to track a smaller circle of subreddits	2017-06-06 02:21:17 +02:00
dbmodel.py	Cleanup and adding BFS depth, better argument handling to track a smaller circle of subreddits	2017-06-06 02:21:17 +02:00
graphtool_analyze.py	Building a gephi file with graph-tool	2014-03-19 14:31:53 +00:00
igraph_analyze.py	Fixed the crawler, finished dbmodel.py and renamed some files to more sensible names. missed some file in `c795b07a52`	2014-03-18 10:36:27 +00:00
igraph_layouts.py	Fixed the crawler, finished dbmodel.py and renamed some files to more sensible names. missed some file in `c795b07a52`	2014-03-18 10:36:27 +00:00
notes.txt	cleanup and moving pickle files to the picklejar"	2013-04-29 14:06:26 +00:00
old-crawler.py	Fixed the crawler, finished dbmodel.py and renamed some files to more sensible names. missed some file in `c795b07a52`	2014-03-18 10:36:27 +00:00
updater.py	Removing the old crawler code	2013-04-29 14:21:07 +00:00

README.md

reddit-communities

Crawling and graphing reddit, with the purpose of analysing it as a social network.

How?

Subreddits commonly link to other subreddits in their "sidebar", which is available through the Reddit API. We then interpret each subreddit as a vertex in the graph and the links between subreddits as edges.
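As a rough illustration of the vertex/edge interpretation, here is a minimal sketch that turns one sidebar text into directed edges. The function name `sidebar_edges`, the regular expression and the sample sidebar text are assumptions made for this example, not code taken from crawler.py.

```python
import re

# Assumption: sidebar links to other subreddits look like /r/<name>,
# where <name> consists of letters, digits and underscores.
SUBREDDIT_LINK = re.compile(r"/r/([A-Za-z0-9_]+)")


def sidebar_edges(subreddit, sidebar_text):
    """Return the set of directed edges (source, target) found in one sidebar."""
    source = subreddit.lower()
    targets = {name.lower() for name in SUBREDDIT_LINK.findall(sidebar_text)}
    targets.discard(source)  # ignore links from a subreddit to itself
    return {(source, target) for target in targets}


# Made-up sidebar text, for illustration only.
edges = sidebar_edges(
    "Iceland", "See also /r/Reykjavik and /r/europe, or /r/iceland itself."
)
print(sorted(edges))  # [('iceland', 'europe'), ('iceland', 'reykjavik')]
```

Names are lowercased here so that /r/Iceland/ and /r/iceland/ end up as the same vertex.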

I first wrote a crawler that saves the info in a sqlite database. Then it's easy to convert that data into whatever analytic tools I want to use.
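A minimal sketch of what storing the edges in sqlite and reading them back could look like. The table name `mapping` is the one this README mentions further down; the column names `source` and `target` and both helper functions are hypothetical, and the real schema in dbmodel.py may differ.

```python
import sqlite3


def save_edges(conn, edges):
    """Store directed edges, silently skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mapping ("
        "source TEXT NOT NULL, target TEXT NOT NULL, "
        "PRIMARY KEY (source, target))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO mapping (source, target) VALUES (?, ?)", edges
    )
    conn.commit()


def load_edges(conn):
    """Return all edges as a list of (source, target) tuples."""
    return conn.execute(
        "SELECT source, target FROM mapping ORDER BY source, target"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
save_edges(conn, [("iceland", "europe"), ("iceland", "reykjavik"), ("iceland", "europe")])
print(load_edges(conn))  # [('iceland', 'europe'), ('iceland', 'reykjavik')]
```

A plain list of tuples like this is easy to hand to a graph library as an edge list.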

Usage

Sidebar data

Use crawler.py to crawl from a given starting subreddit and build the sqlite database in data/reddit.db:
```
 $ python crawler.py /r/Iceland/
```
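The crawl itself can be pictured as a breadth-first search with a depth limit. The sketch below is not crawler.py: `crawl`, `fetch_links` and `FAKE_SIDEBARS` are hypothetical names, and a dictionary stands in for the Reddit API so that the example runs offline.

```python
from collections import deque


def crawl(start, fetch_links, max_depth):
    """Breadth-first crawl from start.

    fetch_links(name) must return the subreddit names linked from name.
    Subreddits at max_depth are recorded as targets but not expanded.
    """
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for target in fetch_links(name):
            edges.append((name, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges


# Made-up sidebar data standing in for the API.
FAKE_SIDEBARS = {
    "iceland": ["europe", "reykjavik"],
    "europe": ["iceland", "worldnews"],
    "worldnews": ["news"],
}

edges = crawl("iceland", lambda name: FAKE_SIDEBARS.get(name, []), max_depth=2)
print(edges)
# [('iceland', 'europe'), ('iceland', 'reykjavik'),
#  ('europe', 'iceland'), ('europe', 'worldnews')]
```

With `max_depth=2`, worldnews is discovered but its own links are not followed, which keeps the crawl within a small circle around the start.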
Use graphtool_analyze.py to create data/reddit.gml, which can be opened with Gephi. I've also used igraph: igraph_analyze.py programmatically calculates some statistics (the mean geodesic distance and a top list based on vertex degree).
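For readers unfamiliar with the term, the mean geodesic distance is the average shortest-path length between pairs of vertices. The sketch below computes it in plain Python with one BFS per vertex. It is not what igraph_analyze.py does internally, and it assumes that unreachable pairs are simply left out of the average.

```python
from collections import deque


def mean_geodesic_distance(edges):
    """Average shortest-path length over all ordered, reachable vertex pairs."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())
    total = 0
    pairs = 0
    for start in adjacency:
        # Plain BFS from start, following edge direction.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1  # exclude the start vertex itself
    return total / pairs if pairs else 0.0


# Directed 3-cycle: every vertex reaches one other in 1 step and one in 2.
print(mean_geodesic_distance([("a", "b"), ("b", "c"), ("c", "a")]))  # 1.5
```

On a graph the size of the crawl described here, one BFS per vertex is slow in pure Python, which is a good reason to leave the real computation to a graph library.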

Comments data

To crawl comments, run comments.py and leave it running.
```
 $ python comments.py
```
Convert this data to the same format as from crawler.py with comments-tools.py.
```
 $ python comments-tools.py --convert
```

The data is then stored in the table comments_mapping as an undirected graph. This otherwise mirrors the format of the data in mapping, so graphtool_analyze.py (I need to get better at naming things) should now work on this dataset as well.

You can use comments-tools.py to check if a link exists between subreddits:

 $ python comments-tools.py --check /r/askreddit/ /r/programming/
 True
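One way to treat a table as an undirected graph is to store each pair in a fixed order, so that a lookup works regardless of argument order. The sketch below illustrates that idea. The table name comments_mapping comes from this README, while the column names and the helpers `normalize`, `add_link` and `link_exists` are hypothetical and may not match comments-tools.py.

```python
import sqlite3


def normalize(a, b):
    """Order a pair so that (a, b) and (b, a) map to the same row."""
    return tuple(sorted((a.lower(), b.lower())))


def add_link(conn, a, b):
    """Insert an undirected link, ignoring duplicates."""
    conn.execute(
        "INSERT OR IGNORE INTO comments_mapping (source, target) VALUES (?, ?)",
        normalize(a, b),
    )


def link_exists(conn, a, b):
    """Return True if the two subreddits are linked, in either order."""
    row = conn.execute(
        "SELECT 1 FROM comments_mapping WHERE source = ? AND target = ?",
        normalize(a, b),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
conn.execute(
    "CREATE TABLE comments_mapping ("
    "source TEXT, target TEXT, PRIMARY KEY (source, target))"
)
add_link(conn, "/r/programming/", "/r/askreddit/")
print(link_exists(conn, "/r/askreddit/", "/r/programming/"))  # True
print(link_exists(conn, "/r/askreddit/", "/r/iceland/"))  # False
```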

Methodology

I realized that there are two sources of data that connect subreddits together:

1. The links in the sidebar.
2. Comments: if a user comments in two or more subreddits, these subreddits are connected by that user. This might reveal interesting data about what people are interested in.
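The second source can be sketched as follows: every pair of subreddits that one user has commented in becomes an undirected edge. The function name `comment_edges` and the sample users are made up for the example and are not taken from comments.py.

```python
from itertools import combinations


def comment_edges(user_subreddits):
    """Map {user: subreddits commented in} to a set of undirected edges.

    Each edge is a tuple in sorted order, so (a, b) and (b, a) never both occur.
    """
    edges = set()
    for subreddits in user_subreddits.values():
        for pair in combinations(sorted(set(subreddits)), 2):
            edges.add(pair)
    return edges


# Made-up commenting history.
history = {
    "alice": ["iceland", "programming"],
    "bob": ["programming"],  # a single subreddit connects nothing
    "carol": ["askreddit", "iceland", "programming"],
}
print(sorted(comment_edges(history)))
# [('askreddit', 'iceland'), ('askreddit', 'programming'), ('iceland', 'programming')]
```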

Ideas and improvements

At first I decided to ignore edge weight, but I have realized this might be useful data. My idea is to change the comments method to use the number of users that have commented in the same pair of subreddits as the weight of the edge between them.
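A minimal sketch of this weighting idea, assuming the comment data is available as a mapping from user to the subreddits they have commented in. The name `weighted_comment_edges` and the sample data are hypothetical, and nothing like this exists in the repository yet.

```python
from collections import Counter
from itertools import combinations


def weighted_comment_edges(user_subreddits):
    """Count, for each pair of subreddits, how many users commented in both."""
    weights = Counter()
    for subreddits in user_subreddits.values():
        # set() ensures a user counts at most once per pair.
        weights.update(combinations(sorted(set(subreddits)), 2))
    return weights


# Made-up commenting history.
history = {
    "alice": ["iceland", "programming", "iceland"],
    "carol": ["askreddit", "iceland", "programming"],
}
weights = weighted_comment_edges(history)
print(weights[("iceland", "programming")])  # 2
print(weights[("askreddit", "iceland")])  # 1
```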

Novel findings

So far I have crawled about 24,000 subreddits (vertices) and just over 160,000 links (edges) between them. According to metareddit, there are many more.

The starting point has been /r/Iceland/, for no special reason except that it's the subreddit I've frequented the longest.

I have found that some communities are topologically disconnected from the rest of reddit. But since this is a directed graph, in which a link may exist in one direction only, they are not necessarily completely cut off from reddit. A prime example of this is the group of communities formed around /r/clojure/.
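The distinction can be made concrete with a small reachability check: a vertex may be unreachable when edge direction is respected, yet connected once direction is ignored. The graph below is entirely made up (it says nothing about how /r/clojure/ is actually linked), and `reachable` is a hypothetical helper.

```python
from collections import deque


def reachable(edges, start, directed=True):
    """Return the set of vertices reachable from start via BFS."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        if not directed:
            adjacency.setdefault(target, set()).add(source)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Made-up graph: the "island" links into the hub, but nothing links to it.
edges = [("start", "hub"), ("island_a", "island_b"), ("island_b", "hub")]
print("island_a" in reachable(edges, "start"))  # False
print("island_a" in reachable(edges, "start", directed=False))  # True
```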

If we order subreddits by their in-degree (the number of subreddits linking to them), without regard to the number of subscribers, we reveal one interesting statistic about reddit.

This is the top 10:

 $ python graph.py
 0. /r/music/
 1. /r/kateupton/
 2. /r/emmawatson/
 3. /r/starlets/
 4. /r/mileycyrus/
 5. /r/jenniferlawrence/
 6. /r/vanessahudgens/
 7. /r/emmastone/
 8. /r/emiliaclarke/
 9. /r/arianagrande/

In case you are wondering, the top 100 list shows the same behavior. wtf, Reddit?
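An in-degree ranking like the one above can be sketched in a few lines of plain Python. This is an illustration on made-up edges rather than the script that produced the list; `top_by_in_degree` is a hypothetical name.

```python
from collections import Counter


def top_by_in_degree(edges, n=10):
    """Return the n (subreddit, in_degree) pairs with the highest in-degree."""
    in_degree = Counter(target for _source, target in edges)
    return in_degree.most_common(n)


# Made-up edges: three subreddits link to "music", two to "b", one to "a".
edges = [
    ("a", "music"), ("b", "music"), ("c", "music"),
    ("a", "b"), ("c", "b"),
    ("b", "a"),
]
for rank, (name, degree) in enumerate(top_by_in_degree(edges, n=3)):
    print(f"{rank}. /r/{name}/ ({degree})")
# 0. /r/music/ (3)
# 1. /r/b/ (2)
# 2. /r/a/ (1)
```

Only incoming links are counted, so a subreddit with few subscribers can still rank highly if many other sidebars point at it.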

Notes

Layout algorithms not used (igraph)

For layouts actually used, see the code.

- "sugiyama": Sugiyama layout. Repeatedly segfaulted.
- "rt" and "rt_circular": Reingold-Tilford tree layout and circular Reingold-Tilford tree layout. Reddit is not a tree.
- Various 3D layouts: "kk_3d" (3D Kamada-Kawai), "fr_3d" (3D Fruchterman-Reingold), "drl_3d".

Author

Benedikt Kristinsson

benedikt@inventati.org