bitbucket.org, code.google.com, and github.com host a lot of open source projects made by developers around the world. If you run a website or write an article about a given technology or language, it would be nice to gather the projects matching your topic from those code hosting providers. In this article I'll show Python scripts that can do that easily.
GitHub has an API that returns data in YAML format, but its search doesn't give you all the search results, so we have to parse the HTML search results like this:
import urllib2
from re import findall

def github_getall(term, page=1):
    # Fetch one page of the HTML search results, using a browser-like user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
    o = opener.open('http://github.com/search?type=Repositories&language=&q=%s&repo=&langOverride=&x=5&y=22&start_value=%s' % (term, str(page)))
    data = o.read()
    # Captured groups: i[2] is the "author / title" link text, i[4] the description
    repos = findall(r'(?xs)<h2\s*class="title">(.*?)<a\s*href="(.*?)">(.*?)</a>(.*?)<div\s*class="description">(.*?)</div>', data)
    for i in repos:
        name = i[2].split(' / ')
        author = name[0].strip()
        title = name[1].strip()
        url = 'http://github.com/%s/%s/' % (author, title)
        desc = i[4].strip()
        print title
        print url
        print desc
        print

# The function prints its results and returns None, so it is called without print
github_getall('django', 1)
This function fetches the search results, parses them, and prints the title, URL, and description of each repository. If you want to get all the results, use the paging:
for i in range(1, 10):
    try:
        github_getall('google+wave', i)
    except Exception:
        # no more pages in pagination, etc.
        print 'EXCEPTION'
        break
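If you would rather have a list to work with than printed output, the parsing step can be separated from the fetching. The following is only a sketch under a few assumptions: `parse_repos`, `collect_all`, and `fake_fetch` are names made up for this example, and the sample markup is hypothetical, shaped only to fit the regular expression used above (the real search page may look different). It is written so that it runs on both Python 2 and Python 3.

```python
import re

# Same pattern as in github_getall: (?x) ignores whitespace inside the
# pattern, (?s) lets "." match newlines.
REPO_RE = re.compile(
    r'(?xs)<h2\s*class="title">(.*?)<a\s*href="(.*?)">(.*?)</a>(.*?)'
    r'<div\s*class="description">(.*?)</div>')


def parse_repos(html):
    """Return the repositories found in one results page as a list of dicts."""
    repos = []
    for groups in REPO_RE.findall(html):
        name = groups[2].split(' / ')
        if len(name) != 2:
            continue  # link text is not "author / title", skip it
        author, title = name[0].strip(), name[1].strip()
        repos.append({
            'author': author,
            'title': title,
            'url': 'http://github.com/%s/%s/' % (author, title),
            'desc': groups[4].strip(),
        })
    return repos


def collect_all(fetch_page, max_pages=10):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page."""
    results = []
    for page in range(1, max_pages + 1):
        repos = parse_repos(fetch_page(page))
        if not repos:
            break
        results.extend(repos)
    return results


# Hypothetical markup standing in for two pages of search results.
SAMPLE_PAGES = {
    1: '<h2 class="title"><a href="/django/django">django / django</a></h2>\n'
       '<div class="description"> The Web framework </div>',
    2: '<h2 class="title"><a href="/alice/wave-bot">alice / wave-bot</a></h2>\n'
       '<div class="description">A robot</div>',
}


def fake_fetch(page):
    # Stand-in for the urllib2 request, so the sketch runs offline.
    return SAMPLE_PAGES.get(page, '<p>no results</p>')


for repo in collect_all(fake_fetch):
    print('%s  %s' % (repo['title'], repo['url']))
```

In real use `fetch_page` would wrap the `opener.open(...).read()` call from `github_getall`. Stopping at the first page without matches avoids relying on an exception to detect the end of the pagination.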
For code.google.com we can use similar code:
import urllib2
from re import findall

def googlec(term, page=1):
    openurl = 'http://code.google.com/hosting/search?q=%s&btn=Search+projects' % term
    # Later pages are requested with a start offset
    if page > 1:
        openurl = '%s&start=%s' % (openurl, str(page*10))
    opener = urllib2.build_opener()
    opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
    o = opener.open(openurl)
    data = o.read()
    # Captured groups: i[2] is the project path, i[3] the title, i[4] the description
    repos = findall(r'(?xs)clk\(this,\s*([0-9]*)\)"(.*?)href="(.*?)">(.*?)\n\s*-\s*(.*?)</a>', data)
    for i in repos:
        url = 'http://code.google.com%s' % i[2]
        title = i[3].strip()
        desc = i[4].strip()
        print title
        print desc
        print url
        print

# The function prints its results and returns None, so it is called without print
googlec('django', 1)
For Bitbucket this will work nicely:
import urllib2
from re import findall

def bitbucket(term, page=1):
    opener = urllib2.build_opener()
    opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
    o = opener.open('http://bitbucket.org/repo/all/popular/%s/?name=%s' % (str(page), term))
    data = o.read()
    # Captured groups: i[1] is the author, i[2] the repository path, i[3] the title, i[5] the description
    repos = findall(r'(?xs)<span><a\s*href="(.*?)">(.*?)</a>\s*/\s*<a\s*href="(.*?)">(.*?)</a></span><br\s*/>(.*?) (.*?).<br\s*/>', data)
    for i in repos:
        author = i[1]
        url = 'http://bitbucket.org%s' % i[2]
        title = i[3]
        desc = i[5].strip()
        print title
        print author
        print url
        print desc
        print

# The function prints its results and returns None, so it is called without print
bitbucket('django', 1)
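All three functions put the search term into the URL as it is, which is why a term like 'google+wave' has to be encoded by hand. A small helper can do the encoding instead. This is a sketch, not part of the scripts above: `build_url` and the named placeholders in the templates are my own additions for this example, while the URLs themselves are the ones used in the functions. The import fallback lets it run on both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2, as used in this article

# The search URLs from the functions above, with named placeholders
# so the argument order no longer matters.
GITHUB_SEARCH = ('http://github.com/search?type=Repositories&language='
                 '&q=%(term)s&repo=&langOverride=&x=5&y=22&start_value=%(page)s')
BITBUCKET_SEARCH = 'http://bitbucket.org/repo/all/popular/%(page)s/?name=%(term)s'


def build_url(template, term, page=1):
    """Fill a search URL template, URL-encoding the search term."""
    return template % {'term': quote_plus(term), 'page': page}


print(build_url(GITHUB_SEARCH, 'google wave', 2))
print(build_url(BITBUCKET_SEARCH, 'c++'))
```

With this helper a space becomes `+` and characters such as `+` or `&` in the term are percent-encoded, so they can't break the query string.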
I've used the code presented in this article to make a small Django app, Projects, that aggregates projects for given tags.
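The aggregation step itself can be quite small. The sketch below is not the code of the Projects app. It only illustrates the idea, assuming each scraper is changed to return a list of dicts with 'title' and 'url' keys instead of printing. The `aggregate` function and the sample data are hypothetical.

```python
def aggregate(results_by_tag):
    """Map each tag to its projects, dropping duplicate URLs and keeping order."""
    aggregated = {}
    for tag, projects in results_by_tag.items():
        seen = set()
        unique = []
        for project in projects:
            # Treat URLs that differ only in case or a trailing slash as the same project
            key = project['url'].rstrip('/').lower()
            if key not in seen:
                seen.add(key)
                unique.append(project)
        aggregated[tag] = unique
    return aggregated


# Hypothetical results, as they might come back from two providers for one tag.
sample = {
    'django': [
        {'title': 'django', 'url': 'http://github.com/django/django/'},
        {'title': 'django', 'url': 'http://github.com/django/django'},
        {'title': 'django-cms', 'url': 'http://bitbucket.org/alice/django-cms'},
    ],
}

for tag, projects in aggregate(sample).items():
    print('%s: %d projects' % (tag, len(projects)))
```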
- Added: 09.10.2009 by riklaunim