GH Archive is a record of the public GitHub timeline, made available as a BigQuery public dataset and through downloadable archives.
All public GitHub Issues, Releases, Stars, Pull Requests, Commits, and more included on the public GitHub timeline. GH Archive makes this metadata available for analysis, including over 15 event types. You can easily analyze GH Archive data by using the Google Cloud Console to query the dataset.
This repository shares examples for how you can use BigQuery and the GH Archive dataset to analyze public GitHub activity for your next project.
GH Archive has a list of research, visualizations, and talks based on their dataset which may spur some ideas! A sampling of some of those projects include:
Whether you’re an individual developer, a community, or an Open Source Program Office (OSPO) managing multiple projects, the GH Archive dataset may be useful for you. Once your queries are written, you can apply them to new repositories and GitHub organizations, as well as adjust the time periods you’re analyzing.
Each of the following example queries build upon one another, looking at GitHub activity for projects released by the Apache Software Foundation. We’ll run these queries by entering them into the Google Cloud Web Console.
Note that while these queries look similar, there are noticable differences that impact what data is being queried as well as the results. Do you notice the differences?
SELECT
COUNT(DISTINCT events.actor.id)
FROM `githubarchive.day.20240201` AS events
WHERE
events.repo.name = 'apache/cassandra'
SELECT
COUNT(DISTINCT events.actor.id)
FROM `githubarchive.month.202402` AS events
WHERE
events.repo.name = 'apache/cassandra'
SELECT
COUNT(DISTINCT events.actor.id)
FROM `githubarchive.month.202402` AS events
WHERE
events.org.login = 'apache'
Each examples retrieves a single value as their output using COUNT()
in the query. The examples are also all counting unique contributors (note the use of DISTINCT
in the query to ensure that contributors are not double-counted).
What’s different? A few things!
githubarchive.day.20240201
githubarchive.month.202402
githubarchive.year.2023
WHERE
statements:
events.repo.name = 'apache/cassandra'
events.org.login = 'apache'
Here are some additional queries that may spur ideas for how you can use the GH Archive dataset:
SELECT
repo.name, created_at, id
FROM `githubarchive.day.20240424` AS events
WHERE
events.org.login IN ('apache') and type IN ('ReleaseEvent')
row | name | created_at | id |
---|---|---|---|
1 | apache/apisix | 2024-04-24 07:56:23 UTC | 37761915325 |
2 | apache/pulsar-dotpulsar | 2024-04-24 17:26:50 UTC | 37783197893 |
This query returns a list of results, in contrast to the prior example where a single metric was returned using COUNT
. Results can easily be saved and exported from BigQuery and used by other tools.
“Exploring GitHub with BigQuery at GitHub” (video)(2017) introduces you to the BigQuery UI, writing queries to access GH Archive, and visualizing data with tools like Tableau and Looker.
GH Archive data is also used by multiple services which analyze GitHub activity and provide higher-level interfaces, APIs, and open source tools: