Analysis of GitHub Repositories Stargazers Using Benford's Law

Motivation

In the constantly evolving landscape of software development, the tools and libraries we choose play a crucial role in the success and reliability of our projects.

When it comes to selecting libraries as dependencies for software projects, developers often rely on several key factors. These include the recency of contributions to the repository, its licensing terms, the size of its user community, and, notably, the number of stars it has received on GitHub. Stars serve as a rough measure of a repository's popularity and trustworthiness in the open-source community.

However, GitHub stars are susceptible to manipulation, much like follower counts on Instagram. It's possible to artificially inflate a repository's star count by purchasing stars on various shady platforms. This poses a challenge for developers trying to assess the credibility and quality of a repository based on its star count.

To address this, let's dive into an analysis technique grounded in Benford's Law.

What is Benford's Law?

Benford's Law, also known as the Law of First Digits, is an observation about the distribution of leading digits in many real-world datasets. Contrary to the intuition that every digit should be equally likely, the digit 1 appears as the leading digit approximately 30% of the time, and each subsequent digit appears less frequently, down to 9 at under 5%. This pattern has been observed in a wide variety of datasets, including stock market data, census statistics, and the lengths of rivers.

A set of numbers is said to satisfy Benford's law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability:

$\displaystyle P(d)=\log_{10}(d+1)-\log_{10}(d)=\log_{10}\left(\frac{d+1}{d}\right)=\log_{10}\left(1+\frac{1}{d}\right).$

Given its widespread applicability, Benford's Law serves as a valuable tool for detecting anomalies and potential data manipulations.
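To make the formula concrete, here is a minimal Python sketch that evaluates P(d) for each leading digit. The function name `benford_probabilities` is made up for this illustration and is not taken from the analysis repository.

```python
import math

def benford_probabilities():
    """Return {digit: expected probability} for leading digits 1..9."""
    # P(d) = log10(1 + 1/d), the last form of the formula above
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_probabilities()
for digit, p in expected.items():
    print(f"{digit}: {p:.3f}")
```

The nine probabilities sum to 1, since the logarithms telescope to log10(10).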

By applying Benford's Law to the number of repositories each stargazer has starred, we can potentially uncover repositories whose stargazers deviate from the expected distribution. Such deviations may suggest artificial manipulation of star counts.
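As a rough sketch of that first step, the snippet below counts leading digits for one repository. It assumes you already have, for each stargazer, the number of repositories they have starred. The `starred_counts` list is invented sample data and the helper names are hypothetical, not the ones used in the original analysis.

```python
from collections import Counter

def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])

def leading_digit_counts(starred_counts):
    """Count leading digits 1..9, skipping non-positive values."""
    counts = Counter(leading_digit(n) for n in starred_counts if n > 0)
    return [counts.get(d, 0) for d in range(1, 10)]

# Hypothetical sample: repositories starred by each stargazer of one repository
starred_counts = [1, 12, 19, 104, 23, 250, 3, 37, 41, 5, 58, 6, 7, 82, 9, 0]
observed = leading_digit_counts(starred_counts)
print(observed)
```

The resulting list of nine counts is what gets compared against the Benford probabilities.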

After calculating the distribution of leading digits for 81 repositories, the tests rejected the hypothesis that the data follows Benford's Law for:

16 repositories with p-values below 0.05 using the Chi-squared test
4 repositories with p-values below 0.05 using the Kolmogorov–Smirnov test (all of them also rejected by the Chi-squared test)
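For readers who want to see what such tests involve, here is a standard-library sketch of both. It is an illustration under assumptions, not the code behind the numbers above: the chi-squared p-value uses the closed form that holds for 8 degrees of freedom, and the Kolmogorov–Smirnov p-value uses the asymptotic Kolmogorov series, which tends to be conservative for discrete data such as digits. The function names are hypothetical.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_squared_test(observed):
    """Pearson goodness-of-fit against Benford (9 digits, 8 degrees of freedom)."""
    n = sum(observed)
    stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, BENFORD))
    # Survival function of chi-squared with df = 8 (even df has a closed form)
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))
    return stat, p_value

def ks_test(observed):
    """Max gap between cumulative frequencies, with an asymptotic p-value."""
    n = sum(observed)
    d_stat = cum_obs = cum_exp = 0.0
    for o, p in zip(observed, BENFORD):
        cum_obs += o / n
        cum_exp += p
        d_stat = max(d_stat, abs(cum_obs - cum_exp))
    lam = math.sqrt(n) * d_stat
    if lam < 0.3:
        # The series converges slowly here and the p-value is effectively 1
        return d_stat, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                 for k in range(1, 101))
    return d_stat, min(max(2 * series, 0.0), 1.0)

# Hypothetical digit counts: one Benford-like sample, one uniform sample
benford_like = [301, 176, 125, 97, 79, 67, 58, 51, 46]
uniform = [111] * 9

print(chi_squared_test(benford_like), ks_test(benford_like))
print(chi_squared_test(uniform), ks_test(uniform))
```

With these inputs the Benford-like counts pass both tests comfortably, while the uniform counts are rejected by both at the 0.05 level.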

To plot the data, let's pick the 4 repositories that were rejected by the Kolmogorov–Smirnov test and 4 randomly chosen repositories that weren't rejected by the Chi-squared test.

Benford's distribution of rejected repositories

Benford's distribution of randomly selected repositories

Detecting Anomalies in GitHub Repositories by Rapid Growth

While some repositories might adhere to Benford's distribution, a deeper dive into their star history can reveal suspicious patterns. Rapid surges in stars, especially shortly after a repository's creation, can be indicative of inorganic growth, perhaps due to bot activity or purchased stars.
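One simple way to put a number on "a surge shortly after creation" is the share of all stars a repository received in its first few days. The sketch below assumes you have the repository's creation time and a list of star timestamps. The function name, the 7-day window, and the sample data are all assumptions for illustration, not the metric used in the original analysis.

```python
from datetime import datetime, timedelta

def early_star_share(created_at, starred_at, window_days=7):
    """Fraction of all stars received within window_days of repository creation."""
    if not starred_at:
        return 0.0
    cutoff = created_at + timedelta(days=window_days)
    early = sum(1 for ts in starred_at if ts <= cutoff)
    return early / len(starred_at)

# Hypothetical repository: three stars in the first week, one much later
created = datetime(2023, 1, 1)
stars = [
    datetime(2023, 1, 2),
    datetime(2023, 1, 3),
    datetime(2023, 1, 5),
    datetime(2023, 6, 1),
]
share = early_star_share(created, stars)
print(f"{share:.2f}")
```

A high share is not proof of manipulation on its own, since a launch announcement can produce the same shape, so it is best read alongside the other signals.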

Rapid growths of rejected repositories

Rapid growths of randomly selected repositories

Analyzing the weekly growth of stargazers over time provides another layer of insight. Sharp, unexplained spikes in stargazers can further hint at manipulation.
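A minimal sketch of that idea: bucket star timestamps by ISO week and flag weeks that far exceed the typical week. The helper names, the median baseline, and the factor of 5 are assumptions chosen for illustration, and weeks with no stars are simply absent from the result.

```python
from collections import Counter
from datetime import datetime
from statistics import median

def weekly_star_counts(starred_at):
    """Count new stars per ISO (year, week), sorted chronologically."""
    counts = Counter(tuple(ts.isocalendar())[:2] for ts in starred_at)
    return sorted(counts.items())

def spike_weeks(weekly, factor=5.0):
    """Return weeks whose count exceeds factor times the median weekly count."""
    if not weekly:
        return []
    baseline = median(count for _, count in weekly)
    return [(week, count) for week, count in weekly if count > factor * baseline]

# Hypothetical star timestamps: a quiet repository with one sudden burst
stars = (
    [datetime(2023, 1, 2)] * 2
    + [datetime(2023, 1, 9)] * 2
    + [datetime(2023, 1, 16)] * 30
    + [datetime(2023, 1, 23)] * 2
)
weekly = weekly_star_counts(stars)
print(spike_weeks(weekly))
```

Using the median rather than the mean keeps the baseline from being pulled up by the very spikes being looked for.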

Weekly growths of rejected repositories

Weekly growth of randomly selected repositories

Conclusion

Always ensure thorough due diligence when selecting libraries or tools, looking beyond surface-level metrics.

While metrics like GitHub stars offer a quick way to gauge a repository's credibility, it's essential to approach such indicators with a critical eye. Anomalies, rapid growth spurts, and other suspicious patterns can signal underlying issues or manipulations.

Analysis source code

For a deeper dive into the analysis, including code and detailed visualizations, check out the full project on GitHub: https://github.com/grumpy-miner/repos_stargazers_analysis

P.S. If you'd like to see an open-source project that analyses repositories and provides some useful metrics on stargazers, do not hesitate to leave a comment. :)
