Language verbosity comparison using Rosetta Code

Reason for research: the infamous Lines of Code (LoC)

LoC is one of the metrics used to measure programming productivity. One obvious disadvantage of this metric is that it breaks down when comparing lines of code among developers who use different programming languages.

It is unfair to compare languages in terms of lines because

  1. Different languages are better for different purposes

  2. Syntax sometimes bloats source code

And yet here I am, giving this comparison a shot.

Inspiration

In my search for language comparisons, I stumbled across the article "Programming language verbosity" by James Fisher.

James used the compression ratio as a measure of redundancy.
Throughout this post, the compression ratio is the compressed size divided by the original size, so the lower a language's compression ratio, the more verbose the language.
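
To make the metric concrete, here is a minimal sketch of how such a ratio can be computed with Python's zlib. The helper name compression_ratio and both sample strings are my own illustrations, not taken from James's article; the ratio is compressed size divided by original size, the same calculation used later in this post.

```python
import zlib


def compression_ratio(source: str) -> float:
    """Return compressed size / original size for UTF-8 encoded text."""
    raw = source.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


# Hypothetical samples: heavily repeated boilerplate vs. lines that share less text
repetitive = "public static final int VALUE = 0;\n" * 40
varied = "\n".join(f"x{i} = {i * i} + {i % 7}" for i in range(40))

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

The repetitive sample should come out with a much lower ratio, which is what "more redundant" means in this context.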

Data gathering

Unlike James, I decided not to use open-source projects for the verbosity comparison, since one of the factors that decreases the compression ratio is the length of the text.

If we compressed 10 lines of Java and 1000 lines of Ruby, Java would look much less verbose than Ruby.
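
The length effect is easy to reproduce. The sketch below is only an illustration with a made-up line of code: it compresses 1, 10 and 100 similar lines and prints the ratio for each. The helper compression_ratio is a hypothetical name, and the exact numbers depend on the zlib version and settings.

```python
import zlib


def compression_ratio(source: str) -> float:
    """Return compressed size / original size for UTF-8 encoded text."""
    raw = source.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


# Build texts of increasing length from the same (made-up) kind of line
ratios = {}
for repeats in (1, 10, 100):
    text = "".join(f"total = total + values[{i}]\n" for i in range(repeats))
    ratios[repeats] = compression_ratio(text)
    print(repeats, round(ratios[repeats], 2))
```

For the single line the ratio ends up above 1, because zlib adds a fixed header and checksum; with more lines the compressor has more repetition to exploit and the ratio drops.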

For this reason, I thought that comparing code of similar length, or even better, code implementing the same logic, would be a better fit for this job. This is where Rosetta Code came in very handy: it presents implementations of the same tasks across hundreds of different programming languages.

Let's first gather the data:

import csv
import zlib

import requests
from lxml import etree
from urllib.parse import urljoin

# Get all different tasks first
tasks = dict()
base_url = "https://rosettacode.org"

next_page = "/wiki/Category:Programming_Tasks"
while next_page:
    url = urljoin(base_url, next_page)
    resp = requests.get(url)
    root = etree.HTML(resp.text)
    _tasks = dict(
        zip(
            root.xpath("//div[@class='mw-category mw-category-columns']//a/@title"),
            root.xpath("//div[@class='mw-category mw-category-columns']//a/@href"),
        )
    )
    tasks.update(_tasks)
    _next_href = root.xpath("//a[text()='next page']/@href")
    if _next_href:
        next_page = _next_href[0]
    else:
        break

# Extract data and save it to a CSV file
with open("language_data.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["task", "lang", "loc", "utf8_len", "compressed_len"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Iterate over tasks
    for task_name, task_path in tasks.items():
        url = urljoin(base_url, task_path)
        resp = requests.get(url)
        root = etree.HTML(resp.text)

        lang_headlines = root.xpath('//h2/span[@class="mw-headline"]')

        # Iterate over language implementations of task
        for lang_headline in lang_headlines:
            headline_parent = lang_headline.xpath("./..")[0]
            following_code_sibling = headline_parent.xpath(
                './following-sibling::div[@dir="ltr"][1]'
            )
            if not following_code_sibling:
                # Couldn't find any code block following language header
                continue

            preceding_headline_parent = following_code_sibling[0].xpath(
                "./preceding-sibling::h2[1]"
            )[0]

            if preceding_headline_parent != headline_parent:
                # Found code block belongs to different language
                continue

            code = "".join(following_code_sibling[0].xpath("./pre//text()"))
            loc = code.count("\n") + 1
            utf8_encoded = code.encode("utf-8")
            compressed = zlib.compress(utf8_encoded)
            writer.writerow(
                {
                    "task": task_name,
                    "lang": lang_headline.xpath("./@id")[0],
                    "loc": loc,
                    "utf8_len": len(utf8_encoded),
                    "compressed_len": len(compressed),
                }
            )

Data exploration

Now that I had the data in a CSV file, I decided to build some charts in Superset.
I wanted to focus on a small subset of languages, so I filtered the data down a bit.

Let's check the compression ratio by language:

Compression ratio by language

First things first: seeing Ruby with the highest compression ratio and Java with the lowest makes some sense, but PHP in third place made me think that something was off (no offence, PHP).

Number of task implementations by language

Now it is clear that I most probably fell into the very trap I wanted to avoid by using tasks from Rosetta Code: the languages do not all implement the same tasks, so the averages are not computed over comparable code.

Let's check the average ratio again, but only for the tasks that are implemented in all of the languages I'm comparing.
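
I did this filtering in Superset, but the same idea can be sketched in pandas. The sample rows below are made up and only mimic the shape of language_data.csv; the three-language list is likewise just an example.

```python
import pandas as pd

# Tiny made-up sample in the same shape as language_data.csv
df = pd.DataFrame(
    {
        "task": ["A", "A", "A", "B", "B", "C"],
        "lang": ["Java", "Ruby", "PHP", "Java", "Ruby", "PHP"],
        "utf8_len": [400, 200, 250, 900, 500, 300],
        "compressed_len": [180, 110, 130, 350, 240, 160],
    }
)
languages = ["Java", "Ruby", "PHP"]

df = df[df["lang"].isin(languages)]
# Number of distinct compared languages that implement each task
langs_per_task = df.groupby("task")["lang"].transform("nunique")
# Keep only tasks implemented in every compared language
common = df[langs_per_task == len(languages)].copy()

common["ratio"] = common["compressed_len"] / common["utf8_len"]
print(common.groupby("lang")["ratio"].mean())
```

Only task "A" survives here, since it is the only one implemented in all three languages.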

Compression ratio by language across the same task implementations

Interestingly, even though Java's average LoC per task is quite low, its compression ratio stays low as well.

Average lines of code by language across the same task implementations

As I mentioned before, one of the factors affecting the compression ratio is the size of the source code.

Ratio over lines of code

Additionally, I wanted to fit a logarithmic regression of ratio against LoC for each of the languages.
For that, let's go back to the code for a second:

import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

languages = (
    "C",
    "C#",
    "C++",
    "Go",
    "Haskell",
    "Java",
    "JavaScript",
    "Julia",
    "Kotlin",
    "Lua",
    "Pascal",
    "Perl",
    "PHP",
    "PowerShell",
    "Python",
    "Ruby",
    "Rust",
    "Swift",
)

df = pd.read_csv("language_data.csv")
# Compressing very short or hard-to-compress snippets can increase the size
df = df[df["utf8_len"] > df["compressed_len"]]
# Filter by languages (copy so the new column is not set on a view)
df = df[df["lang"].isin(languages)].copy()

df["ratio"] = df["compressed_len"] / df["utf8_len"]
grouped_df = df.groupby("lang").aggregate({"loc": list, "ratio": list})

# Calculate regression coefficients
regression_coefficients = grouped_df.apply(
    lambda x: pd.Series(np.polyfit(np.log(x["loc"]), x["ratio"], 1), index=["b", "a"]),
    axis=1,
)
regression_coefficients.reset_index(inplace=True)

# Calculate the predicted compression ratio for a file with 100 lines of code
regression_coefficients["100_line"] = (
    np.log(100) * regression_coefficients["b"] + regression_coefficients["a"]
)

regression_coefficients = regression_coefficients.sort_values(
    "100_line", ascending=False
)

xpts = np.arange(1, 101)

plt.figure(figsize=(20, 20))
for lang, a, b in zip(
    regression_coefficients["lang"],
    regression_coefficients["a"],
    regression_coefficients["b"],
):
    # Fitted curve: ratio = a + b * ln(LoC)
    plt.plot(xpts, a + np.log(xpts) * b, label=lang)

plt.legend(loc="upper right")
plt.ylim(0, 1.1)
plt.show()
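
As a sanity check on the fitting step above, here is a small sketch that runs np.polyfit on synthetic data. The coefficients 0.9 and -0.1 are made up for the check and are not results from the Rosetta Code data; the model is ratio = a + b * ln(LoC), the same form as in the code above.

```python
import numpy as np

# Synthetic data generated from known, made-up coefficients
true_a, true_b = 0.9, -0.1
loc = np.arange(1, 201)
ratio = true_a + true_b * np.log(loc)

# polyfit returns the highest-degree coefficient first: slope b, then intercept a
b, a = np.polyfit(np.log(loc), ratio, 1)

# Predicted ratio for a file with 100 lines of code
predicted_100 = a + b * np.log(100)
print(round(a, 3), round(b, 3), round(predicted_100, 3))
```

Because the synthetic data contains no noise, the fit should recover the made-up coefficients almost exactly; on real data the points scatter around the curve.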

For this exercise, though, I think it would be better to gather data from open-source projects, to get a wide range of files containing many lines of code.

Conclusion?

To be honest, I don't believe I have one. You can see that some languages are more verbose than others, but I believe that comparing Lines of Code between different languages should be thrown out of the window, for a few reasons.

Whenever you compare them, remember that different languages in a team are most probably used to solve different kinds of problems.

More doesn't mean better or worse. When I have to solve a problem, I will always choose the language I'm comfortable with, even if it takes me more lines, simply because I will get it done much faster with a tool I'm familiar with.

And last but not least, measuring developers' productivity in lines of code (whether or not we have a great comparison between languages) tends to push them to write overcomplicated code.