Let's Talk About Statistics - Introduction to Statistics for Software Quality (No.15 Practical Case Studies of Statistical Quality Control (SQC))
To reach a broader audience, this article has been translated from Japanese.
You can find the original version here.
Introduction: The Grand Finale of Quality Analysis Has Arrived
#"Why aren't defects decreasing?" "Is this process really stable?"
― Can you answer such questions with statistics?
So far, we have learned many analytical methods from a statistical perspective on the theme of software quality.
And now, in Part 15 of "Let's Talk About Statistics," as the culmination of this series, we introduce Statistical Quality Control (SQC).
SQC is a framework that leverages statistical methods to visualize the quality of processes and deliverables and to continuously improve them.
It is highly useful not only in manufacturing but also in software development and testing for analyzing bug trends and verifying process stability.
"Quality is not a matter of chance, but something to be controlled"—SQC strongly supports this idea.
In this chapter, building on the knowledge we have acquired so far, we explain, with concrete examples, how to apply it to real-world software quality management.
Main SQC Metrics for Practical Use
#"What does it mean to 'visualize quality' in concrete terms?"
It simply means having metrics that allow you to quantitatively grasp the state on the ground.
Instead of relying on subjective intuition or personal experience, you need tools that enable data-driven decisions.
Here, we introduce representative SQC metrics frequently used in software quality management.
They all serve as weapons for "speaking about quality in numbers."
| Metric | Example Use |
|---|---|
| Bug Density | Quantitative quality evaluation (e.g., defects per KLOC) |
| Review Defect Rate | Verifying review coverage and effectiveness |
| Defect Distribution (Pareto) | Identifying concentrated defect causes |
| Hypothesis Testing (t-tests, etc.) | Statistically testing the effectiveness of quality improvement or review process changes |
| Regression Analysis (simple, etc.) | Analyzing relationships between test effort or review scope and outcomes to quantitatively assess improvements and design validity |
Metrics should not be mere numbers for reporting but hints that inspire "insight" and "action."
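To make these metrics tangible before diving into the cases, here is a minimal sketch of computing two of them; all figures below are invented for illustration:

```python
# Minimal sketch: computing two common SQC metrics from hypothetical figures
defects_found = 42         # total defects detected (hypothetical)
code_size_kloc = 3.5       # code size in thousands of lines (hypothetical)
reviews_conducted = 20     # number of reviews held (hypothetical)
review_findings = 15       # defects identified during those reviews (hypothetical)

bug_density = defects_found / code_size_kloc              # defects per KLOC
review_defect_rate = review_findings / reviews_conducted  # findings per review

print(f"Bug density: {bug_density:.1f} defects/KLOC")             # 12.0
print(f"Review defect rate: {review_defect_rate:.2f} per review")  # 0.75
```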
Case 1: Defect Density and Delivery Decisions
● Background:
In a certain project, we introduced "defect density (defects/KLOC)" as the quality criterion for deciding whether to approve the final release.
This metric, rather than just the number of defects, considers the size of the source code, making it easier to compare across projects and against past performance.
● Actions:
- Collected the average defect density and its standard deviation (σ) from past projects as a comparison baseline.
- If the target project's defect density exceeded "average + 3σ", we considered it a statistically abnormal level of defects, and:
- Judged it as a quality risk
- Conducted additional reviews and focused checks to reconsider the release decision
● Key Points:
- The normalized metric of defects per KLOC allows comparison between large and small projects.
- Using the 3σ rule enables objective and explainable decisions.
- This method is the first step in moving away from subjective ‘delivery decisions’ and towards quantitatively managed quality standards.
Quality decisions based on numbers are key to increasing team-wide buy-in and transparency in decision-making.
● Example: Delivery Decision Logic Based on Defect Density (Python)
```python
import numpy as np

# Sample data of defect densities (defects/KLOC) from past projects
past_defect_densities = np.array([12.5, 15.0, 11.8, 14.3, 13.2, 16.1, 12.9])

# Current project's defect density
current_density = 23.7  # ← change this to check other projects

# Calculate mean and standard deviation
mean = np.mean(past_defect_densities)
std_dev = np.std(past_defect_densities, ddof=1)  # sample (not population) standard deviation

# Threshold = mean + 3σ
threshold = mean + 3 * std_dev

# Decision
print(f"Past Mean: {mean:.2f}, Std Dev: {std_dev:.2f}, Threshold: {threshold:.2f}")
print(f"Current Project Defect Density: {current_density:.2f}")
if current_density > threshold:
    print("△: Quality risk present—consider conducting additional reviews.")
else:
    print("〇: Quality risk is considered low.")
```
The results are as follows:
```
Past Mean: 13.69, Std Dev: 1.51, Threshold: 18.23
Current Project Defect Density: 23.70
△: Quality risk present—consider conducting additional reviews.
```
● Explanation of Statistical Methods
1. Calculation of Defect Density
- Definition: Defect Density = Number of Defects ÷ Code Size (KLOC)
- Role: A normalized measure of software quality by code size, allowing comparison across projects.
2. Use of Basic Statistics (Mean and Standard Deviation)
- Mean: Captures the "central tendency" of past performance values.
- Standard Deviation: Evaluates the variability (i.e., stability of quality) in defect density.
- Purpose: To determine whether the current project falls within the "normal range" of past performance.
3. 3σ Rule (Outlier Detection Based on Normal Distribution)
- Concept:
- Assuming a normal distribution, about 99.7% of data falls within "mean ± 3σ" (a quick simulation check follows this list).
- Data beyond this range is considered "statistically abnormal."
- Practical Interpretation:
- Defect density exceeding mean + 3σ → judged as a state of high quality risk, triggering additional reviews.
4. Outlier Detection and Decision Support
- This method quantifies risk based on mathematical evidence, enabling objective quality decisions rather than subjective ones.
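As a quick empirical check of the 99.7% figure, here is a minimal simulation sketch assuming normally distributed data:

```python
import numpy as np

# Draw a large sample from a standard normal distribution
rng = np.random.default_rng(0)
samples = rng.normal(loc=0, scale=1, size=1_000_000)

# Fraction of samples within mean ± 3σ — should be close to 0.997
within = np.mean(np.abs(samples) <= 3)
print(f"Within ±3σ: {within:.4f}")  # ≈ 0.9973
```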
● Summary of Case 1
In this case, by combining descriptive statistics (mean & standard deviation) with the 3σ rule for the practical metric of "defect density," we achieve process stability assessment and release decision support.
This is arguably the most fundamental approach in SQC (Statistical Quality Control).
Case 2: Analysis of Review Defect Rate
● Background:
Although code reviews were conducted, many bugs were still found in later stages.
We investigated the question: "Why are bugs being missed even though reviews were supposed to have been performed?"
● Actions:
- Visualized, at a file level, the ratio of "defects identified (review findings)" to "reviews conducted" (i.e., the review defect rate).
- Discovered that certain areas had extremely low review defect rates.
- Inferred that those areas were "formally reviewed on record but were essentially superficial reviews."
- Implemented a review process overhaul (e.g., formalizing review criteria, introducing pair reviews) in the affected areas.
● Key Points:
- The strength of the review defect rate is that it visualizes formal reviews and questions their effectiveness.
- By detecting extreme zero-defect rates or skewed patterns, you can reveal hidden signs of quality risk.
● Example: Visualizing Review Defect Rate (Python)
```python
import matplotlib.pyplot as plt
import numpy as np

# Japanese font (Windows environment)
plt.rcParams['font.family'] = 'Meiryo'

# Number of reviews conducted and number of defects identified per module
modules = ['ModuleA', 'ModuleB', 'ModuleC', 'ModuleD', 'ModuleE']
review_counts = np.array([10, 12, 9, 15, 13])  # Reviews conducted
defect_counts = np.array([4, 0, 5, 1, 6])      # Defects identified

# Calculate defect rate (avoid division by zero)
with np.errstate(divide='ignore', invalid='ignore'):
    review_rates = np.where(review_counts > 0, defect_counts / review_counts, 0)

# Visualize with a bar chart
plt.figure(figsize=(8, 5))
bars = plt.bar(modules, review_rates, color='skyblue')
plt.ylabel('Review Defect Rate')
plt.ylim(0, 1.0)
plt.title('Review Defect Rate by Module')

# Display values above bars
for bar, rate in zip(bars, review_rates):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.02,
             f'{rate:.2f}', ha='center')

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
```
The results were as follows:
- ModuleB had 0 defects identified out of 12 reviews → possible superficial review
- ModuleD also had a low defect rate (1/15) → possible insufficient coverage or reviewer inexperience
Through such visualization, you can quantitatively extract problematic review areas and target process improvements precisely.
● Explanation of Statistical Methods
- Ratio Metric (Review Defect Rate)
  - Defect Rate = Number of Defects Identified ÷ Number of Reviews Conducted
  - A representative process metric for measuring the "quality" of reviews
- Visual Outlier Detection (part of EDA)
  - Visualize extreme zeros or variability in a bar chart to detect signs of formalized or hollow reviews (see the sketch after this list)
- Introduction of a Practical Quantitative Improvement Cycle
  - Starting from "Why are there so few defects identified?" and leading to improvements in review objectives, methods, and criteria
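To make that visual check reproducible, here is a minimal sketch that flags suspiciously quiet modules; the 0.10 cutoff is an assumed illustrative value, not a standard:

```python
import numpy as np

# Same example data as the bar chart above
modules = ['ModuleA', 'ModuleB', 'ModuleC', 'ModuleD', 'ModuleE']
review_counts = np.array([10, 12, 9, 15, 13])
defect_counts = np.array([4, 0, 5, 1, 6])

rates = defect_counts / review_counts

# Assumed illustrative cutoff: flag modules whose defect rate falls below 0.10
THRESHOLD = 0.10
for module, rate in zip(modules, rates):
    if rate < THRESHOLD:
        print(f"{module}: rate {rate:.2f} -> check for superficial reviews")
```

Running this flags ModuleB (0.00) and ModuleD (0.07), matching the visual reading above.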
● Summary of Case 2
This case demonstrates how using a rate rather than absolute counts makes the quality of reviews visible, surfacing the hard-to-see issue of "superficial reviews" with data.
It’s not advanced statistical analysis, but it is a fundamental approach of SQC directly applicable to practice.
Case 3: Focused Management with Defect Distribution (Pareto Analysis)
● Background:
Defects were occurring frequently, but priorities for where to start were unclear.
There were not enough resources to "tackle everything," leading to ineffective improvements.
● Actions:
- Aggregated defect counts by cause category and module
- Created a Pareto chart and confirmed that approximately 80% of defects were caused by the top 20% of categories
- Prioritized improvement activities (training, focused reviews, design improvements) on the top causes
● Key Points:
- In a context where you "can’t address everything," this method visualizes where to invest limited resources.
- By focusing improvements based on numbers, you gain high explanatory power and buy-in on the ground.
- As a representative approach where SQC "results" are readily visible, it makes actual improvement outcomes easy to demonstrate.
● Example: Creating a Pareto Chart by Defect Category (Python)
```python
import matplotlib.pyplot as plt
import numpy as np

# Japanese font settings (for Windows environment)
plt.rcParams['font.family'] = 'Meiryo'

# Defect categories and counts (example)
categories = ['Specification Omission', 'Design Error', 'Coding Error', 'Insufficient Review', 'Test Omission']
counts = np.array([35, 25, 20, 10, 5])

# Sort in descending order by count
sorted_idx = np.argsort(counts)[::-1]
sorted_categories = np.array(categories)[sorted_idx]
sorted_counts = counts[sorted_idx]

# Calculate cumulative percentage
cumulative = np.cumsum(sorted_counts)
cumulative_percent = cumulative / cumulative[-1] * 100

# Plotting: bars for counts on the left axis, cumulative line on the right axis
fig, ax1 = plt.subplots()
ax1.bar(sorted_categories, sorted_counts, color='skyblue', label='Number of Defects')
ax1.set_ylabel('Number of Defects')
ax1.set_ylim(0, max(sorted_counts) * 1.2)
ax1.grid(axis='y', linestyle='--', alpha=0.6)

ax2 = ax1.twinx()
ax2.plot(sorted_categories, cumulative_percent, color='red', marker='o', label='Cumulative Percentage (%)')
ax2.set_ylabel('Cumulative Percentage (%)')
ax2.set_ylim(0, 110)

plt.title('Pareto Chart by Defect Category')
fig.tight_layout()
plt.show()
```
The results were as follows:
● Explanation of Statistical Methods
1. Pareto Analysis
- Visualizes the phenomenon where a minority of causes leads to a majority of results, based on the 80:20 rule
- Combines bar charts (individual values) and line charts (cumulative percentage)
2. Setting Priority of Items
- Not just sorting by count; you can also consider "impact" and "frequency"
- Enables data-driven prioritization of improvement efforts (see the sketch after this list)
3. Decision Support in SQC
- Pareto charts are extremely effective visual tools for explaining the priority and rationale of improvement activities
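Building on the chart code above, here is a minimal sketch of extracting the "vital few" categories programmatically, assuming an 80% cumulative cutoff:

```python
import numpy as np

# Same example data as the Pareto chart above
categories = np.array(['Specification Omission', 'Design Error', 'Coding Error',
                       'Insufficient Review', 'Test Omission'])
counts = np.array([35, 25, 20, 10, 5])

# Sort descending and compute the cumulative percentage
order = np.argsort(counts)[::-1]
sorted_categories = categories[order]
cumulative_percent = np.cumsum(counts[order]) / counts.sum() * 100

# Keep the categories needed to reach the (assumed) 80% cutoff
cutoff = 80
n_vital = np.searchsorted(cumulative_percent, cutoff) + 1
print("Vital few:", list(sorted_categories[:n_vital]))
```

With this data, the top three categories cross the 80% mark, so improvement effort would concentrate there first.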
● Summary of Case 3
This case is a representative example of SQC that shows where to focus resources using data.
It helps achieve maximum effect with limited time, personnel, and cost through “focused management.”
Case 4: Evaluating the Effect of Review Improvement Measures
● Background:
After implementing measures to strengthen review criteria (introducing checklists, formalizing viewpoints, etc.), it was necessary to evaluate, "Did the quality of reviews (number of defects identified) actually improve?"
● Actions:
- Tested the difference in means using a t-test
- Collected review defect count data for "before" and "after" the improvement measures
- Used an unpaired two-sample t-test to evaluate whether there was a difference in average defect counts between the two groups
- If a statistically significant difference was found, we judged the measures to be effective
● Key Points:
- You can judge the effectiveness of measures with data rather than intuition.
- Testing is the most direct way to give statistical evidence to quality improvement decisions.
- Concepts such as null hypothesis, alternative hypothesis, p-value, and significance level are fundamental to statistics.
● Example: Evaluating the Effectiveness of Review Improvement Measures (Python)
```python
import numpy as np
from scipy import stats

# Number of defects identified per review before improvements (example)
before = np.array([2, 3, 1, 2, 3, 2, 1])

# Number of defects identified per review after improvements
after = np.array([4, 5, 3, 4, 5, 6, 4])

# t-test (Welch's t-test, assuming unequal variances)
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.4f}")

# Decision based on results
alpha = 0.05
if p_value < alpha:
    print("〇 Significant difference: The measures are judged to be effective.")
else:
    print("△ No significant difference: The measures' effectiveness cannot be confirmed statistically.")
```
The results were as follows:
```
t-statistic: 5.05
p-value: 0.0003
〇 Significant difference: The measures are judged to be effective.
```
● Explanation of Statistical Methods
1. Hypothesis Testing (t-test)
- Null Hypothesis (H₀): "There is no difference in mean defect counts before and after improvements."
- Alternative Hypothesis (H₁): "There is a difference in means."
- Calculate the t-statistic and p-value, and judge whether there is a difference at the 5% significance level (p < 0.05)
2. Welch's t-test
- A t-test that does not assume equal population variances (see the sketch after this list)
- In practice, sample sizes are often small, making Welch's method more broadly applicable
3. Decision Support
- The test results provide quantitative evidence for the effectiveness of improvement activities
- Directly ties into detecting quality changes in SQC
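For reference, here is a minimal sketch that recomputes the Welch t-statistic by hand, to show what `scipy.stats.ttest_ind(..., equal_var=False)` is doing; it should reproduce the value above:

```python
import numpy as np

before = np.array([2, 3, 1, 2, 3, 2, 1])
after = np.array([4, 5, 3, 4, 5, 6, 4])

# Welch's t-statistic: difference in means over the combined standard error,
# using each group's own sample variance (no equal-variance assumption)
m1, m2 = np.mean(after), np.mean(before)
v1, v2 = np.var(after, ddof=1), np.var(before, ddof=1)
n1, n2 = len(after), len(before)

t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
print(f"t-statistic: {t:.2f}")  # matches the scipy result (5.05)
```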
● Summary of Case 4
This case illustrates answering the question "Did the measures make a difference?" with statistical evidence.
Hypothesis testing is a fitting final approach, providing the technical backing for SQC's "moment of decision."
Case 5: Analyzing the Relationship Between Review Scope and Defect Counts
● Background:
We tested the hypothesis that the more lines of code reviewed, the more defects would be identified.
The goal was to confirm whether review effort corresponds to results and to use this for optimizing review design and viewpoints.
● Actions:
- Collected data on "lines of code reviewed (SLOC)" and "number of defects identified" from past review records
- Used simple regression analysis to model the relationship between review scope and defect counts
- Checked the strength of correlation and slope of the regression equation to consider an appropriate review granularity
● Key Points:
- Validates the balance between review quality and quantity: does "the more you look, the more you find" actually hold?
- From the regression coefficient, you can quantitatively obtain the "expected value" of review effectiveness
- If fewer defects are found than expected, it also raises suspicion of omitted viewpoints or superficial reviews
● Example: Regression Analysis of Review Scope vs. Defect Counts (Python)
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

plt.rcParams['font.family'] = 'Meiryo'  # Japanese font (Windows environment)

# Data: lines of code reviewed (SLOC) and defects identified
x = np.array([100, 200, 300, 400, 500]).reshape(-1, 1)
y = np.array([2, 4, 6, 7, 8])

# Regression analysis
model = LinearRegression()
model.fit(x, y)

# Regression coefficient and intercept
a = model.coef_[0]
b = model.intercept_
print(f"Regression equation: Defect Count = {a:.3f} × Lines + {b:.2f}")

# Visualization
x_pred = np.linspace(50, 600, 100).reshape(-1, 1)
y_pred = model.predict(x_pred)

plt.scatter(x, y, color='blue', label='Actual Data')
plt.plot(x_pred, y_pred, color='red', label='Regression Line')
plt.xlabel('Lines of Code Reviewed (SLOC)')
plt.ylabel('Defect Count')
plt.title('Relationship Between Review Scope and Defect Counts')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
```
The results were as follows:
● Explanation of Statistical Methods
1. Simple Regression Analysis
- Explanatory Variable: Lines of code reviewed (SLOC)
- Response Variable: Number of defects identified
- Regression Line: Estimate for prediction and comparison
2. Meaning of the Regression Coefficient
- A large coefficient means that "as lines increase, defects increase" = a consistent defect density
- A small coefficient means that even with more lines, defects don't increase → suspicion of review superficiality
3. Model Evaluation and Residuals
- Use the coefficient of determination (R²) and residual plots to assess model fit, explanatory power, and pattern biases
Below is model evaluation and residual analysis (visualized in Python).
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

plt.rcParams['font.family'] = 'Meiryo'  # Japanese font (Windows environment)

# Data
x = np.array([100, 200, 300, 400, 500]).reshape(-1, 1)
y = np.array([2, 4, 6, 7, 8])

# Train the model
model = LinearRegression()
model.fit(x, y)

# Prediction and R^2
y_pred = model.predict(x)
r2 = r2_score(y, y_pred)
print(f"Coefficient of determination R^2 = {r2:.3f}")

# Calculate residuals (actual - predicted)
residuals = y - y_pred

# Residual plot: look for random scatter around zero
plt.figure(figsize=(6, 4))
plt.scatter(y_pred, residuals, color='purple')
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values (Defect Count)')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residual Plot (Review Lines vs. Defect Count)')
plt.grid(True)
plt.tight_layout()
plt.show()
```
From the coefficient of determination (R² ≈ 0.97), we see that the number of lines reviewed explains the defect count well.
The residuals are scattered randomly around zero, indicating that the linear regression model fits well.
This result provides a reliable basis for predicting the relationship between review scope and outcomes in practice.
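As a usage example, the fitted line can also provide a rough "expected value" for planning. Below is a minimal sketch, assuming a hypothetical planned review of 350 lines and an illustrative 50% shortfall cutoff:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data and model as above
x = np.array([100, 200, 300, 400, 500]).reshape(-1, 1)
y = np.array([2, 4, 6, 7, 8])
model = LinearRegression().fit(x, y)

# Assumed illustrative scenario: a planned review of 350 lines
planned_lines = 350
expected = model.predict(np.array([[planned_lines]]))[0]
print(f"Expected defects for {planned_lines} lines: {expected:.1f}")

# If the actual count falls well short of the expectation,
# suspect omitted viewpoints or a superficial review
actual = 2  # hypothetical actual result
if actual < expected * 0.5:  # assumed illustrative cutoff
    print("Far fewer defects than expected -> re-examine the review itself")
```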
● Summary of Case 5
In this case, we used regression analysis to visualize the relationship between review scope and outcomes (defect counts), obtaining evidence to optimize review activities and review structure.
This is a highly practical approach for delving into the relationship between “quantity” and “results” in quality activities.
Key Points for Introducing SQC (Statistics Don’t Have to Be Complicated)
- The essence of SQC is not to end with mere number reporting but to discover "changes" and "deviations" that lead to insights. Even without complex formulas or models, simple visualizations such as bar charts, Pareto charts, and time-series trends of metrics can deliver ample practical impact (a minimal sketch of such a trend chart follows this list).
- Most of the cases covered in this series consist of basic descriptive statistics plus simple methods, such as ratios, densities, and difference tests. You don't need complicated theories; what matters is the ability to gain insight, make judgments, and provide explanations.
- Also, don't try to run SQC single-handedly. By building regular sharing, visualization, and review mechanisms into the team, you can more easily cultivate a culture of continuous quality improvement.

In SQC, the attitude of "observing, thinking, and dialoguing" is far more important than statistical knowledge.
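As one last sketch, here is a minimal example of the time-series metric trend mentioned above: defect density per release plotted against a baseline mean and a mean + 3σ guide line (all figures are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented illustrative data: defect density (defects/KLOC) per release
releases = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']
densities = np.array([13.1, 12.4, 14.0, 13.5, 15.2, 19.8])

mean = densities[:-1].mean()                    # baseline from earlier releases
upper = mean + 3 * densities[:-1].std(ddof=1)   # mean + 3σ guide line

plt.plot(releases, densities, marker='o', label='Defect Density')
plt.axhline(mean, color='green', linestyle='--', label='Baseline Mean')
plt.axhline(upper, color='red', linestyle='--', label='Mean + 3σ')
plt.ylabel('Defects / KLOC')
plt.title('Defect Density Trend Across Releases')
plt.legend()
plt.tight_layout()
plt.show()
```

In this invented data, the latest release clearly crosses the guide line, which is exactly the kind of "deviation" a team review should pick up.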
Concluding This Series
Thank you for following the "Let's Talk About Statistics" series over 15 installments.
This series has focused on how statistics can be useful in software quality settings, introducing both fundamental statistical methods and real-world case studies.
I hope you have learned not just theoretical explanations but also the concepts and techniques to visualize quality and turn it into improvement.
Statistics are tools for "observation, insight, and dialogue," more than for complicated formulas.
While this marks a pause for the series, I plan to share additional statistical and quality management topics useful for practice and education as special editions as needed.
Please continue to expand the culture of talking about quality using statistics in your own workplaces.
I've compiled related statistical information here.
I hope you find it useful for your data analysis.