IT Industry Trivia: When Two 'Justices' Collide: The True Nature of Standards Schisms Hidden in Software
To reach a broader audience, this article has been translated from Japanese.
You can find the original version here.
Introduction: The "Two Standards" Issue Lurking in the Industry
In the world of software development, "standards schisms" (cases where specifications or behaviors split into two) often occur.
It tends to prompt the question, "Which one is correct?", but in reality, each simply has its own historical and technical background.
This time, we'll introduce some examples related to these "two standards."
Arrays: Starting at "0" or "1"? ~ The "Conventional Wisdom" That Changes with the Language ~
Depending on the programming language, the starting point for array indices is broadly divided into three patterns.
● "0"-Based Languages
Most modern mainstream languages start array indices at "0". This is because an index then maps directly to a memory offset, which aligns well with the hardware.
Representative language examples:
- C / C++
- Java / Kotlin
- Python / JavaScript
- C# / Go / Rust
- Swift / Ruby / Perl
In these languages, the first element of array a is accessed with a[0].
This is based on the low-level design concept of referencing by adding an offset (0, 1, 2, ...) to the array's base address.
For example, let's say the array a is stored in consecutive memory as follows:
Index | Memory Address (Example) |
---|---|
a[0] | 0x1000 |
a[1] | 0x1004 |
a[2] | 0x1008 |
The important point here is:
The actual access to a[n] is based on "base address + n-th offset."
In other words, a[0] means "the 0th element from the start (=the start itself)", which is very natural for pointer-arithmetic-based languages like C.
Also, at the machine-code level, the element address is simply "base address + n × element size" with no extra subtraction, which makes 0-based indexing slightly cheaper to implement.
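The address arithmetic in the table can be sketched in a few lines of Python. The function names and the 4-byte element size are assumptions for illustration, chosen to match the example addresses above.

```python
def element_address(base: int, index: int, element_size: int = 4) -> int:
    """Address of a[index] for a 0-based array stored contiguously from `base`."""
    return base + index * element_size


def element_address_one_based(base: int, index: int, element_size: int = 4) -> int:
    """Same lookup for a 1-based array: the index has to be shifted by one first."""
    return base + (index - 1) * element_size


for n in range(3):
    print(f"a[{n}] -> {hex(element_address(0x1000, n))}")
```

The 1-based variant needs one extra subtraction per access, which is the small cost the 0-based design avoids.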
● "1"-Based Languages
Languages that prioritize mathematical consistency adopt 1-based indexing to match the natural numbers.
Representative language examples:
- FORTRAN / R / COBOL
- Lua / MATLAB / Julia
- Smalltalk
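In mathematics, array (or sequence) indices usually start at "1":
- Matrices: $A = (a_{ij})$, whose top-left entry is $a_{11}$
- Summation notation: $\sum_{i=1}^{n} a_i$
- Fibonacci sequence: $F_1 = 1,\ F_2 = 1,\ F_3 = 2,\ \dots$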
This is based on the intuitive notion of "counting", where the "first item is the 1st."
In languages like FORTRAN and R, which focus on scientific computing and formula processing, 1-based indexing feels more natural.
For example, in R, x[1] refers to the first element.
This matches the mathematical intuition (the first is number 1) and is natural for representing matrices and sequences.
● Languages with Customizable Index Ranges
There are also languages that allow flexible specification of index ranges.
This design lets you choose either 0-based or 1-based indexing.
Representative language examples:
- Pascal (range specification like array[1..10] is possible)
- Ada (explicit definition such as array(0..9) or array(1..10))
- Fortran (you can specify the starting point like dimension(0:9))
- Visual Basic 6 and earlier (you can switch between 0-based and 1-based using Option Base)
This flexibility is helpful for index design that suits specific needs.
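To make the idea concrete, here is a minimal Python sketch of an array with a declared index range. RangedArray is a hypothetical helper written for this article, not part of any library; it only mimics a declaration such as Pascal's array[1..10].

```python
class RangedArray:
    """Fixed-size array whose valid indices run from `low` to `high` inclusive.

    Hypothetical helper that mimics declarations such as Pascal's array[1..10].
    """

    def __init__(self, low: int, high: int, fill=0):
        if high < low:
            raise ValueError("high must not be smaller than low")
        self.low = low
        self.high = high
        self._items = [fill] * (high - low + 1)

    def _offset(self, index: int) -> int:
        # Translate the declared index into a 0-based position in the backing list.
        if not self.low <= index <= self.high:
            raise IndexError(f"index {index} outside {self.low}..{self.high}")
        return index - self.low

    def __getitem__(self, index: int):
        return self._items[self._offset(index)]

    def __setitem__(self, index: int, value) -> None:
        self._items[self._offset(index)] = value

    def __len__(self) -> int:
        return len(self._items)


one_based = RangedArray(1, 10)
one_based[1] = "first"
one_based[10] = "last"
```

Internally the storage is still 0-based; the declared lower bound is just an offset applied on every access.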
● Impact of Choosing an Index Starting Point
- "0"-based: emphasizes memory efficiency and low-level operation optimization
- "1"-based: emphasizes mathematical naturalness and formula readability
- Customizable: enables flexible, high-abstraction-level design
The key is not which is "correct", but that the appropriate choice is made based on the purpose and context.
In Visual Basic (VB), array index specifications differ by version. Declarations like Dim a(10) can be interpreted differently depending on the language or settings, so caution is needed.
- In VB6 and earlier, Dim a(10) creates an array with 11 elements indexed from 0 to 10. If you specify Option Base 1, you can change the starting index to 1.
- In VB.NET, Option Base is not supported, and arrays are always 0-based. So Dim a(10) means an array with 11 elements indexed from 0 to 10.
In this way, C and Python specify "the number of elements", while VB specifies "the upper bound index", and the two approaches are fundamentally different.
Example: a declaration written with the number 10
- VB.NET → Dim a(10) gives a(0) through a(10), for 11 elements
- C / Python → a size of 10 gives indices 0 through 9, for 10 elements
Even seemingly similar syntax can have reversed meanings depending on the language, so special care is required in operations involving array sizes.
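The two readings of the number in a declaration can be compared side by side. This is a minimal sketch; the function names are made up for illustration, and the C-style reading assumes a declaration such as int a[10].

```python
def indices_from_upper_bound(upper_bound: int) -> range:
    """VB-style reading of Dim a(10): the number is the last valid index."""
    return range(0, upper_bound + 1)


def indices_from_count(count: int) -> range:
    """C-style reading (assumed: int a[10]): the number is the element count."""
    return range(0, count)


vb_indices = indices_from_upper_bound(10)
c_indices = indices_from_count(10)

print(len(vb_indices), vb_indices[-1])  # element count and last index, VB reading
print(len(c_indices), c_indices[-1])    # element count and last index, C reading
```

Porting code between the two styles without adjusting for this difference is a classic source of off-by-one errors.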
● Summary of Array Index Differences
- "0"-based is a design philosophy emphasizing implementation efficiency based on hardware structure and pointer offsets
- "1"-based is a philosophy emphasizing mathematical naturalness and readability, a legacy of formula-oriented culture
It's not about which is "correct", but that the "appropriate" choice varies by purpose and background.
0-based vs 1-based reflects differences in where you place the axis of your worldview.
Byte Order: Big Endian vs Little Endian
Byte order (endianness) determines which byte is stored first when multibyte data (e.g., 16-bit, 32-bit, 64-bit integers) is placed into memory.
● Big Endian
- Definition: Store the Most Significant Byte (MSB) first (at the lower memory address)
- Examples of use: Network communications (TCP/IP standard), some RISC architectures (SPARC, older PowerPC, etc.)
Example: Storing 0x12345678 (in order of increasing addresses)
Big Endian: [0x12][0x34][0x56][0x78]
- Philosophy: Designed to be intuitive to read by placing higher-order bytes first, similar to decimal notation
- Background: Some instruction set architectures were designed so that "opcode → operand" appear in that order
An early CPU that adopted Big Endian was the Motorola 6809 (1978). It was later followed by the widely used 68000 series (1979), powering many commercial machines like the Apple Macintosh.
● Little Endian
- Definition: Store the Least Significant Byte (LSB) first (at the lower memory address)
- Examples of use: Intel CPUs (x86, x86_64), ARM (default is Little Endian)
Example: Storing 0x12345678
Little Endian: [0x78][0x56][0x34][0x12]
- Philosophy: Optimized for processors that often process from the lower-order byte first in numeric operations
- Background: Some designs favored reading only the lower-order byte first for operations like jumps
● Why Did They Diverge? Differences in Design Philosophy
This difference is not just a matter of taste but stems from different priorities in early hardware architecture designs.
- Big Endian: Prioritizes readability for humans and visibility of instructions (e.g., opcode first, then high-order operands)
- Little Endian: Optimized for the processor's convenience in handling arithmetic operations from the lower-order side
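The two layouts of 0x12345678 shown earlier can be reproduced with Python's built-in integer conversions. This is only an illustrative sketch; the variable names are arbitrary.

```python
value = 0x12345678

# Serialize the same 32-bit value in both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Reading big-endian bytes under a little-endian assumption yields a different number.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))     # 0x78563412
```

The misread value is the kind of silent corruption that appears when two sides disagree on byte order.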
● Challenges in Mixed Environments
In cross-network communication or binary file interoperability, mismatched endianness can cause bugs or data corruption.
For example:
- TCP/IP uses Big Endian (network byte order) by standard
- Windows binary files use Little Endian (Intel)
- Sending a struct between different endianness environments can scramble field interpretations
● Handling Different Endianness
In environments where Big Endian and Little Endian coexist, explicit conversion or adjustment measures are required. Common approaches include:
- Using APIs like htonl() / ntohl() to convert between host and network byte order (common in C and system programming)
- Specifying endianness in file formats
  Example: WAV, PNG, TIFF, etc., include byte order in their specifications
- Agreeing on order at the protocol level
  Example: Protocol Buffers and MessagePack are designed to be used without worrying about byte order
Recognizing differences in endianness and handling them explicitly is crucial for building reliable software.
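In Python, the struct module plays the role that htonl() / ntohl() play in C. The sketch below assumes a hypothetical wire format, invented for this example, in which each message starts with a 4-byte length in network byte order.

```python
import struct


def encode_length_prefixed(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte big-endian integer.

    "!" in the format string means network byte order, regardless of the host CPU.
    """
    return struct.pack("!I", len(payload)) + payload


def decode_length_prefixed(message: bytes) -> bytes:
    """Read the 4-byte network-order length, then return exactly that many bytes."""
    (length,) = struct.unpack("!I", message[:4])
    return message[4:4 + length]


message = encode_length_prefixed(b"hello")
print(message[:4].hex(" "))
print(decode_length_prefixed(message))
```

Because the byte order is named in the format string, the same code produces identical bytes on little-endian and big-endian hosts.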
● Origin of the names
"Big" and "Little" come from the egg-breaking debate in Jonathan Swift's Gulliver's Travels, about whether to break a boiled egg from the big end or the small end (Big-Endian vs Little-Endian).
It symbolizes "serious conflict arising from trivial differences."
Stack Argument Ordering: From Front or From Back?
When a function is called, its arguments are traditionally pushed onto the stack and passed that way.
● What is a stack?
A region in memory that manages data in a Last-In-First-Out (LIFO) manner. Function arguments, return addresses, and local variables are stored here.
A new stack frame is pushed for each function call and popped off when it completes.
There are two key points to watch:
- In what order are arguments pushed onto the stack: front (left) to back (right), or back to front?
- After the call, who cleans up the argument area on the stack?
  - The caller
  - The callee
These differences are called "calling conventions" and vary by architecture and platform.
If not understood correctly, the stack can become corrupted after a function call, leading to unexpected behavior or crashes.
● Argument Pushing: Front to Back or Back to Front?
- Right-to-Left (push from back): the most common method (e.g., cdecl)
  - Works well with variadic arguments (like printf())
  - Pushes the last argument first
- Left-to-Right (push from front): Pascal-family calling conventions (e.g., pascal)
  - Poor fit for variadic arguments (like printf())
  - Pushes the first argument first
● Stack Cleanup: Caller or Callee?
- Caller clean-up (caller cleans the stack): representative: cdecl
  - Easier to support variadic arguments
  - The caller must know the number of arguments
- Callee clean-up (callee cleans the stack): representative: stdcall
  - Keeps the caller side simple
  - The number of arguments must be fixed
● Examples of Argument Pushing (Right-to-Left / Left-to-Right)
Consider calling sum(a, b, c).
■ Right-to-Left (push from back): cdecl, etc.
int result = sum(1, 2, 3); // caller
In this case, the stack is pushed in this order:
push 3 ← last argument
push 2
push 1 ← first argument
call sum
→ The first argument is pushed last, so it ends up on top of the argument area.
→ Good compatibility with variadic arguments (e.g., printf("%d", x)), because the callee always finds the first argument at a fixed position.
→ The caller cleans up the stack.
■ Left-to-Right (push from front): pascal, etc.
result := sum(1, 2, 3); // Pascal-style call
In this case, the stack is pushed in this order:
push 1 ← first argument
push 2
push 3 ← last argument
call sum
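→ The arguments are pushed in source order, so the last argument ends up on top.
→ The push order reads naturally, but the position of the first argument depends on the argument count, which makes this unsuitable for variadic arguments.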
→ Often the callee cleans up the stack.
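The effect of the push order can be simulated with a plain Python list used as a stack. This is a toy model only (no return address or registers), and push_arguments is a name made up for this sketch.

```python
def push_arguments(args, right_to_left: bool):
    """Return the simulated stack after pushing args; the last list item is the top."""
    stack = []
    order = reversed(args) if right_to_left else args
    for arg in order:
        stack.append(arg)  # push
    return stack


cdecl_stack = push_arguments([1, 2, 3], right_to_left=True)
pascal_stack = push_arguments([1, 2, 3], right_to_left=False)

print(cdecl_stack[-1])   # 1: the first argument is on top
print(pascal_stack[-1])  # 3: the last argument is on top
```

With right-to-left pushing the first argument is always at the top, whatever the argument count, which is what lets a function like printf() locate its format string.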
● Note
There are also register-based conventions (like fastcall), where the first few arguments go into registers and the rest are pushed on the stack from right to left.
Because cleanup methods vary by convention, unified calling conventions are essential for function call compatibility.
● Performance and Compatibility
Calling Convention | Argument Order | Variadic Support | Notes |
---|---|---|---|
cdecl | Right → Left | Yes | Standard in C; caller cleans the stack |
stdcall | Right → Left | No | Used by the Windows API; callee cleans the stack |
pascal | Left → Right | No | Old Pascal-style; emphasizes readability |
fastcall | Registers first + Right → Left | Limited | First two arguments in registers; the rest pushed right to left |
vectorcall | Registers first + Right → Left | Limited | Actively uses floating-point/SIMD registers; a Microsoft convention for x86 and x64 |
- Register-based passing (fastcall, vectorcall, etc.) was introduced to speed up function calls.
- Different ABIs can lead to incompatibility between libraries and binaries.
- Notably, Windows and Linux have different default calling conventions (e.g., on 32-bit x86 the Windows API uses stdcall while Linux uses cdecl; on x86-64, Windows uses the Microsoft x64 convention and Linux uses the System V AMD64 ABI).
● Summary of Stack Argument Practices
- Argument pushing order and stack cleanup responsibility are defined by calling conventions.
- The appropriate choice depends on factors like variadic arguments, performance, and binary compatibility.
- In cross-platform development and interface design, it is essential to explicitly align calling conventions.
Character Encoding: UTF-8 vs UTF-16
Differences in character encoding affect text processing, internationalization, file storage, communication protocols, and more. UTF-8 and UTF-16 are representative Unicode encodings, but they have distinct use cases and features.
● UTF-8
- Features: A variable-length (1 to 4 bytes) Unicode encoding
- Compatibility: Binary-compatible with ASCII (0x00–0x7F); highly compatible with existing C strings
- Adoption Examples:
- Web (HTML, HTTP, JSON, etc.) standards
- File systems in Linux / macOS
- Languages like Go, Rust, Python
- Advantages:
- Compact for content dominated by English text
- Strong for communication and storage
- Disadvantages:
- Hard to perform random access (one character ≠ one byte)
● UTF-16
- Features: A Unicode encoding using mainly 2 bytes (or 4 bytes)
- Compatibility: Not ASCII-compatible (even ASCII characters use 2 bytes)
- Adoption Examples:
- Internal Windows APIs (since Windows NT)
- Java’s char type, .NET’s System.String
- Advantages:
- Efficient for East Asian text (often fits in 2 bytes)
- Easier random access (basically 2-byte units)
- Disadvantages:
- Some characters (e.g., emojis, supplementary Kanji) require surrogate pairs (4 bytes)
- Prone to portability and binary incompatibility issues
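The size trade-offs listed above are easy to check with Python's built-in codecs. The sample characters below are arbitrary picks for illustration, written as escapes so the source stays ASCII.

```python
def encoded_sizes(ch: str):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "Latin letter A": "A",
    "Latin e with acute": "\u00e9",
    "Hiragana a": "\u3042",
    "Emoji (grinning face)": "\U0001F600",
}

for label, ch in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

ASCII text is half the size in UTF-8, kana and most kanji are smaller in UTF-16, and characters outside the Basic Multilingual Plane take 4 bytes in both.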
● Surrogate Pair Issues
In UTF-16, characters at U+10000 and above, i.e., outside the Basic Multilingual Plane (e.g., some emojis and historic characters), are represented with two 16-bit values (surrogate pairs).
Programs that cannot handle this correctly may suffer from mojibake, crashes, or security vulnerabilities.
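The surrogate pair calculation itself is short. The sketch below follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits in two); the function name is made up for this example.

```python
def to_surrogate_pair(code_point: int):
    """Split a supplementary code point into (high surrogate, low surrogate)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Code that treats each 16-bit unit as one character will see this emoji as two "characters" and may cut it in half when slicing or truncating.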
● The Cause of Confusion in Japan: Shift_JIS
The Japanese-specific encoding Shift_JIS is still used in some contexts (legacy Windows software, emails, CSV, etc.).
- Byte-level ambiguities (e.g., 0x5C may be "¥" or "\")
- Problems often arise in conversion between Unicode and Shift_JIS
- In environments mixing UTF-8, UTF-16, and Shift_JIS, mojibake, improper processing, and parsing difficulties frequently occur
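Two of these pitfalls can be reproduced with Python's cp932 codec (the Windows variant of Shift_JIS). This is a minimal sketch; the helper name is_valid_utf8 is made up, and the sample characters are written as escapes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The kanji U+8868 is a well-known case: its second Shift_JIS byte is 0x5C,
# the same byte value used for the backslash / yen sign.
kanji = "\u8868"
sjis = kanji.encode("cp932")
print(sjis.hex(" "))  # 95 5c

# Japanese text saved as Shift_JIS is not valid UTF-8, so a reader that
# assumes UTF-8 either fails or shows mojibake.
nihongo = "\u65e5\u672c\u8a9e"
print(is_valid_utf8(nihongo.encode("cp932")))
print(is_valid_utf8(nihongo.encode("utf-8")))
```

Byte-oriented code that searches for 0x5C as an escape or path separator can therefore split a Shift_JIS character in the middle.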
● Saving CSV in Excel: Default is not UTF-8
When you save as "CSV (Comma delimited)" in Japanese Windows, it is still saved in Shift_JIS (CP932).
To save in UTF-8, you must explicitly choose "CSV UTF-8 (Comma delimited)" (introduced in Excel 2016).
Many users still save as .csv by default, assume it’s UTF-8, and encounter mojibake.
Personally, I find this behavior extremely frustrating.
● Summary of Character Encodings
Feature | UTF-8 | UTF-16 |
---|---|---|
Byte Length | Variable (1–4 bytes) | Mainly 2 bytes (+ 4 bytes for surrogates) |
ASCII Compatible | Yes | No |
Advantages | Compact, good for communication and storage | Random access, fast internal processing |
Main Use Cases | Web, Linux, file storage | Windows, Java/.NET |
Choosing an encoding requires understanding the target system, data, and compatibility requirements. In cross-language or cross-environment development, explicit conversion and checks are crucial.
Newline Codes: LF vs CRLF
Text file newline codes (line-ending characters) vary by OS and tools due to historical reasons, often causing compatibility or version control issues.
● Meaning of Newline Codes
Newline Code | Symbol | ASCII Code | Meaning |
---|---|---|---|
LF | \n | 0x0A | Line Feed |
CR | \r | 0x0D | Carriage Return |
CRLF | \r\n | 0x0D 0x0A | Carriage Return + Line Feed |
● OS Standards
OS/Environment | Newline Code | Notes |
---|---|---|
Linux | LF | Unix tradition; simple 1-byte code |
macOS (current) | LF | Mac OS X and later use LF (classic Mac OS used CR) |
Windows | CRLF | Still maintained as standard for text files |
GitHub | LF recommended | To maintain consistency in repositories |
● Common Issues
- diff shows changes for every line
  Files appear fully changed due to different newline codes
- Mojibake / build failures
  Scripts created on Windows fail on Linux with ^M errors
- Git management confusion
  Newline changes are repeatedly detected as modifications
● Solutions
- In Git, explicitly control newline codes with .gitattributes:
      * text=auto
      *.sh text eol=lf
      *.bat text eol=crlf
- Standardize settings in editors (e.g., VSCode, IntelliJ)
- Add newline checks in CI (continuous integration)
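A newline check for CI can be very small. The sketch below works on raw bytes and is only one possible approach; the function names are made up, and wiring it to real files or a CI job is left out.

```python
def find_crlf_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines that end with CRLF."""
    lines = data.split(b"\n")
    return [number for number, line in enumerate(lines, start=1) if line.endswith(b"\r")]


def normalize_to_lf(data: bytes) -> bytes:
    """Convert every CRLF sequence to a single LF."""
    return data.replace(b"\r\n", b"\n")


sample = b"#!/bin/sh\r\necho ok\necho done\r\n"
print(find_crlf_lines(sample))   # [1, 3]
print(normalize_to_lf(sample))
```

A CI step could fail the build whenever find_crlf_lines() returns a non-empty list for a file that is supposed to use LF.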
● Note on macOS history
- Classic Mac OS (up to Mac OS 9): used CR (0x0D) for newlines
- Mac OS X and later (Unix-based): switched to LF, following the Unix convention
● Summary of Newline Codes
- Mismatched newline codes can be a major friction point in team development and cross-platform environments.
- Setting rules early and automating/enforcing them is key to avoiding pitfalls.
Floating-Point Rounding: IEEE 754 vs Commercial Rounding
Floating-point rounding methods are critical to calculation precision and business accuracy. There are different conventions in scientific/technical contexts and commercial contexts regarding how to round halfway values (e.g., 2.5).
● IEEE 754: Round to Nearest, Even
- Definition: Halfway values (e.g., 2.5, 3.5) are rounded to the nearest even number
- Also known as: "Bankers' rounding"
- Characteristics:
- Aims to eliminate bias in rounding direction
- Has a statistical effect of canceling out rounding errors in aggregation
- Examples:
- 2.5 → 2 (even)
- 3.5 → 4 (even)
- 1.25 → 1.2 (when rounded to one decimal place)
● Commercial Rounding: Round Away From Zero
- Definition: .5 values are always rounded in the direction away from zero
- Characteristics:
- Closer to user intuition
- Widely used in accounting and financial calculations
- Examples:
- 2.5 → 3
- -2.5 → -3
- 1.25 → 1.3 (when rounded to one decimal place)
● Comparison and Use Cases
Aspect | IEEE 754 (Round to Even) | Commercial Rounding (Away From Zero) |
---|---|---|
Application Domains | Science, measurement, standard computations | Accounting, sales, UI display |
Halfway Handling | Round toward even | Round away from zero |
Rounding Bias | Statistically neutral | Can accumulate bias |
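Both rules can be tried with Python's decimal module, which lets the rounding mode be named explicitly. This is a minimal sketch; round_decimal is a helper name made up for this example, and in the decimal module ROUND_HALF_UP rounds ties away from zero.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_decimal(text: str, places: str, mode: str) -> Decimal:
    """Round the decimal string `text` to the precision of `places` using `mode`."""
    return Decimal(text).quantize(Decimal(places), rounding=mode)


cases = [("2.5", "1"), ("3.5", "1"), ("-2.5", "1"), ("1.25", "0.1")]
for text, places in cases:
    even = round_decimal(text, places, ROUND_HALF_EVEN)
    away = round_decimal(text, places, ROUND_HALF_UP)
    print(f"{text}: half-even {even}, half-away-from-zero {away}")
```

Values are passed as strings so that they are exact decimals; building a Decimal from a binary float such as 2.675 would carry the float's representation error into the result.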
● Differences by Implementation or Function
Language / Library | Default Rounding | Notes |
---|---|---|
C / C++ (roundf) | Round half away from zero | Defined by the C standard; rint / nearbyint follow the current rounding mode (round to even by default) |
Python (round()) | IEEE 754 (round to even) | Python 3 uses even rounding; round(2.5) yields 2. |
Excel | Commercial rounding | ROUND function always rounds away from zero |
Java (BigDecimal) | Selectable | You can specify RoundingMode.HALF_EVEN, etc. |
There is a difference in the round() function between Python 2 and Python 3.
Python 2 uses round-away-from-zero for .5 values, while Python 3 defaults to round-to-even.
● Caution
- Rounding rules can be legally defined for tax or interest calculations.
- In microsecond-level timing or simulations, accumulated rounding errors can critically affect results.
- If rounding methods are not specified in specifications, implementations may behave differently.
● Summary of Floating-Point Rounding
- IEEE 754: Strives for statistical neutrality
- Commercial Rounding: Natural for users and suitable for financial calculations
Choose rounding rules deliberately and make it a habit to specify them explicitly.
Decimal Notation: Period vs Comma
In numeric notation, the symbol used for the decimal point varies by country or culture. This difference can cause serious confusion in CSV parsing, Excel data interpretation, and software internationalization.
● Major Notation Differences
Notation Example | Region/Country | Description |
---|---|---|
3.14 | Japan, USA, UK, etc. | Uses period . as the decimal point |
3,14 | Germany, France, Italy, Russia, etc. | Uses comma , as the decimal point |
● Thousands Separators Are Reversed Too
Number | Period-Decimal Notation | Comma-Decimal Notation |
---|---|---|
1,234.56 | 1,234.56 | 1.234,56 |
→ Period and comma usage are completely opposite!
● Typical Problems
- CSV files treat "3,14" as a string
  In English-configured Excel or Python, commas are seen as separators, causing errors
- Numeric calculations fail
  Automatic parsing fails, and addition/aggregation does not work correctly
- Behavior differences by Excel locale
  - In Japanese/English settings, period is decimal and comma is thousands separator
  - In German settings, comma is decimal and period is thousands separator
● Solutions
- Specify locale when reading CSV (Excel or pandas)
  Example: pd.read_csv("file.csv", sep=";", decimal=",")
- Design UI with user locale in mind
- Use a common internal notation (e.g., period) and convert at I/O boundaries
● Example: Reading with pandas in Python
    import pandas as pd
    # Handling German-locale CSV
    df = pd.read_csv("data.csv", sep=';', decimal=',')
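Where pandas is not available, the "convert at the I/O boundary" idea can be sketched with the standard library alone. The function name, the sample data, and the default separators below are assumptions for illustration (German-style notation with a semicolon-delimited CSV).

```python
import csv
import io


def parse_localized_number(text: str, decimal_sep: str = ",", thousands_sep: str = ".") -> float:
    """Convert a localized number string to float by normalizing its separators."""
    normalized = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(normalized)


# Hypothetical German-locale CSV content: semicolon delimiter, comma decimals.
german_csv = "name;price\nwidget;1.234,56\ngadget;3,14\n"

rows = list(csv.DictReader(io.StringIO(german_csv), delimiter=";"))
prices = [parse_localized_number(row["price"]) for row in rows]
print(prices)
```

Once the values are plain floats (or Decimals), the rest of the program no longer needs to know which locale the file came from.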
● Summary of Decimal Notation
- Decimal point notation varies drastically by country.
- Be especially careful with CSV and Excel. Locale misunderstandings are common.
- A practical approach is unified internal notation + locale-aware I/O conversions.
File Path Separators: Slash (Unix) vs Backslash (Windows)
When representing file paths (directory structures), the separator character differs by OS.
These differences impact cross-platform development, shell scripts, and library interoperability.
● Path Separators by OS
OS / Environment | Separator Character | Example | Notes |
---|---|---|---|
Unix/Linux | / | /usr/local/bin | Standard slash notation |
macOS | / | /Applications/App | Unix-based, so the same |
Windows | \ | C:\Program Files\App | Backslash (\) is standard |
Web / URL | / | https://example.com/path/to/resource | URLs always use slash |
● Points to Note
- The Windows separator, backslash, also serves as the escape character in string literals of many languages, so be careful
  - e.g., \n means newline, \t means tab
  - You may need to escape paths like "C:\\path\\to\\file"
- In some languages like Python, slashes are also accepted for Windows paths
      # Works on Windows too
      path = "C:/Users/YourName/Documents"
● Solutions & Best Practices
- Use language/environment-independent methods:
  - Python: os.path.join() or pathlib.Path
  - Java: Paths.get() or File.separator
  - .NET: Path.Combine()
- Use slashes in scripts and config files to maintain Unix compatibility
- Clearly distinguish between absolute and relative paths
  Path resolution issues can often arise in shell scripts or CI/CD.
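As one illustration of the OS-independent approach, Python's pathlib offers "pure" path classes that apply Windows or POSIX rules on any machine. The paths and the file name report.txt below are placeholders.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Build paths with the / operator instead of hard-coding a separator.
windows_path = PureWindowsPath("C:/Users/YourName/Documents") / "report.txt"
posix_path = PurePosixPath("/usr/local") / "bin"

print(windows_path)             # C:\Users\YourName\Documents\report.txt
print(windows_path.as_posix())  # C:/Users/YourName/Documents/report.txt
print(posix_path)               # /usr/local/bin

# Escaped and raw string literals describe the same Windows path.
escaped = "C:\\path\\to\\file"
raw = r"C:\path\to\file"
print(escaped == raw)           # True
```

In application code, pathlib.Path picks the right flavor for the current OS automatically; the pure classes are mainly useful for handling paths that belong to another platform.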
● IDE Workarounds
Many IDEs (Visual Studio Code, IntelliJ, etc.) internally handle OS-dependent path separators.
However, be cautious with external files (CSV, batch scripts, Makefile, etc.).
● Summary of File Path Separators
- Differences in path separators can be a minefield causing unexpected failures.
- The key is to use OS-independent APIs internally and explicit rules for external notation.
Conclusion: How Should We Handle "Two Standards"?
As we've seen, the software development field often encounters cases of "two standards."
These are not just confusion or schisms but rather the result of optimizations based on distinct technical backgrounds, histories, and purposes.
Such differences are sometimes jokingly called "religious wars," but in reality, they mostly represent choices based on design philosophy, compatibility, and maintainability.
The question isn't "Which one is correct?" but rather "Why is it that way?" and "How can we reconcile them?"
To operate smoothly in practice, you need two things:
- To know that there are two (or more) options
- To be flexible and adapt as needed
Rather than being thrown off by differences in specifications and standards, a professional brings the technical and coordination skills needed to adapt to the other side.