Benford_law

In 1938, physicist Frank Benford described a numerical regularity now known as Benford's Law (the pattern had been noted earlier by Simon Newcomb in 1881): in many real-world data sets, the first digit of a number is "1" far more often than it is "9". The probability that the first digit is d follows a logarithmic distribution, P(d) = log10(1 + 1/d), so "1" leads about 30.1% of the time, "2" about 17.6%, and "9" only about 4.6%.

Benford's Law can help flag possible fabrication in the following fields. A deviation from the law is a reason to look closer, not proof of wrongdoing.

  1. Financial auditing: Figures in a company's financial statements whose first digits deviate from Benford's Law can be flagged for closer review as potential signs of fraud.
  2. Social science research: In data collection and analysis, Benford's Law can help identify anomalous data and potential survey errors.
  3. Election monitoring: Benford's Law has been used to screen vote counts for irregularities, although its reliability for election data is disputed.
  4. Forensic science: Benford's Law can be used to screen many kinds of numerical records for values that deviate from the expected first-digit pattern.
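
To make the screening idea above concrete, here is a minimal sketch of a first-digit test in Python. It assumes a Pearson chi-square statistic as the measure of deviation, which is one common choice and not the only one; the function names `leading_digit` and `chi_square_vs_benford` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter

# Expected first-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """Return the first significant digit (1-9) of a non-zero finite number."""
    value = abs(value)
    if value == 0 or not math.isfinite(value):
        raise ValueError("leading digit is undefined for 0, inf, or nan")
    # Scientific notation starts with the first significant digit,
    # e.g. 0.0042 -> '4.2000000000000000e-03'. 17 significant digits
    # avoid rounding a value such as 9.9999999 up to 10.
    return int(f"{value:.16e}"[0])


def chi_square_vs_benford(values):
    """Pearson chi-square statistic of observed first digits vs. Benford."""
    digits = [leading_digit(v) for v in values if v != 0]  # zeros are skipped
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )
```

With nine digit categories the statistic has 8 degrees of freedom, for which the 5% critical value is about 15.51. A larger statistic suggests the first digits deviate from Benford's Law, though with very large samples even small, harmless deviations can exceed that threshold.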

The expected proportion of numbers starting with each digit is:

1: 30.1%
2: 17.6%
3: 12.5%
4: 9.7%
5: 7.9%
6: 6.7%
7: 5.8%
8: 5.1%
9: 4.6%
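
These percentages follow from the formula P(d) = log10(1 + 1/d). The short Python sketch below recomputes them; the helper name `benford_expected` is hypothetical.

```python
import math


def benford_expected():
    """Map each first digit 1-9 to its expected probability under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


# Print the table as percentages rounded to one decimal place.
for digit, p in benford_expected().items():
    print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to 1, because the terms log10((d + 1) / d) telescope to log10(10).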

Before using Benford's Law for analysis, check that the data set is suitable for it by considering the following criteria:

  1. Spanning multiple orders of magnitude: Data sets that fit Benford's Law usually contain values spanning multiple orders of magnitude (i.e., single digits, tens, hundreds, etc.). If the values in a data set all fall within the same order of magnitude (e.g., all in the thousands), Benford's Law might not apply.
  2. No artificial restrictions: If the values in the data set are subject to an artificial limit such as a minimum, maximum, or fixed range (e.g., all values are integers below 100), Benford's Law might not apply.
  3. Not drawn from a single narrow distribution: If the data is sampled from one specific distribution with a limited spread (e.g., a normal or uniform distribution), Benford's Law might not apply. Data sets that fit Benford's Law are often derived from a mixture of processes or distributions.
  4. Not artificially selected: If the values are assigned by people (e.g., list prices, fixed fees) rather than arising naturally, Benford's Law might not apply.
  5. Data source: Data from natural, social, and economic phenomena is more likely to follow Benford's Law.
  6. Sufficient sample size: Data sets that fit Benford's Law should contain a large enough number of observations. With a small sample, the observed digit frequencies are too noisy to compare meaningfully against the expected ones.
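
Two of these criteria, the range of magnitudes and the sample size, can be checked mechanically; the others require knowing how the data was produced. The Python sketch below shows one way to run such a pre-check. The function name `suitability_report` and the default thresholds (`min_orders=2.0`, `min_samples=500`) are illustrative assumptions, not established standards.

```python
import math


def suitability_report(values, min_orders=2.0, min_samples=500):
    """Rough pre-checks before a Benford analysis.

    min_orders and min_samples are hypothetical thresholds chosen for
    illustration; tune them for your own data.
    """
    # Zeros have no leading digit, so they are ignored.
    magnitudes = [abs(v) for v in values if v != 0]
    n = len(magnitudes)
    if n == 0:
        orders = 0.0
    else:
        # Orders of magnitude between the smallest and largest absolute value.
        orders = math.log10(max(magnitudes)) - math.log10(min(magnitudes))
    return {
        "n": n,
        "orders_of_magnitude": orders,
        "spans_enough": orders >= min_orders,
        "large_enough": n >= min_samples,
    }


# Example: values confined to the thousands fail the magnitude check.
print(suitability_report(range(2000, 9000)))
```

A report with both flags set to `True` only means the data passes these two mechanical checks; it does not establish that the data should follow Benford's Law.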