N2D3P9: Difference between revisions

@@ Line 57: / Line 57: @@
 Like the frequencies of letters in an alphabet, the ratios, when sorted in order of decreasing popularity, followed an approximate [https://en.wikipedia.org/wiki/Zipf%27s_law Zipf's law] distribution, with the Nth most popular ratio receiving votes roughly proportional to <math>\frac{1}{N^{1.37}}</math>. This meant that about half the ratios had only one vote each, and three quarters of them had three votes or fewer. With so few votes, the data on the less popular ratios was vulnerable to "historical noise": the position of such a ratio in the list might not be a good predictor of its relative frequency of use in the future.
+[[File:ZipfLikeScalaArchiveOcurrences.png]]
 In the early stages of Sagittal design, when allocating symbols for the most popular ratios, we could rely on the Scala archive data, but when we moved on to less popular ratios we needed some "less noisy" way to rank them.
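
The Zipf-like relationship quoted in the hunk above can be illustrated with a minimal sketch. Only the exponent 1.37 comes from the text; the function names and the value of <code>top_votes</code> are hypothetical, chosen purely for illustration, and the output is a smooth model curve, not the Scala archive counts.

```python
def zipf_votes(rank, top_votes, exponent=1.37):
    """Modelled votes for the ratio at a given popularity rank (1 = most popular)."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    # votes(N) is proportional to 1 / N**exponent, scaled so rank 1 gets top_votes
    return top_votes / rank ** exponent


def first_rank_below(threshold, top_votes, exponent=1.37):
    """Smallest rank whose modelled vote count falls below `threshold`."""
    rank = 1
    while zipf_votes(rank, top_votes, exponent) >= threshold:
        rank += 1
    return rank


# Hypothetical vote count for the most popular ratio (not taken from the archive).
top_votes = 1000

for rank in (1, 2, 10, 100):
    print(rank, round(zipf_votes(rank, top_votes), 1))

# Rank at which the model drops below 1.5 votes, i.e. rounds to a single vote.
print(first_rank_below(1.5, top_votes))
```

Under this model the vote count falls off quickly with rank, which is consistent with the long tail of one-vote ratios described above: beyond the rank printed last, every modelled count rounds to a single vote or less.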
@@ Line 66: / Line 70: @@
 <li value="c">is sufficiently simple, having only two parameters, that it cannot be [https://en.wikipedia.org/wiki/Overfitting overfitting] the data, and should therefore serve to average out the historical noise in the ranking of the less popular ratios, including ratios that do not occur in the archive at all.
 </ol>
+[[File:N2D3P9vScalaOcurrences.png]]
 == Development & Discovery ==