26 August 2006

Math-quandary revisited

let's forget the circular orbits
for now

and start with the commonest word
(probably 'the')

and stretch a separate springy elastic
from 'the' to each other word

with the preferred length
determined by the observed average distance
between occurrences of these words
compared to the expected average distance
based on simple frequencies alone

so that pairs that tend to occur
closer together
will have shorter elastics
and pairs that occur farther apart
longer elastics
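
(a minimal sketch of one way to compute such a preferred length, in Python. The names `mean_nearest_gap` and `preferred_length` are made up for illustration. "Observed distance" is assumed to mean the average gap, in tokens, from each occurrence of one word to the nearest occurrence of the other. "Expected distance" is assumed to be roughly N / (2 * count), the typical nearest gap if the words were scattered at random. Both are guesses at the metric, not a settled definition.)

```python
from bisect import bisect_left


def mean_nearest_gap(from_pos, to_pos):
    """Average distance from each position in from_pos to the nearest
    position in to_pos. Both lists must be sorted and non-empty."""
    total = 0
    for p in from_pos:
        i = bisect_left(to_pos, p)
        # The nearest neighbour is just before or just after the insertion point.
        candidates = to_pos[max(i - 1, 0):i + 1]
        total += min(abs(p - q) for q in candidates)
    return total / len(from_pos)


def preferred_length(tokens, a, b):
    """Observed nearest-gap between words a and b, divided by a rough
    frequency-only baseline (an assumption, see the note above).
    Values below 1 mean the pair sits closer together than chance."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if a == b or not pos_a or not pos_b:
        raise ValueError("need two different words that both occur")
    n = len(tokens)
    # Symmetrise by averaging the a-to-b and b-to-a views.
    observed = (mean_nearest_gap(pos_a, pos_b) + mean_nearest_gap(pos_b, pos_a)) / 2
    expected = (n / (2 * len(pos_b)) + n / (2 * len(pos_a))) / 2
    return observed / expected
```

(the ratio is unitless, so it still has to be scaled to whatever elastic lengths the network uses.)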

and we link every possible wordpair
by this metric
with the least likely pairs
linked by half-million-mile elastics
so the network as a whole
fills the Moon's orbit

(not just the flat disk
but an orbit-sized sphere)
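
(a minimal sketch of letting such a network of elastics settle in three dimensions, in Python. The function `relax_springs` and its parameters are hypothetical. It assumes the preferred lengths are already known and stored per pair, and it uses plain iterative relaxation, nudging the two ends of each elastic toward its preferred length, which is only one of many possible layout methods.)

```python
import math
import random


def relax_springs(words, rest_length, steps=500, rate=0.05, seed=0):
    """Place words in 3D so pairwise distances approach rest_length.

    rest_length maps frozenset({a, b}) to the preferred length of that
    pair's elastic. Returns a dict of word -> [x, y, z].
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pos = {w: [rng.uniform(-1, 1) for _ in range(3)] for w in words}
    for _ in range(steps):
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                delta = [pb - pa for pa, pb in zip(pos[a], pos[b])]
                dist = math.dist(pos[a], pos[b]) or 1e-9  # avoid dividing by zero
                target = rest_length[frozenset((a, b))]
                # Positive when the elastic is stretched (pull the ends together),
                # negative when it is compressed (push them apart).
                pull = rate * (dist - target) / dist
                for k in range(3):
                    pos[a][k] += pull * delta[k]
                    pos[b][k] -= pull * delta[k]
    return pos
```

(with every possible wordpair linked, each pass costs on the order of the vocabulary size squared, so a real vocabulary would need sampling or some other shortcut.)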

now
we can hope
the Yahoo clusters
will be well-separated in this space

and we can guess
the commonest words
will be pulled towards the center

and if we now
flatten the sphere

we ought
should
may
be able to
restore each word
to its frequency orbit
(most frequent closest
least frequent farthest)

without disrupting
the topical clustering...?
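
(a minimal sketch of the flatten-and-restore step, in Python. The function `to_frequency_orbits` is hypothetical. It assumes the crudest possible flattening, simply dropping the z coordinate, and assumes the center is the centroid of all the words. Each word keeps its direction from the center and is moved to a circle whose radius is its frequency rank, most frequent closest. Whether the clusters survive this is exactly the open question. The code only performs the move.)

```python
import math


def to_frequency_orbits(positions, counts):
    """Flatten 3D positions and put each word on a frequency-ranked circle.

    positions maps word -> (x, y, z); counts maps word -> frequency.
    Returns word -> (x, y), with the commonest word at radius 1,
    the next at radius 2, and so on.
    """
    # Most frequent first; ties are broken alphabetically for repeatability.
    ranked = sorted(positions, key=lambda w: (-counts[w], w))
    cx = sum(positions[w][0] for w in ranked) / len(ranked)
    cy = sum(positions[w][1] for w in ranked) / len(ranked)
    orbits = {}
    for rank, w in enumerate(ranked, start=1):
        # Dropping z is the flattening; the angle about the centroid is
        # kept, only the radius is replaced.
        angle = math.atan2(positions[w][1] - cy, positions[w][0] - cx)
        orbits[w] = (rank * math.cos(angle), rank * math.sin(angle))
    return orbits
```

(words that shared a direction from the center before the move still share it afterwards, so clusters spread over an angle would survive, while clusters separated only by depth or by distance from the center would not.)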