Sunday, August 3, 2014

word signatures including gestures

Words by themselves are very difficult for speech recognizers to work with. Consider the phrase "eats, shoots, and leaves." There is a lot of ambiguity about which kind of "leaves" this refers to: does it mean to abandon, or a part of a plant? Humans, even those who have "practiced" the art of being social for 10,000 hours, which most people have done by the time they turn 10, cannot be expected to know every idiom and metaphor coined since the beginning of time. Therefore something else is happening.

The book "Mirroring People" poses the question of why a person gestures while talking on the phone, when it is impossible for the other person to see them. It is because we use words as a vehicle to transport the other person to a thought or feeling we wish to convey. Many NLP practitioners use stemming as a way to isolate meaning, but this will not work, because words without context are empty (a small sketch of this appears below). We establish context with pitch, tempo, and liveliness. The book "Social Physics" asks us to imagine a device that isolates the spoken word from the manner in which it is spoken: the pitch, tempo, and liveliness. That device will also fail to grasp the full meaning. To grasp it, we need to think about the human brain and how it condenses meaning so that a person does not have to think too hard. The book "We Are Our Brains" discusses one estimate that a person's brain runs on roughly 15 watts, the equivalent of about $1,500 of energy over an entire life.

In order to come up with a computational model we need to ask the right questions. Some large speech recognition platforms train their systems on the voices of 30,000 people speaking. This is the wrong approach, because we need to think of words and concepts as being unique. By over-complicating the model we slow the system down until it is not usable in real time and requires computing power far beyond what is in our phones. Some systems treat this as a crowdsourcing problem and move the computing resources into the cloud. Computer security specialists know the problem with this approach: while some conversations clump together, some certainly do not. This is not a matter of gathering a big enough sample; there is nothing we can do about it. The book "Uncharted" explains how, even across almost a decade's worth of Google searches, some queries still stand out.

What my clear audio project proposes is to build a blackboard-like system based on how we understand the human mind to work. To do this we need to extend the artificial neural network. An artificial neural network is a single-parameter engine that produces a single-parameter result by recursion and by removing the strands that are not used very much. While we do need a single-parameter output, we do not yet have a nicely formatted single-parameter vector of input. We need a blackboard system fed several parameters by multiple agents (a rough sketch follows below).

We can use what we learned from visual saliency to find feature sets. Consider the book "Where's Waldo." The first feature we may associate with Waldo is that he wears red. Then we look for stripes. Then we look for glasses. Then we know for certain where Waldo is. This process is not unlike what we do with clear audio: we take a vector of certainties and study the sentence trajectory. Using Zipf's law we know how a topic is formed and diverged from. We know that every conversation has a topic sentence. It is unusual for a person to call another without a particular question in mind, even if it is just to see how their day went. Even that conversation has a subject: the day.
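To make the stemming point concrete, here is a minimal sketch in Python. It assumes the NLTK package and its Porter stemmer, and the two example sentences are made up: once stemmed, the "departs" reading and the "foliage" reading of "leaves" become indistinguishable.

```python
# A minimal sketch of why stemming alone loses context.
# Assumes the nltk package is installed (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

verb_sense = "the gunman eats shoots and leaves"   # leaves = departs
noun_sense = "the panda eats shoots and leaves"    # leaves = foliage

for sentence in (verb_sense, noun_sense):
    stems = [stemmer.stem(word) for word in sentence.split()]
    print(stems)

# Both sentences reduce "leaves" to the same stem, "leav", so the
# difference in meaning the sentence carried is no longer recoverable
# from the stems alone.
```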
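The blackboard idea can be sketched the same way. Everything below is hypothetical: the agents, their scores, and the 0-to-1 pitch, tempo, and liveliness numbers are placeholders, not the clear audio implementation. The point is only the shape of the thing: several agents post certainties about the same candidates, and the combined vector narrows the field the way red, then stripes, then glasses narrow down Waldo.

```python
# A rough sketch of a blackboard that combines several agents' certainties.
# All agent logic here is invented for illustration; a real system would
# measure pitch, tempo, and liveliness from the audio itself.

candidates = ["leaves (foliage)", "leaves (departs)"]

def word_agent(sentence):
    # Hypothetical: a panda in the sentence makes the foliage reading likelier.
    if "panda" in sentence:
        return {"leaves (foliage)": 0.7, "leaves (departs)": 0.3}
    return {"leaves (foliage)": 0.4, "leaves (departs)": 0.6}

def prosody_agent(pitch, tempo, liveliness):
    # Hypothetical: flat, quick delivery suggests the plainer foliage reading.
    if pitch < 0.5 and tempo > 0.5 and liveliness < 0.5:
        return {"leaves (foliage)": 0.6, "leaves (departs)": 0.4}
    return {"leaves (foliage)": 0.5, "leaves (departs)": 0.5}

def blackboard(sentence, pitch, tempo, liveliness):
    # Start every candidate at the same certainty, then let each agent post
    # its evidence, like the red / stripes / glasses cascade.
    certainty = {c: 1.0 for c in candidates}
    for scores in (word_agent(sentence), prosody_agent(pitch, tempo, liveliness)):
        for c in candidates:
            certainty[c] *= scores[c]
    best = max(certainty, key=certainty.get)
    return best, certainty

print(blackboard("the panda eats shoots and leaves", 0.3, 0.8, 0.2))
```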
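And the Zipf's law remark can be illustrated with a toy word count. The transcript and the stop-word list below are invented; the idea is that the common function words sit in the Zipfian head, and the few content words that keep recurring are the candidates for the conversation's subject.

```python
# A toy sketch of using word frequencies (the Zipf-style distribution)
# to guess a conversation's topic. The transcript and stop-word list
# are made up for illustration.
from collections import Counter

STOP_WORDS = {"the", "a", "i", "you", "it", "was", "how", "your", "my", "and", "to"}

transcript = (
    "how was your day "
    "my day was long the meeting ran late "
    "the meeting again they always run late"
)

counts = Counter(w for w in transcript.split() if w not in STOP_WORDS)

# The function words dominate the head of the distribution; the repeated
# content words left over ("day", "meeting", "late") are the topic candidates.
print(counts.most_common(3))
```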
We need to ask the contextual questions framed by Zachman's framework: what, how, where, who, when, and why. By studying a person's speech, including its pitch, tempo, and liveliness, as well as the words they actually say, future words can be predicted. By living in the experience we can allow our NLP engines to detect sarcasm.
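Here is one hypothetical way to hold those answers and the prosody measurements in a single frame and let them bias a guess at what comes next. The frame fields, the candidate phrases, and the scoring are all assumptions made for illustration, not a working predictor.

```python
# A hypothetical context frame combining the Zachman-style questions with
# prosody, used to bias a toy guess at the next phrase. Everything here is
# illustrative; no real prediction model is implied.

def next_phrase_guess(frame, candidates):
    def score(phrase):
        s = 1.0
        # Prefer candidates that mention the frame's topic.
        if frame["what"] and frame["what"] in phrase:
            s += 1.0
        # Let high liveliness nudge the guess toward emphatic phrasing,
        # a stand-in for how something is said narrowing what comes next.
        if frame["liveliness"] > 0.7 and phrase.endswith("!"):
            s += 0.5
        return s
    return max(candidates, key=score)

frame = {
    "what": "meeting",        # topic of the conversation
    "how": "tired voice",
    "where": "phone call",
    "who": "a friend",
    "when": "this evening",
    "why": "checking in",
    "pitch": 0.4,
    "tempo": 0.6,
    "liveliness": 0.8,
}

print(next_phrase_guess(frame, ["the meeting ran late", "nice weather today", "great news!"]))
```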