A dose of reality
Sometimes I argue with mathematicians about DSP. My view of DSP is an engineering one: firmly grounded in the idea of an objective reality. Signals, for me, signify something about the ‘real’ world - the results of measurements, or encoded messages that have some ‘real world’ meaning. Wikipedia agrees and in fact goes a bit further in saying:
“In the context of signal processing, arbitrary binary data streams are not considered as signals, but only analog and digital signals that are representations of analog physical quantities.”
Some mathematicians (and some DSP engineers) argue that the data in DSP is ‘just’ a set of numbers. Those numbers may represent some samples or properties of some thing in the real world, but once represented as a set of data they are ‘just numbers’.
There is some truth in this, but I think it is a dangerous over-simplification.
Let’s take a simple application to explore the idea.
Suppose that I have a signal. We start with a set of samples of this signal.
Now these numbers represent measured values of the ‘real world’ signal: but let’s take the pure mathematician’s argument and say that now they are a set of data, that meaning is somehow stripped away and we can treat them as ‘just numbers’. A set is a collection of things, so in this case to say the data is ‘just a set of numbers’ means simply that we have a collection of numbers. Let’s make a plot of these numbers. That sounds a simple enough thing to do but it isn’t. The plot has to be some view of the numbers, some way of ordering them or their properties. I might want to plot the numbers as a time sequence, to produce a sort of graph of the number against the time of the sample. But I can’t do that because this is just a set of numbers, so their order means nothing: a set is a collection, with no implied sequence. That is, we seem to have thrown away something vital - the order in a time sequence of the measurement time for each sample. I can readily calculate and plot some things - for instance a histogram plot of the frequency of occurrence of each number, or the average, or variance - but not the sequence.
It is easy to modify our position. We retreat from saying these are ‘just a set of numbers’ to say they are ‘just an ordered sequence of numbers’. That is, we admit that saying it is ‘just a set’ was too strong. I think we found something important here: that seemingly simple statements may be misleading or untrue. We need to be VERY careful when we claim anything is ‘just’ something, that it really is ‘just’ that and not something more - something implied and not explicit. This is obvious really, and the pure mathematician is not wrong - just defining a different problem domain than we were thinking of. In this case we effectively assign a ‘sequence number’ to each sample. If you use Matlab or Octave then you can simply do:
plot( signal )
and you will get a nice plot of the samples against sequence number.
But the sequence number does not necessarily relate to any real world sequence. In the special case where we might measure the voltage from a microphone at regular intervals, and store each such measurement in an ordered sequence whose order is the same as the order in which we took the measurements, then the sequence corresponds (with some conversion factor of the sample interval) to time. But let’s think about another common and simple set of data - a set of GPS co-ordinates. GPS data has four values for each sample - the time of the measurement, the latitude, longitude and altitude. It is very common for a single GPS ‘track’ to be an ordered sequence of these 4-dimensional measurements: in which case we can rely on the sequence to plot things like position versus time (the track), or altitude versus time (the ascent). Suppose, however, that we are not using the GPS to track the path we take in time, but to map out the steepness of ascent of a path up a hill. In this case the time is not all that helpful - and in fact we might have re-visited the site many times to take more readings, or had several people walk the path together, and merge those measurements into a single set. How should we merge the data? Into what new sequence should we sort the samples to make this new single merged data set? Of course we don’t need to sort the samples into a sequence - each sample is completely defined by its four values, so we can simply merge them in any way that is convenient - appending, or sorting according to altitude, or according to time, or randomly. Now we can plot the data in any way we want: ordered according to time, or altitude, or whatever. If you use Matlab or Octave, you don’t even need to order the samples, just specify what is to be plotted against what:
plot( time, altitude )
This is because the GPS data samples are complete in a way that the microphone samples are not, and so can be ordered and otherwise processed on their own without any external knowledge being implied. The microphone samples recorded only the voltage, and we derived the time implicitly from their sequence and our knowledge that they were measured at regular intervals (and presumably starting at some specified time). So with the microphone samples, to plot them against time we would have to introduce the sample interval and start time, and then rely on the sequence being ordered in time.
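The contrast can be sketched in a few lines of Python (again, not the Matlab or Octave of the snippets above). Everything here is hypothetical and for illustration only: the `GpsSample` record, the helper names `merge`, `series` and `time_axis`, and all the values. The GPS samples can be appended in any order and sorted afterwards by whichever field we like, whereas the microphone samples need a sample interval and start time supplied from outside before a time axis exists at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpsSample:
    """One self-contained GPS measurement (field names are assumptions)."""
    time_s: float   # seconds since some agreed reference time
    lat_deg: float
    lon_deg: float
    alt_m: float


def merge(*tracks):
    """Append tracks in whatever order they arrive; no sorting is needed."""
    merged = []
    for track in tracks:
        merged.extend(track)
    return merged


def series(samples, x, y):
    """Return (xs, ys) for plotting field y against field x, sorted by x."""
    pairs = sorted((getattr(s, x), getattr(s, y)) for s in samples)
    return [p[0] for p in pairs], [p[1] for p in pairs]


def time_axis(n_samples, start_s, interval_s):
    """Rebuild the implied time of each sample in a regularly sampled sequence."""
    return [start_s + k * interval_s for k in range(n_samples)]


# Two hypothetical walks up the same path, merged by simple appending.
walk_a = [GpsSample(0.0, 51.3080, -0.5923, 68.0),
          GpsSample(10.0, 51.3081, -0.5924, 71.0)]
walk_b = [GpsSample(5.0, 51.30805, -0.59235, 69.5)]
merged = merge(walk_a, walk_b)
times, altitudes = series(merged, "time_s", "alt_m")

# Microphone voltages carry no time of their own: we must supply it.
voltages = [0.1, 0.3, -0.2]
mic_times = time_axis(len(voltages), start_s=2.0, interval_s=0.5)
```

The merged GPS data can be re-sorted by altitude just as easily with `series(merged, "alt_m", "time_s")`. The microphone data cannot be re-ordered at all without destroying the only record of when each sample was taken.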
The idea here is that each sample is a measurement, and so should be self-contained - complete. The GPS samples are complete in this sense. Here is a simplified extract from a GPX file:
time     time unit  latitude     latitude unit  longitude     longitude unit  altitude  altitude unit
09:59.5  HHMMSSSS   51.30797833  degree         -0.59225      degree          68        m
10:19.0  HHMMSSSS   51.30799167  degree         -0.592543333  degree          68        m
10:19.9  HHMMSSSS   51.307975    degree         -0.592531667  degree          68        m
10:20.9  HHMMSSSS   51.30797333  degree         -0.592526667  degree          68        m
10:21.9  HHMMSSSS   51.30797833  degree         -0.592521667  degree          68        m
10:22.9  HHMMSSSS   51.30798     degree         -0.592515     degree          68        m
You can see here that all the information is contained in each line of the table (a line being one sample). You can also see that each sample has metadata that defines the units used for the measurement: so in principle we could merge data sets using SI units, or Imperial units (feet) or US units, or cm instead of m, because each sample not only contains all the values that completely determine it but also the units by which we can relate it back to the real world - and just as important, by which we can check that we are only adding or multiplying values that were measured using the same units.
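As a rough sketch of what that unit checking could look like, here is a small Python example (not Matlab or Octave). The `Length` class and the `TO_METRES` table are hypothetical names of my own, not part of GPX or of any library. A value carries its unit with it, addition refuses to mix units, and conversion is an explicit step.

```python
from dataclasses import dataclass

# Conversion factors to metres (the international foot is exactly 0.3048 m).
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}


@dataclass(frozen=True)
class Length:
    """A length that carries its unit with it (illustrative only)."""
    value: float
    unit: str

    def __add__(self, other):
        if not isinstance(other, Length):
            return NotImplemented
        if other.unit != self.unit:
            # Refuse to add values measured in different units.
            raise ValueError(f"cannot add {self.unit} and {other.unit}")
        return Length(self.value + other.value, self.unit)

    def to(self, unit):
        """Return the same length expressed in another known unit."""
        metres = self.value * TO_METRES[self.unit]
        return Length(metres / TO_METRES[unit], unit)
```

With this, `Length(68.0, "m") + Length(10.0, "ft")` raises a `ValueError`, while `Length(68.0, "m") + Length(10.0, "ft").to("m")` converts first and then adds. The mistake has to be made on purpose rather than by accident.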
This is now very far from the data set being ‘just numbers’.
In the world of Measurement Science, each measurement would also be accompanied by some estimate of its uncertainty (remember at school you always had to draw the ‘error bars’ on a graph?) and in DSP that uncertainty would be valuable as an estimate of the ‘noise’, so we could determine what we might need to do to refine the measurements - for example what filter or other process might we need to design in order to reduce the measurement uncertainty to some specified level?
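To give one simple worked case, here is a minimal sketch in Python (not Matlab or Octave). It assumes the noise on each sample is independent and has the same standard uncertainty, and that the ‘filter’ is nothing more than a plain average of N samples. Under those assumptions the uncertainty of the average is the per-sample uncertainty divided by the square root of N. The function names are hypothetical, and real measurement noise is often correlated, in which case this estimate is too optimistic.

```python
import math


def uncertainty_of_mean(u_sample, n):
    """Standard uncertainty of the mean of n independent samples.

    Assumes every sample has the same standard uncertainty u_sample
    and that the errors are uncorrelated.
    """
    return u_sample / math.sqrt(n)


def samples_needed(u_sample, u_target):
    """Smallest n for which averaging brings the uncertainty down to u_target."""
    return math.ceil((u_sample / u_target) ** 2)


# Hypothetical numbers: altitude readings good to 2 m, target of 0.5 m.
n = samples_needed(2.0, 0.5)
u_after = uncertainty_of_mean(2.0, n)
```

With these made-up numbers, `n` is 16 and `u_after` is 0.5. None of this can be worked out unless the uncertainty travels with the measurement in the first place.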
In fact the Measurement Institutes would also insist on an ‘audit trail’ - traceability of the measurement, and backwards through any subsequent processing, so that the eventual results could always be checked and traced back to their origin. A lot of work on this sort of thing was done during the 1980s and 1990s (for example Gary Kopec’s 1992 paper on ‘Signal Representations for Numerical Processing’ in Oppenheim’s book on ‘Symbolic and Knowledge-Based Signal Processing’) but seems since to have been dropped in favor of much more abstract mathematical representations.
The consequences of accepting that signals are not ‘just numbers’ are significant and important.
First, a sample should be completely determined - that is, it should specify all the values including for instance the time at which the measurement was made. In the special case where samples are taken at regular intervals and stored in a sequence ordered according to the time of measurement, we can condense the representation by referring to a sample interval and starting time, and an explicit statement that the sequence is that of time. Even in this special case the necessary data (sample interval, start time) and assumptions (sequence is in time order) should be explicitly stated and encapsulated with the signal. It is careless but sadly common for a program not to show a sample interval or rate, but to rely effectively on that information being in the head of the programmer.
Second, metadata including measurement units should be encapsulated with the signal - and programs should ideally check for mistakes such as mixing values that are in different units (e.g. metres and feet).
Third, estimates of uncertainty should be encapsulated with signals - and should ideally be updated as processing is applied that may change the uncertainty.
Fourth, signals should encapsulate an audit trail so that results are traceable back to their origin and through different versions of processing.
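The four points above can be sketched together in Python (not Matlab or Octave). This is only an illustration under stated assumptions, not a proposal for a standard: the `Signal` class, its field names and the `average_pairs` operation are all hypothetical, and the uncertainty update assumes independent noise on each sample. What matters is that every processing step returns a new signal whose sample interval, start time, uncertainty and history have been updated along with the values.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Signal:
    """A regularly sampled signal that carries its own context."""
    values: list
    unit: str                 # measurement unit of the values, e.g. "V"
    start_time_s: float       # time of the first sample
    sample_interval_s: float  # the sequence is assumed to be in time order
    uncertainty: float        # standard uncertainty per sample, in `unit`
    history: list = field(default_factory=list)  # audit trail of processing

    def times(self):
        """Explicit time of every sample, rebuilt from start and interval."""
        return [self.start_time_s + k * self.sample_interval_s
                for k in range(len(self.values))]


def average_pairs(sig):
    """Average non-overlapping pairs of samples, halving the sample rate."""
    n = len(sig.values) // 2
    averaged = [(sig.values[2 * k] + sig.values[2 * k + 1]) / 2
                for k in range(n)]
    return Signal(
        values=averaged,
        unit=sig.unit,
        # Each output sample sits midway between the pair it replaces.
        start_time_s=sig.start_time_s + sig.sample_interval_s / 2,
        sample_interval_s=2 * sig.sample_interval_s,
        # Averaging two independent samples divides uncertainty by sqrt(2).
        uncertainty=sig.uncertainty / math.sqrt(2),
        history=sig.history + ["average_pairs"],
    )


# Hypothetical microphone recording: 4 samples, 0.5 s apart, from t = 10 s.
mic = Signal([1.0, 3.0, 2.0, 4.0], "V", 10.0, 0.5, 0.02, ["recorded"])
smoothed = average_pairs(mic)
```

Here `smoothed.values` is `[2.0, 3.0]`, `smoothed.times()` is `[10.25, 11.25]` and `smoothed.history` is `['recorded', 'average_pairs']`. The original `mic` signal is left untouched, so the result can always be traced back to it.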
How closely does what I suggest match to what is usually done in practice? In many environments, hardly at all. That may be a consequence of deliberate decisions to not take on the overhead of keeping the additional data and metadata, and the work of checking and dealing with it. Or it may be a lack of understanding, due to the creeping advance of abstraction. Or it may just be that languages like Matlab make it very easy to process abstract arrays and matrices of ‘just numbers’ and so everyone got careless. I think the DSP industry is due for an overhaul, and this integral connection back to the real world and all that it implies is one of the most important flaws that needs to be addressed.