Thursday, May 25, 2006

Back from statistical hell

Actually, not hell. It was a fascinating two days of delving into the intricacies of logistic regression. This technique doesn't seem to be used very much in archaeology, but it's used a lot in both clinical and marketing/consumer research. It gets most of its use where the thing being tested has two possible outcomes -- say, whether a medical treatment cures or does nothing, or whether a customer responds to a sales offer or not. You basically take as many variables as you can and try to build a mathematical model that predicts which cases will respond and which won't; the method also tells you the degree to which each variable drives the response. It's also used to predict who will get a particular disease or condition, which is probably where most people come across it. This is where you read in medical news that such and such a factor -- obesity, drinking, etc. -- increases your risk of getting heart disease, cancer, whatever.
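For anyone who wants to see the mechanics, here is a minimal sketch in plain Python of what "build a model that predicts a yes/no response" can look like. Everything in it is an assumption for illustration: the eight data rows are made up (they are not from the course or from my data), the function names `fit_logistic` and `predict_proba` are hypothetical, and the fitting is done with simple gradient descent rather than whatever a real statistics package does internally. The exponentiated coefficients at the end are the odds ratios, which is roughly what lies behind those "factor X raises your risk" news stories.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1 (stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(bias, weights, x):
    """Predicted probability that a case with predictors x has outcome 1."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit intercept and coefficients by gradient descent on the log-loss.

    Illustrative only: real packages use faster, more careful algorithms
    and also report standard errors and significance tests.
    """
    n = len(X)
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(bias, weights, xi) - yi
            grad_b += err
            for j, v in enumerate(xi):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


# Made-up toy data: two predictors per case, outcome 1 = "responded".
X = [[0.5, 1.0], [1.0, 0.0], [1.5, 1.0], [2.0, 0.0],
     [2.5, 0.0], [3.0, 1.0], [3.5, 0.0], [4.0, 1.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

bias, weights = fit_logistic(X, y)

# exp(coefficient) is the odds ratio: the factor by which the odds of a
# response are multiplied when that predictor goes up by one unit.
odds_ratios = [math.exp(w) for w in weights]

print("intercept:", bias)
print("coefficients:", weights)
print("odds ratios:", odds_ratios)
print("P(respond) for x = [4.0, 0.0]:", predict_proba(bias, weights, [4.0, 0.0]))
```

An odds ratio well above 1 means the variable pushes cases toward the "yes" outcome, one near 1 means it does little, and one below 1 means it pushes toward "no".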

It's not used that much in archaeology because we're not usually concerned with outcomes, although it can be used to differentiate two different groups. For example, I have a bunch of data on lithic (stone tool) debitage (debris from the production of the tools) from Egypt's Fayum Depression and it's from two periods: what's called the Epipaleolithic -- basically Mesolithic everywhere else -- (ca. 6000+ BC) and the Neolithic (5200-4000 BC). The tools themselves are really distinctive, but there's a question of whether the latter was in some way derived from the former, since there's a gap of a thousand years or more. One might expect that if the same people were making both sets of tools, their techniques might be similar, just with different outcomes, and this might be reflected in the debitage. I also want to see if they used the small "waste" flakes similarly or differently in the two periods.

I've had a go at differentiating them using discriminant function analysis -- a similar method -- but have not found them to be significantly different when using several variables. Which is, in itself, interesting. Logistic regression seems to provide a better way of interpreting the results, so I'm hoping to get a somewhat finer-grained analysis out of it.