In the latest picalike workshop, data scientists Bendix Sältz and his colleague Dr. Christoph Ölschläger took us into the world of data, analysis and statistics. The topic “Predictive Analytics with Python” was not only discussed in theory; we also dived straight into data analysis in Python using JupyterLab.
These four steps should be followed for a good prediction:
- First of all, formulate a specific question: what do I want my analysis to answer? Once the goal is clear, I need to collect the right data. Everyone talks about big data, but often only a fraction of it is relevant to the question at hand. So I should collect or request exactly the data my analysis actually needs.
- The second step is to clean the data. In the workshop we worked with a CSV file: we read in the data and removed all the information that did not seem relevant to our question. We then processed the remaining data so that every field of the CSV held actual values we could work with; text fields, for example, were “decoded” into numeric features.
- Depending on the question, we then chose a model and framework to answer it. In the workshop we discussed random forests and various regression models, such as linear regression.
- In the fourth step, the model is interpreted to derive a prediction. This prediction should then be prepared visually and put into a presentation that is easy to understand (not only for statisticians).
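The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's actual notebook: the tiny rent dataset, the column names, and the choice of scikit-learn models are all invented for the example, with a small inline DataFrame standing in for the CSV.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Steps 1 and 2: collect and clean -- a tiny invented dataset
# (in the workshop this came from a CSV, e.g. pd.read_csv("data.csv"))
df = pd.DataFrame({
    "size_sqm": [30, 45, 60, 80, 100, 120, 55, 75],
    "city":     ["HH", "B", "HH", "M", "B", "M", "HH", "B"],
    "rent_eur": [600, 700, 950, 1400, 1500, 2100, 900, 1200],
    "note":     ["old"] * 8,          # uninformative column -> drop it
})
df = df.drop(columns=["note"])

# "Decode" the text field: one-hot encode the city column
X = pd.get_dummies(df.drop(columns=["rent_eur"]), columns=["city"])
y = df["rent_eur"]

# Step 3: choose a model -- here linear regression and a random forest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    # Step 4: interpret -- compare predictions on held-out rows
    print(type(model).__name__, model.predict(X_test).round(0))
```

In a real project, step 4 would go further: inspecting coefficients or feature importances and visualizing the results for the presentation.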
It is a capital mistake to theorize before one has data.
Sherlock Holmes, “A Study in Scarlet” (Sir Arthur Conan Doyle)
And here are our top four takeaways straight from the workshop:
- Data processing takes time
Without clean data there is no good model. Clean data is essential for understanding the problem; cleaning it takes time, but it has to be done.
- Not every data set yields a good model
A good model cannot be built from every data set, no matter how large it is. Sometimes the predictors are simply not meaningful.
- The more data, the better
Small data sets are subject to greater fluctuations (law of large numbers). But large data sets also make handling more difficult.
- Accept disappointments
Sometimes you can turn a data set back and forth as often as you like: a reliable prediction for the original question is simply not possible. But the analysis can still yield insights into many other things, letting you draw your own conclusions and perhaps answer a different question.
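The “more data” point above is easy to demonstrate: the mean of a small sample fluctuates far more than the mean of a large one. The simulation below, a quick sketch with invented parameters, compares the spread of sample means of fair-die rolls for two sample sizes.

```python
import random
import statistics

random.seed(42)

def mean_spread(sample_size, trials=200):
    """Standard deviation of the sample means of fair-die rolls."""
    means = [
        statistics.mean(random.randint(1, 6) for _ in range(sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

small = mean_spread(10)     # 10 rolls per sample: means scatter widely
large = mean_spread(1000)   # 1000 rolls per sample: means cluster tightly
print(small, large)
```

With 1000 rolls per sample, the means sit much closer to the true value of 3.5 than with 10 rolls, which is the law of large numbers at work.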