Thursday, June 13, 2013

Some thoughts on big data challenges


Here is a list of challenges from my personal encounters with the subject:

  1. How to make use of unsupervised data? 
  2. Untangling mixed phenomena
  3. The need for on-time (unexpected) decisions
  4. Identifying "black swans"
  5. Deploying legacy data (similar to #1, making use of unsupervised data)
  6. Devising a method to handle the exponential growth of data
  7. Using old tools in a new environment
  8. Is there any size that is too big to handle?
  9. Statistics in a dynamic reality
  10. What would be considered a right hypothesis?
    (or is there such a thing as a wrong question to ask?)

~~~~~~~~~~~
A Buddhist story about blind men trying to describe an elephant:


Five blind people were asked to describe an elephant. Each felt a part of the elephant. One person felt the elephant's trunk and said it is just like a plow pole. A second person touched the elephant's foot and said it is just like a post. A third person felt the elephant's tusk and said it is just like a plowshare. A fourth person had a hold of the elephant's tail and said it is just like a broom.  A fifth person felt the elephant's ear and said it is like a winnowing basket. As each one described the elephant, the others disagreed...



Wednesday, June 5, 2013

First introduction

Why GT, and why data mining at all?
My quest for a mining algorithm started a long while ago; I more or less grew up with the field. I was intrigued by how natural formations of data (clusters) occur. Are there any principles behind them? And how can one make use of them?
In this blog I'll try to write about these and other subjects that make data mining tick.
Thanks to Avishai Schur from FabHighQ for encouraging me to open this blog. See also the list of presentations and the posts in Hebrew at http://gtdatamning-heb.blogspot.co.il/
Your comments are appreciated.
Edith
What is GT and what does it stand for?
GT is a solution for creating new hypotheses based on identifying patterns of behavior. What makes it special is its hierarchical clustering and analysis of unsupervised data.
Origins
The name GT stands for Group Technology, an old Industrial Engineering method that aims to increase the efficiency of production and material handling by grouping items according to their similarity. In today's work environment its function is extended from the original shop-floor management to the management of "any type of database entities". In this sense, GT can be regarded as the abstract, generalized model of the old Group Technology.
Group Technology consists of several methods that were developed over the years, starting in World War II, when the Russians needed to relocate their factories to the East, where they would be safe from the advancing German army. Their idea was to keep the different product lines in a simple order that would be quick to reconstruct. The order they defined resembled the western "production line" approach, with one difference: instead of work-orders for identical items, the Russians allowed Groups of mixed items that shared the same processing route.
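To make the idea of Groups concrete, here is a minimal sketch in Python of the simplest possible case: items are placed in the same Group when they share exactly the same processing route. The item names, the routes, and the function group_by_route are all made up for illustration; this is not the GT data mining algorithm itself, only the basic grouping principle described above.

```python
from collections import defaultdict

# Hypothetical items and their processing routes (illustrative data only).
routes = {
    "bracket": ("cut", "drill", "paint"),
    "hinge": ("cut", "drill", "paint"),
    "shaft": ("turn", "grind"),
    "pin": ("turn", "grind"),
    "panel": ("cut", "paint"),
}


def group_by_route(item_routes):
    """Put items that share an identical processing route into one Group."""
    groups = defaultdict(list)
    for item, route in item_routes.items():
        # The route (an ordered sequence of operations) is the grouping key.
        groups[tuple(route)].append(item)
    # Sort item names so the result does not depend on input order.
    return {route: sorted(items) for route, items in groups.items()}


groups = group_by_route(routes)
for route, items in groups.items():
    print(" -> ".join(route), ":", ", ".join(items))
```

In this toy data the bracket and the hinge end up in one Group even though they are different items, because they travel through the same operations in the same order. Real data would rarely match so exactly, which is where a similarity measure and clustering would come in.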
Evolution
The Group approach gained more and more appeal in the West due to (to the best of my knowledge) two emerging technologies that later swept the manufacturing world:
(a) Operations Research, with its efficiency optimization. One should mention a prominent professor at Cranfield University, England: Sir John Burbidge, who was knighted by the Queen for his activity in this field.
(b) Flexible Manufacturing, developed in Japan as part of the CNC and cell-production concept.
Both technologies, Operations Research and Flexible Manufacturing, had to deal with increasingly diversified products and activities, for which the flexibility embedded in the multi-functional Groups had a tremendous advantage over the rigid idea of dedicated mass production lines.
Then, in the 1980s, a third leap brought the GT idea forward as a desirable solution: the IT revolution. IT introduced 'information' as an item in its own right (not just an adjunct to 'real' items), and in doing so opened the door to many new products and to changes in the organization and in the whole commercial scene. As IT redefined almost everything, organizations also needed to rebalance and regain efficiency, and the GT ability to organize work in Groups or Clusters according to processing sequence proved more valid than ever. This need to reorganize production was the initial aim of my GT data mining algorithm.
All the above-mentioned upheavals were, as it appears, just an introduction to what has come to be known as Big Data. Big Data poses new challenges to data mining analysts, above all in two features that have now become critical: AI and automation.
But how can AI rules be generated automatically? Can we replace the expert in creating insights, observations, and new hypotheses?


For testing hypotheses we have well-developed methods, but for creating the hypotheses to be tested - nothing!
This statement deserves a whole discussion of its own. For a start, the basic solution of GT data mining is about making new rules and validating them methodically.