Sunday, September 25, 2016

A perpetuum mobile of data – the essence of the IT revolution


The essence of the Information Technology revolution, the engine that propels it, is the reality in today's information systems of data bringing about more and more data in a closed, self-amplifying loop: data invite applications, applications bring users, users attract new service ideas, new services create more operations and management data, and so forth.
Data are the raw material of the information industry. Understanding that makes one appreciate the huge opportunity of free raw material that this industry enjoys (remark: an organization's information infrastructure does cost, but it is regarded as a general investment or overhead).

The problem with these virtual data assets is that most of them are intangible, i.e. they have no specific registered value in the accounts, so management may miss their existence. As long as the organization's competitors are sleepy, the waste does not really hurt and usually goes unnoticed. But the minute somebody else in the branch starts to use information for strategic advantage, the rules of the game change, forever.

Take, for example, the meteoric rise of Netflix to world leadership in movie supply over the web. Prior to the foundation of Netflix in 1997, the market was dominated by Blockbuster, which was not inclined to adopt advanced technologies, in contrast to Netflix, which was quick to employ new techniques and operations data for its "agile" business development. Blockbuster simply stayed behind and did not have much chance to close the widening gap. In such a case it does not help to be big, strong, reputable and internationally spread, as Blockbuster was.

Edith Ohri
Home of GT data mining 
Sep.2016

Friday, September 23, 2016

Is Machine Learning chasing its own tail (of presumptions)?

Machine Learning (ML) as a method of learning is indeed a machine, i.e. it operates consistently, repeatedly and predictably, by a designed method made for specific conditions; but its "learning" part is more like "training" or "verification" than the acquisition of new knowledge that the name suggests. Practically speaking, ML is made to improve prescribed response formulas, not to invent such formulas, and (I know this statement might seem controversial) not even to correct them.

Here is then my take on the issue:

Law #1 A dog (or a cat) chasing its tail for long enough will eventually catch it.

Law #2 The catching will hurt!

Law #3 Getting painful results will not stop the chase; it will stop only due to boredom or the exhaustion of all energy resources. 



Thursday, September 22, 2016

The law of Large Numbers fails in big data

The law of Large Numbers is often regarded as a sort of "law of nature" by which variables' averages always gravitate to fixed clear values. 
The question is, does the law of large numbers hold true in the case of big data?
The key to the answer lies in the law's underlying assumptions regarding sample representation and data stability. One of the qualities that signify big data is volatility. Volatility thrives in the large, multi-variate, closely packed and interrelated events that usually make up big data, and it is the dynamics that follow which interfere with the convergence of averages and prevent it from happening.
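As a quick illustration of the stability point, here is a minimal simulation sketch (in Python, on made-up series, not real market data): the running mean of a stationary i.i.d. series settles, while the running mean of a volatile, drifting series keeps wandering.

```python
# Minimal simulation: running means of a stationary series vs. a drifting one.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

iid = rng.normal(0.0, 1.0, size=n)                  # stationary i.i.d. noise
drift = np.cumsum(rng.normal(0.0, 0.01, size=n))    # slow random-walk drift
volatile = drift + rng.normal(0.0, 1.0, size=n)     # non-stationary series

def running_mean(x):
    return np.cumsum(x) / np.arange(1, len(x) + 1)

for name, series in (("i.i.d.", iid), ("volatile", volatile)):
    rm = running_mean(series)
    # Checkpoints of the running mean: the i.i.d. case settles near 0,
    # the drifting case keeps wandering instead of converging.
    print(name, [round(rm[k], 3) for k in (999, 9_999, 99_999)])
```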
In my view, even if the law of large numbers were true for big data, it would not be of much use, because it focuses on common "average" behavior that is already known, rather than on irregularities and exceptions that are still unknown and require research, such as early-detection indicators, adverse effects, fraud detection, quality assurance, customer retention, accidents, and long-tail marketing, to mention a few. Long tails, for example, consist of overlooked hidden phenomena, so their discovery has to look, by definition, elsewhere than the law of large numbers.

The above weak points of the law of large numbers are just a small part of the analytic "peculiarities" that can be expected in big data.
This post is the first in a series of essays on a proposed new concept of science in view of the IT industrial revolution.


Sunday, September 27, 2015

GT data mining demonstration - finances

Prediction of the daily US $ up/down change



Edith Ohri, edith@fabhighq.com




The goal


In this demo the goal has been to predict, with 55% accuracy, the next day's dollar direction (whether it will go UP or DOWN).

To attain this, one must establish objective rules and formulas that are independent of the specific input.

  

How it works


a -  Deciding on the input (from already existing, available sources).
b -  Defining with GT the patterns of behavior (= groups) and the cause-effect formulas.
c -  Together these constitute an expert system, which is then used for early alerts and real-time decisions.
d -  The expert system can improve itself and periodically review its rules (a generic sketch of such a pipeline follows below).

Note: GT's formulas can be integrated into the control of almost any product.
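To make steps (a)-(d) concrete, here is a generic sketch of such a pipeline. GT's own grouping and rule-building are not shown (they are proprietary); k-means and a per-group majority rule are stand-ins used only for illustration.

```python
# A generic sketch of steps (a)-(d). KMeans and the per-group majority rule
# are illustrative stand-ins, not GT's actual grouping or formulas.
import numpy as np
from sklearn.cluster import KMeans

def fit_group_rules(features, outcomes, n_groups=5, seed=0):
    """(b) Group similar records and derive one simple rule per group.
    `outcomes` are 0/1 integers (1 = next day UP)."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed).fit(features)
    rules = {}
    for g in range(n_groups):
        members = outcomes[km.labels_ == g]
        # Rule = the most frequent outcome observed inside the group.
        rules[g] = int(np.bincount(members).argmax()) if len(members) else 0
    return km, rules

def predict(km, rules, new_features):
    """(c) Apply the per-group rules to new records (alerts / decisions)."""
    return np.array([rules[g] for g in km.predict(new_features)])

# (d) Periodic review: refit fit_group_rules on a rolling window of new data.
```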


The data set

The data include 760 daily records over a two-and-a-half-year period and 7 variables: Date, Open price, Close price, High, Low, and an index named RSI (Relative Strength Index: it compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions; when it goes above 70 or below 30, it indicates that the asset is overbought or oversold and vulnerable to a trend reversal).
Rem.: trade volume information could not be obtained for this demo.

On top of the 7 basic variables, another 30 or more calculated variables were added, such as trends, weekdays, etc.
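For illustration only, this is one way such calculated variables could be derived with pandas from a daily price table; the column names and the simple moving-average RSI variant are assumptions, not the demo's actual code.

```python
# One possible way to build such calculated variables with pandas.
# Column names ("Date", "Close") and the SMA-based RSI are assumptions.
import pandas as pd

def add_features(df: pd.DataFrame, rsi_window: int = 14) -> pd.DataFrame:
    df = df.sort_values("Date").copy()
    df["Weekday"] = pd.to_datetime(df["Date"]).dt.day_name()
    change = df["Close"].diff()
    df["Trend5"] = df["Close"].rolling(5).mean().diff()         # short trend proxy
    gains = change.clip(lower=0).rolling(rsi_window).mean()
    losses = (-change.clip(upper=0)).rolling(rsi_window).mean()
    df["RSI"] = 100 - 100 / (1 + gains / losses)                # RSI, SMA variant
    df["NextUp"] = (df["Close"].shift(-1) > df["Close"]).astype(int)  # target
    return df
```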

The Test set includes 122 records from the end of the period.


Figure 1  Input records


The GT Learning results

The first thing is to set a lower hurdle, which is "the best result that one can achieve without the GT algorithm".
Here the lower hurdle was 55.7% right predictions in the Test set, and 56.6% in the Learning set.

Rem.: the good results are credited to the discovery of typical weekday patterns in Close-price changes.
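A sketch of how such a weekday-based lower hurdle could be computed (reusing the illustrative "Weekday" and "NextUp" columns from the feature sketch above; this reconstructs the idea, not the exact baseline used in the demo):

```python
# Weekday-based "lower hurdle": per weekday, predict the direction that was
# most common in the learning period, then score it on the test period.
import pandas as pd

def weekday_baseline(learn: pd.DataFrame, test: pd.DataFrame) -> float:
    majority = learn.groupby("Weekday")["NextUp"].agg(lambda s: int(s.mean() >= 0.5))
    fallback = int(learn["NextUp"].mean() >= 0.5)          # for unseen weekdays
    preds = test["Weekday"].map(majority).fillna(fallback)
    return float((preds == test["NextUp"]).mean())         # share of right calls
```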

Reaching beyond the assigned target

The assigned target of 55% prediction success was achieved, but it can be further improved with the GT Patterns-of-Behavior definition.

It is a well-known (and quite intuitive) law that greater precision can always be attained by adjusting the prediction factors to the subgroups of a given dataset. Following is a short demonstration of this law, employing the special abilities of the GT algorithm.
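Before the GT results, here is a tiny synthetic illustration of the law itself (numbers made up for the example, not taken from the demo): if UP days are more frequent on some weekdays than on others, a single global rule is right only about half the time, while separate per-subgroup rules do noticeably better.

```python
# Made-up numbers for the example: UP on 60% of Mondays, 40% of Fridays.
import numpy as np

rng = np.random.default_rng(1)
monday_up = rng.random(500) < 0.60
friday_up = rng.random(500) < 0.40

all_days = np.concatenate([monday_up, friday_up])
acc_global = max(all_days.mean(), 1 - all_days.mean())          # one rule for all days
acc_split = 0.5 * max(monday_up.mean(), 1 - monday_up.mean()) \
          + 0.5 * max(friday_up.mean(), 1 - friday_up.mean())   # one rule per weekday
print(round(acc_global, 3), round(acc_split, 3))                # roughly 0.5 vs 0.6
```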


GT Results* 

(* Initial results, for this demonstration)

Count of true/false predictions:
Right -   59%
Wrong - 41%


    • An improvement of about 3 percentage points in right predictions was achieved at just the beginning of the GT process.
    • With full data mining, and input that includes detailed transactions, further significant improvement can be expected.

Improvement tips

      
1.    Include non-linear variables if there are any, for example RSI, a non-linear Relative Strength Index that describes the pressure on prices due to excess demand or supply.

2.    Split the data into hierarchical patterns of behavior.
  
3.    Avoid "overfitting" by moving on to new subsets of data once the current ones have exhausted their information.
  

 Conclusion of example demo

   
GT proves effective in predicting the daily USD trend.
Finding the patterns (clusters) enables separate prediction for each segment and greater precision.



GT's success lies in its Industrial & Management Engineering roots

  
a.      Its first application was on the job, where the assignment was purely practical: to improve the line work-flow, not to invent a theoretical model.

b.      Industrial Engineers are almost never experts in the area of application; therefore the model needed to be strengthened with scientific internal validations.

c.      As is often done in IE, the development was carried out without investors. That enabled a very long incubation period and the accumulation of important personal experience.

d.      The practical IE approach led to a focus on the "discovery of hidden patterns", instead of the more academic approach that prioritizes correlations and speed of execution.

e.      Full-cycle implementation costs of the product are considered; no hard-sell wizardry.

f.       Real work forced starting the algorithm ahead of time, which turned out to help greatly in avoiding conventional misconceptions...

g.      Product development means primarily substantiating its work method, not its market share.

h.      From an IE perspective it is only natural to offer a SaaS option.

i.        IE should always adhere to the actual implementation, on top of the business musts.

j.        High-tech or not, "we do business the old way, we earn it".




~~~~

Edith Ohri, Home of GT data mining

Tuesday, September 15, 2015

Digging in financial data

Using any data for in-depth conclusions

Lessons from a GT study* of 1,000 NYSE companies from the year 2000, just before the dot-com bubble crash.

---------------
https://docs.google.com/file/d/0B1tc2-duf3_4YzM2M2M2OWMtZjAwNS00Y2FlLWJhOWUtOTc3ZjM3NTY1YzVm/edit?usp=sharing


Conclusion 1
A pattern of behavior can be as small as a fraction of 1% of the total number of events.
In this study, GT found a tiny subgroup containing only 4 "exception" companies out of 1,000. It consists of 4 banks with the exceptional feature of very high net profit, twice as much as others in the financial sector. An explanation for their unusually high performance was offered 8 years(!) later, during the 2008 credit/derivatives crisis, when the 4 banks' names were mentioned in news headlines.

Conclusion 2
Large data sets require a general view on top of the detailed one. Here the general view fits almost exactly the common industry definitions. There is only one difference, yet a most significant one: some giant corporations are found to behave like financial institutions rather than like their own industries. This observation strengthens our understanding of the 2008 crisis.

Conclusion 3
Big data is about using AVAILABLE unsupervised data, without the cleaning that is commonly suggested.
This study is based on free data from http://www.ics.uci.edu. The data quality seems insufficient for research: there is no historical "depth", no share-value information, and the sample does not reflect the subgroups. Yet GT turned out quite good results. It means that data are useful even if partial and unsupervised!

Thursday, June 13, 2013

Some thoughts on big data challenges


Here is a list of challenges from my personal encounters with the subject:

  1. How to make use of unsupervised data? 
  2. Untangling mixed phenomena
  3. The need for on time (unexpected) decisions
  4. Identifying "black swans"
  5. Deploying legacy data - this is similar to #1 using unsupervised data
  6. Devising a method for exponential growth of data 
  7. Using old tools in a new environment
  8. Is there any size that is too big to handle?
  9. Statistics in a dynamic reality
  10. What would be considered a right hypothesis?
    (or is there such a thing as a wrong question to ask?)

~~~~~~~~~~~
A Buddhist story about blind men trying to describe an elephant:


Five blind people were asked to describe an elephant. Each felt a part of the elephant. One person felt the elephant's trunk and said it is just like a plow pole. A second person touched the elephant's foot and said it is just like a post. A third person felt the elephant's tusk and said it is just like a plowshare. A fourth person had a hold of the elephant's tail and said it is just like a broom.  A fifth person felt the elephant's ear and said it is like a winnowing basket. As each one described the elephant, the others disagreed...



Wednesday, June 5, 2013

First introduction

Why GT, and why data mining at all?
My quest for a data mining algorithm started a long while ago; I sort of grew up with the field. It intrigued me to know how the natural formation of data clusters occurs. Are there any principles? And how may one make use of them?
In this blog I'll try to write about these and other aspects of what makes data mining tick.
Thanks to Avishai Schur from FabHighQ for encouraging me to open this blog. See also the presentation list and posts in Hebrew at http://gtdatamning-heb.blogspot.co.il/
Your comments are appreciated.
Edith
What is GT and what does it stand for?
GT is a solution for creating new hypotheses by identifying patterns of behavior. Its special feature is hierarchical clustering and analysis of unsupervised data.
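GT's own algorithm is not public; purely as a generic illustration of what hierarchical clustering of unlabeled data looks like, here is a short sketch using SciPy's standard tools on synthetic records.

```python
# Generic hierarchical clustering of unlabeled records with SciPy,
# shown only to illustrate the concept; this is not GT's algorithm.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic unlabeled records: three loose "patterns of behavior" in 4 features.
data = np.vstack([rng.normal(center, 0.5, size=(50, 4)) for center in (0.0, 3.0, 6.0)])

tree = linkage(data, method="ward")                  # build the cluster hierarchy
groups = fcluster(tree, t=3, criterion="maxclust")   # cut the tree into 3 groups
print(np.bincount(groups)[1:])                       # sizes of groups 1..3
```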
Origins
The name GT stands for Group Technology, an old Industrial Engineering method aimed at increasing the efficiency of production and material handling by grouping items according to their similarity. In today's work environment its function extends from the original shop-floor management to the management of "any type of database entities". In this sense, GT can be regarded as the abstract, generalized model of the old Group Technology.
Group Technology consists of several methods developed over the years, starting in World War II when the Russians needed to relocate their factories to the East, where they would be safe from the advancing German army. Their idea was to keep the different product lines in a simple order that would be quick to reconstruct. The order they defined resembled the western "production line" approach, with one difference: instead of work orders for identical items, the Russians allowed groups of mixed items that shared the same processing route.
Evolvement
The Group approach gained more and more appeal in the West due to (to the best of my knowledge) two emerging technologies that later on swept the manufacturing world:
(a) Operations Research, with its efficiency optimization. One should mention a prominent professor at Cranfield University, England, Sir John Burbidge, who was knighted by the Queen for his activity in this field.
(b) Flexible Manufacturing developed in Japan as part of the CNC and cell-production concept.
Both technologies, Operations Research and Flexible Manufacturing, had to deal with increasingly diversified products and activities, for which the flexibility embedded in the multi-functional groups had a tremendous advantage over the rigid idea of dedicated mass-production lines.
Then, in the 80's, a third leap occurred that brought the GT idea forward as a desirable solution: the IT revolution. IT introduced 'information' as an item in itself (not just adjacent to 'real' items) and by this opened the door to many new products and to changes in the organization and the whole commercial scene. As IT redefined almost everything, it also needed to rebalance and regain efficiency, and GT's ability to organize work in groups or clusters according to processing sequence proved more valid than ever. This need to reorganize production was the initial aim of my GT data mining algorithm.
All the above-mentioned upheavals were, as it appears, just an introduction to what has come to be known as Big Data. Big Data poses new challenges to data mining analysts, mostly in two features that have now become critical: AI and automation.
But how to generate AI rules automatically? Can we replace the expert in creating insights, observations, and new hypotheses?


For testing purposes we have well-developed methods, but for creating the hypotheses (to be tested) - nothing!
This statement deserves a whole discussion of its own. For a start, the basic solution of GT data mining is about making new rules and validating them methodically.