Re: Hadoop, pig, dynamical systems tools

Hi Brenda,

David Donoho has an extensive critical review of big data techniques in
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

But I'd like to learn about Hadoop from your direct experience, and understand how you think we can apply such big-data mining techniques to multi-dimensional activity / ensemble-body / gesture tracking.  It may be good to talk with Qiao to get his opinion on things.  I trust his scientific judgment most ...

And here's a curmudgeonly rant:

Glad to see you working thru all this,
Xin Wei

On Aug 8, 2016, at 8:40 PM, Brenda McCaffrey <brendamc@asu.edu> wrote:

Hi Xin Wei and friends,

Quick update:

*  Thanks to Connor for helping me get much of the Java code working within Eclipse.
*  I'm meeting with Mike K. tomorrow morning at 10am to look at the code.
*  I'm learning more about Hadoop and Apache Pig and it may be very exciting for us, although I have yet to find anyone in our realm who has knowledge of it.  Here are a few things I've learned:
  • Hadoop is an open source framework for distributed data storage and processing that grew out of Google's MapReduce and Google File System work about 10 years ago.
  • Hadoop is used by Amazon, Yahoo, Facebook and others to track and manage behavioral data.  For example, that's how Facebook knows where you move your mouse.
  • Hadoop allows massively parallel processing of large data sets.  It can be set up on clusters of low-cost computers and scaled out to run many tasks in parallel.
  • Apache Pig is a scripting language whose queries are compiled into Hadoop jobs.
Apparently, Hadoop and Pig are primarily used in business systems.  The gentleman who wrote these dynamical systems codes (Jacob Perkins, when he was at UT Austin) did so to demonstrate that one can solve complex scientific problems, even without massive data sets, by using parallel processing.  Eureka!

I continue to be intrigued by this approach since it implies to me that we may be able to solve very complex, real-time dynamical systems problems by using relatively simple parallel computing frameworks in conjunction with Java.

I'm going to get the fundamental Java to work (I hope) to demonstrate its functionality, but the real power of these programs is the ability to slice the simulations into partitions that can be evaluated in parallel to converge on the regions of interest.  This could be especially useful for time-domain data coming in from complex systems for which we don't know the closed-form analytical models.  (Video?)

These are my thoughts for now.  I'm not a CS person so I may be completely in left field about this, but it looks promising.

Thanks for listening.
-Brenda


On Sun, Aug 7, 2016 at 4:10 AM, <shaxinwei@gmail.com> wrote:

Dr Brenda McCaffrey, PhD researcher at Synthesis, is working on a project to develop some dynamical system based tools for modeling human movement.

Can anyone help with the following query:

I’m converging on an approach for Java simulation of Lyapunov exponents based on an amazing set of programs created by a gentleman named Jacob Perkins (https://github.com/thedatachef/sounder/tree/master/udf/src/main/java/sounder/pig/chaos) that he developed using the Sprott algorithm.  This is really deep stuff and uses Hadoop and pig as well as Java.  Do we have resources in these areas?  A couple of hours with an expert would change my life!

Xin Wei

Crazy and spectacular info from Jacob Perkins re chaos, Lyapunov, pig and hadoop

I've been diving into Java programs written by a gentleman named Jacob Perkins (https://github.com/thedatachef) who has posted the following information on his blog regarding these programs.  His Lyapunov program in Java is based on the Sprott model described in his book, so I'm assuming for now that it is entirely appropriate for what we're trying to do.

Also, in talking with Loren O. today, he points out that we don't necessarily need to port Java code into Max in order to allow people to use it.  It is possible, and perhaps preferable, to create stand-alone Java code that can process data, return arguments, etc. and use OSC to communicate with Max and other systems.
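For concreteness, here is a minimal sketch of that stand-alone-Java-plus-OSC idea.  It assumes the JavaOSC library (com.illposed.osc) and a Max patch listening with [udpreceive 7400]; the port number and the /synthesis/lyapunov address are made-up placeholders, and JavaOSC's constructor signatures vary a bit between versions.

import java.net.InetAddress;
import java.util.Arrays;
import com.illposed.osc.OSCMessage;
import com.illposed.osc.OSCPortOut;

// Hypothetical stand-alone sender: compute a value in Java, ship it to Max over OSC/UDP.
public class OscToMax {
    public static void main(String[] args) throws Exception {
        // Max would listen on this port with a [udpreceive 7400] object (port is an assumption).
        OSCPortOut sender = new OSCPortOut(InetAddress.getByName("127.0.0.1"), 7400);
        float lyapunov = 0.42f;  // stand-in for a value the Java analysis would compute
        OSCMessage msg = new OSCMessage("/synthesis/lyapunov", Arrays.<Object>asList(lyapunov));
        sender.send(msg);
        sender.close();
    }
}

The same pattern works in the other direction (an OSCPortIn with a listener) if Max needs to send control data back to the Java process.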

I'm really intrigued by Hadoop and the pig scripting capabilities.  So here I am looking at fairly advanced Java programs, Hadoop and pig, and I'm a complete newbie with all of these tools.  A couple of hours with an expert would be amazing.

Here's the relevant text from Jacob Perkins's blog (http://thedatachef.blogspot.com/):

Monday, August 12, 2013

Using Hadoop to Explore Chaos

Hadoop. Hadoop has managed to insinuate itself into practically every company with an engineering team and some data. If your company isn't using it, you know a company that is. Hell, it's why you're reading this to begin with. That being said, what you're probably doing with Hadoop is boring and uninspired. It's not your fault of course. Pretty much every example out there pigeonholes Hadoop into default business use cases like ETL and data cleaning, basic statistics, machine learning, and GIS.

You know what though? Sometimes it's good to explore things that don't have an obvious business use case. Things that are weird. Things that are pretty. Things that are ridiculous. Things like dynamical systems and chaos. And, if you happen to find there are applicable tidbits along the way (*hint, skip to the problem outline section*), great, otherwise just enjoy the diversion. 

motivation


So what is a dynamical system? Dryly, a dynamical system is a fixed rule that describes how a point moves through geometric space over time. Pretty much everything that is interesting can be modeled as a dynamical system. Population, traffic flows, fireflies, and neurons can all be described this way.

In most cases, you'll have a system of ordinary differential equations like this:

\[
\begin{aligned}
\dot{x}_1 &= f_1(x_1, \ldots, x_n) \\
&\;\;\vdots \\
\dot{x}_n &= f_n(x_1, \ldots, x_n)
\end{aligned}
\]


For example, the Fitzhugh-Nagumo model (which models a biological neuron being zapped by an external current):

\[
\begin{aligned}
\dot{v} &= v - \frac{v^3}{3} - w + I_{\mathrm{ext}} \\
\dot{w} &= 0.08\,(v + 0.7 - 0.8\,w)
\end{aligned}
\]


In this case v represents the potential difference between the inside of the neuron and the outside (membrane potential), and w corresponds to how the neuron recovers after it fires. There's also an external current I_ext which can model other neurons zapping the one we're looking at but could just as easily be any other source of current like a car battery. The numerical constants in the system are experimentally derived from looking at how giant squid axons behave. Basically, these guys in the 60's were zapping giant squid brains for science. Understand a bit more why I think your business use case is boring?

One of the simple ways you can study a dynamical system is to see how it behaves for a wide variety of parameter values. In the Fitzhugh-Nagumo case the only real parameter is the external current I_ext. For example, for what values of I_ext does the system behave normally? For what values does it fire like crazy? Can I zap it so much that it stops firing altogether?

In order to do that you'd just decide on some reasonable range of currents, say (0, 1), break that range into some number of points, and simulate the system while changing the value of I_ext each time.
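Here's a rough sketch of that sweep in Java, integrating the system with naive forward Euler; the step size, duration, starting state, and the crude spike counter are arbitrary illustrative choices.

// Sweep the external current and integrate Fitzhugh-Nagumo with forward Euler.
public class FitzhughNagumoSweep {
    public static void main(String[] args) {
        int points = 100;                 // number of I_ext values sampled from (0, 1)
        double dt = 0.05, tMax = 500.0;   // integration step and duration (arbitrary)
        for (int i = 1; i <= points; i++) {
            double iExt = i / (double) (points + 1);
            double v = -1.0, w = 1.0;     // arbitrary starting state
            int spikes = 0;
            boolean above = false;
            for (double t = 0; t < tMax; t += dt) {
                double dv = v - (v * v * v) / 3.0 - w + iExt;
                double dw = 0.08 * (v + 0.7 - 0.8 * w);
                v += dt * dv;
                w += dt * dw;
                if (v > 1.0 && !above) { spikes++; above = true; }  // crude spike detector
                if (v < 1.0) above = false;
            }
            System.out.printf("I_ext=%.3f spikes=%d%n", iExt, spikes);
        }
    }
}

Counting spikes per run is just one cheap summary; the point is that each value of I_ext is an independent simulation, which is exactly the kind of embarrassingly parallel work the rest of the post hands to Hadoop.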

    chaos


    There are a lot of great ways to summarize the behavior of a dynamical system if you can simulate its trajectories. Simulated trajectories are, after all, just data sets. The way I'm going to focus on is calculation of the largest lyapunov exponent. Basically, all the lyapunov exponent says is, if I take two identical systems and start them going at slightly different places, how similarly do they behave?

    For example, if I hook a car battery to two identical squid neurons at the same time, but one has a little bit of extra charge on it, does their firing stay in sync forever or do they start to diverge in time? The lyapunov exponent would measure the rate at which they diverge. If the two neurons fire close in time but don't totally sync up then the lyapunov exponent would be zero. If they eventually start firing at the same time then the lyapunov exponent is negative (they're not diverging, they're coming together). Finally, if they continually diverge from one another then the lyapunov exponent is positive.

    As it turns out, a positive lyapunov exponent usually means the system is chaotic. No matter how close two points start out, they will diverge exponentially. What this means in practice is that, while I might have a predictive model (as a dynamical system) of something really cool like a hurricane, I simply can't measure it precisely enough to make a good prediction of where it's going to go. A really small measurement error, between where the hurricane actually is and where I measure it to be, will diverge exponentially. So my model will predict the hurricane heading into Texas when it actually heads into Louisiana. Yep. Chaos indeed.

    problem outline


    So I'm going to compute the lyapunov exponent of a dynamical system for some range of parameter values. The system I'm going to use is the Henon Map:
    \[
    \begin{aligned}
    x_{n+1} &= y_n + 1 - a x_n^2 \\
    y_{n+1} &= b x_n
    \end{aligned}
    \]
    I chose the Henon map for a few reasons, despite the fact that it isn't modeling a physical system. One, it's super simple and doesn't involve time at all. Two, it's two-dimensional so it's easy to plot it and take a look at it. Finally, it's only got two parameters, meaning the range of parameter values makes up a plane (and not some n-dimensional hyperspace), so I can make a pretty picture.
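    For reference, here is a rough Java sketch of the standard two-orbit way to estimate the largest lyapunov exponent of the Henon map: follow two orbits that start a tiny distance apart, and renormalize their separation at every step (the usual Sprott/Benettin-style recipe). This is not the actual LyapunovForHenon code; the iteration counts, starting point, and separation d0 are arbitrary choices.

    // Estimate the largest lyapunov exponent of the Henon map for parameters (a, b).
    public class HenonLyapunov {
        static double lyapunov(double a, double b) {
            double x = 0.1, y = 0.1;          // reference orbit (arbitrary start)
            double d0 = 1e-8;                 // initial separation
            double x2 = x + d0, y2 = y;       // perturbed orbit
            double sum = 0.0;
            int skip = 1000, iters = 10000;   // transient to discard, then samples
            for (int i = 0; i < skip + iters; i++) {
                // advance both orbits one step of the map
                double xn = y + 1 - a * x * x, yn = b * x;
                double xn2 = y2 + 1 - a * x2 * x2, yn2 = b * x2;
                x = xn; y = yn; x2 = xn2; y2 = yn2;
                double d = Math.hypot(x2 - x, y2 - y);
                if (!(d > 0) || Double.isInfinite(d)) return Double.NaN;  // orbit escaped or collapsed
                if (i >= skip) sum += Math.log(d / d0);
                // pull the perturbed orbit back to distance d0 along the separation direction
                x2 = x + d0 * (x2 - x) / d;
                y2 = y + d0 * (y2 - y) / d;
            }
            return sum / iters;
        }

        public static void main(String[] args) {
            // classic chaotic parameters; the estimate should land near 0.42
            System.out.println(lyapunov(1.4, 0.3));
        }
    }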

    What does Hadoop have to do with all this anyway? Well, I've got to break the parameter plane (ab-plane) into a set of coordinates and run one simulation per coordinate. Say I let a ∈ [a_min, a_max] and b ∈ [b_min, b_max] and I want to look at N unique a values and M unique b values. That means I have to run N × M individual simulations!

    Clearly, the situation gets even worse if I have more parameters (a.k.a. a realistic system). However, since each simulation is independent of all the other simulations, I can benefit dramatically from simple parallelization. And that, my friends, is what Hadoop does best. It makes parallelization trivially simple. It handles all those nasty details (which distract from the actual problem at hand) like what machine gets what tasks, what to do about failed tasks, reporting, logging, and the whole bit.

    So here's the rough idea:


    1. Use Hadoop to split the n-dimensional (2D for this trivial example) space into several tiles that will be processed in parallel
    2. Each split of the space is just a set of parameter values. Use these parameter values to run a simulation.
    3. Calculate the lyapunov exponent resulting from each.
    4. Slice the results, visualize, and analyze further (perhaps at higher resolution on a smaller region of parameter space), to understand under what conditions the system is chaotic. In the simple Henon map case I'll make a 2D image to look at.
    The important silly detail is this. The input data here is minuscule in comparison to most data sets handled with Hadoop. This is NOT big data. Instead, the input data is a small file with n lines and can be thought of as a "spatial specification". It is the input format that explodes the spatial specification into the many individual tiles needed. In other words, Hadoop is not just for big data, it can be used for massively parallel scientific computing.
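    To make the "explode the spatial specification" idea concrete, here is a tiny Java sketch that expands a two-line spec into the full grid of (a, b) values. It assumes each line means min,max,count with the endpoints included (which matches the space_spec file and output spacing shown further down); in the real job this expansion happens inside the custom InputFormat, not in a standalone program.

    import java.util.ArrayList;
    import java.util.List;

    // Expand "min,max,count" axis specs into the grid of parameter tiles.
    public class SpaceSpec {
        static double[] axis(String line) {
            String[] p = line.split(",");
            double min = Double.parseDouble(p[0]), max = Double.parseDouble(p[1]);
            int count = Integer.parseInt(p[2]);
            double step = count > 1 ? (max - min) / (count - 1) : 0.0;
            double[] values = new double[count];
            for (int i = 0; i < count; i++) values[i] = min + i * step;
            return values;
        }

        public static void main(String[] args) {
            double[] as = axis("0.6,1.6,800");
            double[] bs = axis("-1.0,1.0,800");
            List<double[]> tiles = new ArrayList<>();
            for (double a : as)
                for (double b : bs)
                    tiles.add(new double[] { a, b });    // one (a, b) tile = one simulation
            System.out.println(tiles.size() + " simulations");  // 640000
        }
    }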
      

    implementation


    Hadoop has been around for a while now. So when I implement something with Hadoop you can be sure I'm not going to sit down and write a java map-reduce program. Instead, I'll use Pig and custom functions for pig to hijack the Hadoop input format functionality. Expanding the rough idea in the outline above:

    1. Pig will load a spatial specification file that defines the extent of the space to explore and with what granularity to explore it.
    2. A custom Pig LoadFunc will use the specification to create individual input splits for each tile of the space to explore. For less parallelism than one input split per tile it's possible to specify the number of total splits. In this case the tiles will be split mostly evenly among the input splits.
    3. The LoadFunc overrides Hadoop classes. Specifically: InputFormat (which does the work of expanding the space), InputSplit (which represents the set of one or more spatial tiles), and RecordReader (for deserializing the splits into useful tiles).
    4. A custom EvalFunc will take the tuple representing a tile from the LoadFunc and use its values as parameters in simulating the system and computing the lyapunov exponent. The lyapunov exponent is the result.
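    Just to show the shape of such a function, here is a minimal EvalFunc skeleton. It is not Perkins's actual LyapunovForHenon implementation; it leans on the HenonLyapunov sketch above and assumes the loader hands it an (a, b) tuple.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Skeleton UDF: take the (a, b) tuple from the loader, return the lyapunov exponent.
    public class LyapunovSketch extends EvalFunc<Double> {
        @Override
        public Double exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) return null;
            double a = ((Number) input.get(0)).doubleValue();
            double b = ((Number) input.get(1)).doubleValue();
            return HenonLyapunov.lyapunov(a, b);  // simulation as sketched earlier
        }
    }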
    And here is the pig script:

    define LyapunovForHenon sounder.pig.chaos.LyapunovForHenon();

    points = load 'data/space_spec' using sounder.pig.points.RectangularSpaceLoader();

    exponents = foreach points generate $0 as a, $1 as b, LyapunovForHenon($0, $1);

    store exponents into 'data/result';

    You can take a look at the detailed implementations of each component on GitHub. See: LyapunovForHenon and RectangularSpaceLoader.

    running


    I want to explore the Henon map over a range where it's likely to be bounded (unbounded solutions aren't that interesting) and chaotic. Here's my input file: 

    $: cat data/space_spec
    0.6,1.6,800
    -1.0,1.0,800

    Remember the system?
    \[
    \begin{aligned}
    x_{n+1} &= y_n + 1 - a x_n^2 \\
    y_{n+1} &= b x_n
    \end{aligned}
    \]
    Well, the spatial specification says (if I let the first line represent a and the second be b) that I'm looking at an 800×800 (or 640,000 independent simulations) grid in the ab-plane, where a ∈ [0.6, 1.6] and b ∈ [-1.0, 1.0].

    Now, these bounds aren't arbitrary. The Henon attractor that most are familiar with (if you're familiar with chaos and strange attractors in the least bit) occurs when a = 1.4 and b = 0.3. I want to ensure I'm at least going over that case.

    result


    With that, I just need to run it and visualize the results.


    $: cat data/result/part-m* | head
    0.6 -1.0    9.132244649409043E-5
    0.6 -0.9974968710888611 -0.0012539625419929572
    0.6 -0.9949937421777222 -0.0025074937591903013
    0.6 -0.9924906132665833 -0.0037665150764570965
    0.6 -0.9899874843554444 -0.005032402237514987
    0.6 -0.9874843554443055 -0.006299127065420516
    0.6 -0.9849812265331666 -0.007566751054452304
    0.6 -0.9824780976220276 -0.008838119048229768
    0.6 -0.9799749687108887 -0.010113503950504331
    0.6 -0.9774718397997498 -0.011392710785045064
    $: cat data/result/part-m* > data/henon-lyapunov-ab-plane.tsv

    To visualize, I used a simple Python script (not reproduced in this excerpt) to render the exponent values over the ab-plane as an image:
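    As a stand-in for the missing script, a rough Java equivalent using only the standard library might look like the sketch below; the 800×800 grid size, the grayscale mapping, the handling of escaped orbits, and the file names are all assumptions.

    import java.awt.image.BufferedImage;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import javax.imageio.ImageIO;

    // Read the whitespace-separated (a, b, exponent) triples and paint an 800x800 grayscale
    // image of the ab-plane: dark = strongly negative exponent, light = near or above zero.
    public class PlotLyapunov {
        public static void main(String[] args) throws Exception {
            int n = 800;
            BufferedImage img = new BufferedImage(n, n, BufferedImage.TYPE_INT_RGB);
            try (BufferedReader in = new BufferedReader(new FileReader("data/henon-lyapunov-ab-plane.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.trim().split("\\s+");
                    double a = Double.parseDouble(f[0]);
                    double b = Double.parseDouble(f[1]);
                    double lambda = Double.parseDouble(f[2]);
                    int col = (int) Math.round((a - 0.6) / (1.6 - 0.6) * (n - 1));
                    int row = (int) Math.round((b + 1.0) / 2.0 * (n - 1));
                    // map roughly [-1, 0.5] onto black..white; NaN (escaped orbits, if emitted) stays white
                    int g = Double.isNaN(lambda) ? 255
                            : (int) Math.max(0, Math.min(255, (lambda + 1.0) / 1.5 * 255));
                    img.setRGB(col, n - 1 - row, (g << 16) | (g << 8) | g);
                }
            }
            ImageIO.write(img, "png", new File("henon-lyapunov.png"));
        }
    }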


    The big swaths of flat white are regions where the system becomes unbounded. It's interesting that the bottom right portion has some structure to it that's possibly fractal. The top right portion, between b = 0.0 and b = 0.5 and from a = 1.0 to a = 1.6, is really the only region on this image that's chaotic (where the exponent is greater than zero). There's a lot more structure here to look at but I'll leave that to you. As a followup it'd be cool to zoom in on the bottom right corner and run this again.

    conclusion


    So yes, it's possible to use Hadoop to do massively parallel scientific computing and avoid the question of big data entirely. Best of all it's easy. 

    The notion of exploding a space and doing something with each tile in parallel is actually pretty general and, as I've shown, super easy to do with Hadoop. I'll leave it to you to come up with your own way of applying it.