Data Day Texas 2015 Recap

Saturday was Data Day Texas (Twitter), a single-day conference covering a variety of big data topics up at the University of Texas’s conference center.  I went in my HP Helion big data guy role, and my wife Irma went as a Python developer and PyLadies ATX organizer.  I’ve written up some notes on the conference for those interested but unable to attend.  As far as I know, no recordings were made, so these notes may be more useful than they would be for a better-archived conference.

The conference was held at the University of Texas’s Conference Center.  It’s a nice facility, and probably appropriate for the number of people, but I think the venue where they hold Lone Star Ruby is a little friendlier.  Conference organizers estimated the turnout at about 600 folks.  From what I saw, when presenters asked questions like ‘how many of you are x’, the audience breakdown was something like:

  • 70% app developers (not clear # of big data app vendors vs devs wanting to use big data)
  • 10% data scientists
  • 10% business types
  • 10% ops people

The big takeaways were that landscape immaturity is a big deal, and that it’s forcing people to weigh trade-offs between the approaches they think are right and the ones with the most traction (a specific example was Samza vs. Spark Streaming at Scaling Data), because nobody wants to commit to building out all the features themselves or get stuck with the also-ran.  This is a problem for serious developers who want to architect or build systems with multi-year lifespans.  Kafka got mentioned a lot as a glue piece between parts of data pipelines, both at the front and at the back.  Everybody was talking about Avro and Parquet as best-practice formats, with lots of calls not to just throw CSVs into HDFS.  There was a Python data science talk that ended on a somewhat gloomy note (the chance to build a core Python big data tool may have passed, and a lot of work will need to be done to stay competitive; slides at http://www.slideshare.net/wesm/pydata-the-next-generation).

The specific sessions I went to:

A New Year in Data Science: ML Unpaused by Paco Nathan from Databricks

A talk that wandered through the ecosystem.  Paco’s big into containers right now.  Things he specifically called out as good:

The Thorn in the Side of Big Data: Too Few Artists by Christopher Ré

A Few Useful Things to Know about Machine Learning by Pedro Domingos

He emphasized focusing on features, not algorithms, as you develop your big data solutions.  Don’t get tied to a model; our practices all revolve around proving or disproving models.  Build something that helps you build models.

Machine Learning: A Historical and Methodological Analysis (historic; AI Magazine, 1983)

He recommended the Partially Derivative Podcast, too.

Application Architectures with Hadoop by Mark Grover

Related to the O’Reilly book: http://shop.oreilly.com/product/0636920033196.do

Mark talked about the tradeoffs likely to be weighed in building a Google Analytics-style clickstream processing pipeline: Avro and Parquet, optimizing partition size (more than a gig of data per day = daily partitions, less = weekly/monthly), Flume vs. Kafka and Flume plus Kafka, the Kafka Channel as a buffer to avoid duplication, Spark Streaming as a micro-batch framework, and the tradeoffs of resiliency vs. latency.  I think the clickstream analytics example is one of the ones in the book, so if this is interesting and you want more details, just buy an early access copy.
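To make that partition-sizing rule of thumb concrete, here’s a hypothetical Python helper (my illustration, not Mark’s code; the exact threshold and path layout are assumptions):

```python
from datetime import date

# Sketch of the partition-sizing heuristic: roughly a gig or more per day
# justifies daily partitions; below that, monthly partitions avoid filling
# HDFS with swarms of tiny files.
def partition_path(base, day, daily_gb):
    if daily_gb >= 1.0:
        return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}"
    return f"{base}/year={day.year}/month={day.month:02d}"

print(partition_path("/data/clickstream", date(2015, 1, 10), daily_gb=4.2))
# -> /data/clickstream/year=2015/month=01/day=10
```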

Beyond the Tweeting Toaster by P. Taylor Goetz

A general talk about sensors, Arduino, and Hadoop.  The demo was a tweeting IoT device, and Irma won it in the giveaway!

Real Time Data Processing Using Spark Streaming by Hari Shreedharan

Hari talked about Spark Streaming’s general use cases.  Likely flow was:

Ingest (Kafka/Flume) -> Processing (Spark Streaming) -> R/T Serving (HBase/Impala)

He talked about how Spark follows the DAG to re-create results as its fault-tolerance model.  This was pretty cool, and an interesting way of thinking about the system.  Because you know all the steps taken to create the data, you can re-generate any part of it you lose by tracing it back and re-running those steps on that data subset.  Spark uses Resilient Distributed Datasets (RDDs) to do this, and Spark Streaming essentially creates timestamped RDDs based on your batch interval (default 2 seconds).

There’s good code reuse between Spark Streaming and regular Spark, since you’re running on RDDs in the same code execution environment.  No need to throw your code away and start over if you want to switch between batch and micro-batch.
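Here’s a minimal PySpark sketch of that reuse (my own toy example using the classic RDD API, not Hari’s code; the paths and ports are placeholders): the same function runs over a static RDD in batch mode and over each 2-second micro-batch in streaming mode.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# One transformation, two execution modes.
def count_errors(rdd):
    return (rdd.filter(lambda line: "ERROR" in line)
               .map(lambda line: (line.split()[0], 1))
               .reduceByKey(lambda a, b: a + b))

sc = SparkContext("local[2]", "reuse-demo")

# Batch: apply it to a file already sitting in HDFS.
batch_counts = count_errors(sc.textFile("hdfs:///logs/2015-01-10"))

# Micro-batch: apply the same function to every RDD in a 2-second stream.
ssc = StreamingContext(sc, 2)
stream_counts = ssc.socketTextStream("localhost", 9999).transform(count_errors)
stream_counts.pprint()
```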

Containers, Microservices and Machine Learning by Paco Nathan

On the container and microservices front, Paco recommended watching Adrian Cockcroft’s DockerCon EU keynote, State of the Art in Microservices.  He then walked through an example using TextRank and PageRank to create key phrases out of a connected text corpus (specifically Apache mailing lists).
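For flavor, here’s a toy TextRank-style sketch (my reconstruction, not Paco’s pipeline): build a co-occurrence graph over words, run PageRank on it, and read the top-ranked terms off as candidate key phrases.

```python
import networkx as nx

def keywords(sentences, window=2, top_n=5):
    graph = nx.Graph()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            # Link each word to its neighbors within the co-occurrence window.
            for other in words[i + 1 : i + 1 + window]:
                graph.add_edge(word, other)
    ranks = nx.pagerank(graph)  # PageRank over the word graph
    return sorted(ranks, key=ranks.get, reverse=True)[:top_n]

print(keywords([
    "spark streaming reads events from kafka",
    "kafka feeds spark processing pipelines",
]))
```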

He mentioned Databricks’ Spark training resources, which look extensive: http://databricks.com/spark-training-resources

Building Data Pipelines with the Kite SDK by Joey Echeverria

http://www.kitesdk.org/

Kite is an abstraction layer between the engine and your data that enforces best practices (always use compression, for instance).  It uses a database -> table -> row model that it calls namespace -> dataset -> entity.  He mentioned that they’d seen little performance difference between raw HDFS and Hive for ETL tasks, all things considered.  Use Avro for row-based data (when you need whole records for context) and Parquet for column-oriented data (when you need to sum/scan or only touch a few columns).
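To illustrate the row-vs-column choice outside of Kite (a sketch using generic Python libraries, fastavro and pyarrow, rather than Kite’s own API), here’s the same tiny dataset written both ways:

```python
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

schema = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}
records = [{"user": "alice", "ts": 1421971200},
           {"user": "bob", "ts": 1421971260}]

# Avro: row-oriented, good when you consume whole records at a time.
with open("clicks.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# Parquet: column-oriented, good when you scan or aggregate a few columns.
table = pa.table({"user": [r["user"] for r in records],
                  "ts": [r["ts"] for r in records]})
pq.write_table(table, "clicks.parquet")
```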

Building a System for Event-Oriented Data by Eric Sammer, CTO of Scaling Data

A great talk on practical problems building large-scale systems.  Scaling Data has built a product that essentially creates a Kafka firehose for the enterprise datacenter, re-creating a lot of tooling I’ve seen at Facebook and other places, and making a straightforward-to-install enterprise product out of it.  They pipe stuff into Solr for full-text search (à la Splunk), feed dashboards for alerts, archive everything for later forensics, etc.

He recommended this blog post by Jay Kreps at LinkedIn on real-time data delivery mechanics:

http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

He said their biggest nut to crack was the two-phase delivery problem: guaranteeing that events land exactly once.  They write to a tmp file in HDFS, close the HDFS file handle and ensure a sync, then mark the batch as read in Kafka, then go process the tmp file.
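Here’s a local-filesystem sketch of that landing protocol (plain files stand in for HDFS and the Kafka offset store, and the helper names are my own):

```python
import os

def land_batch(events, offset, data_dir, offset_file):
    tmp = os.path.join(data_dir, f"batch-{offset}.tmp")
    final = os.path.join(data_dir, f"batch-{offset}.events")

    # 1. Write the batch to a tmp file and force it to durable storage
    #    (the equivalent of closing the HDFS handle and ensuring a sync).
    with open(tmp, "w") as f:
        f.write("\n".join(events))
        f.flush()
        os.fsync(f.fileno())

    # 2. Only after the data is durable, record the offset as consumed
    #    (the "mark as read in Kafka" step).
    with open(offset_file, "w") as f:
        f.write(str(offset))

    # 3. Finally, promote the tmp file for processing.  A crash before step 2
    #    just means the batch gets redelivered and rewritten; a crash after
    #    it leaves a durable tmp file that recovery can safely promote.
    os.rename(tmp, final)

land_batch(["evt-1", "evt-2"], offset=42, data_dir=".", offset_file="offsets")
```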

He talked a lot about Summingbird.  He said it was probably the right way to add stuff up, but that it was too big and gangly, so they’d written something themselves.  He recommended this paper by Oscar Boykin on Summingbird, which covers a lot of the problems in building this kind of system.

He also talked about Samza (the best approach for the transform part of the pipeline, in their opinion, but low level and lacking community support), Storm (rigid, and slow in their experience), and Spark (they hate it, but the community likes it, so they use it).

Wrapup

It was a harried conference (no lunch break, no afternoon break; if you were feeling burned out, you had to skip a session), but that might be the nature of a one-day brain-binge.  The organizers were happy to reserve a table for PyLadies in the Data Lounge, where they held a mini-meetup and got a little outreach done.

Dwarf Fortress, Facebook, Big Data and the Search for Story

Last night, after driving home from the Austin PyLadies meetup, my wife sat in our driveway for 20 minutes listening to the end of an episode of WNYC’s Radiolab.  Later, after we’d headed to bed, she spent another 20 minutes retelling the story to me, minus Radiolab’s flourish and production.  The story was still interesting secondhand, and comes down to this (I’ll wait if you’d like to go listen to the episode; I’m sure it’s excellent):

Two people discover hundreds of letters from WWII on the side of Route 101.  They’re from soldiers replying to a woman on the homefront.  The soldiers call her mom, but she isn’t their mother.  The two ask around; no one knows anything about the letters.  One of the discoverers, a creative writing professor, ends up using them as projects for his students: he gives each student a letter, and their task is to create a story around it.  A soldier, a woman stateside, an unlikely connection.  The other discoverer wants to track down relatives; she wants to uncover the truth.  She ends up discovering it, but he’d rather not know.  He wants the possibilities.

Even told second hand, the story stuck with me on a meta-level.  There aren’t a lot of things that would make my wife sit in the car in the driveway for 20 minutes listening to the radio, but a good story is one.  We love stories, we love it when they’re well crafted and well told.  But we also love the possibilities of them.  Sometimes we don’t want the truth, we want magic, we want to dream the dream of what could be.  Sometimes the truth can’t exist, and the closest we can get is a dim outline of it.  Sometimes the dream is better.

The Promise: Stories that Tell Themselves

A few days ago I ran across a blog post by Tynan Sylvester, a designer on BioShock Infinite.  It’s all about the dream of simulations for game designers: how we think that by creating more and more complex systems, we might eventually build one complex enough to manifest stories.  Austin Grossman’s latest novel, YOU, is about that, in a way; the protagonist is a game designer and the antagonist is just a manifestation of some long-running game rules.  As game designers, we want to design games that surprise us.  That’s the ultimate payoff: to build a game that entertains you, not just a twitch game that’s enjoyable for its mechanics, but a game with stories compelling enough to make you sit in the car in the driveway for 20 minutes at 9 o’clock at night.

Lots of game designers have tried.  Tynan talks specifically about systems in early versions of BioShock where the player would have to play autonomous bots (splicers, gatherers, and protectors) off each other to progress.  They hoped that amazing, emergent gameplay would be the result.  In the end it didn’t work, and the game moments they’d hoped would happen spontaneously ended up being heavily scripted.  Players crave story, but that story can’t be left up to their persistence and chance, especially in a commercial title.  In that environment, a great story has to be guaranteed.

Dwarf Fortress: Madness in Text Mode

There are a few notable exceptions to this principle, and they’re mainly smaller games driven by singularly minded creators.  The best example is Dwarf Fortress, a massive and inscrutable simulation game where the player takes on the role of an overseer, and the titular dwarves are simulated autonomous entities inhabiting the world.  Dwarves have names and hair colors, what Tynan calls Hair Complexity: things that add perceived simulation depth without affecting anything else.  (When was the last time you played an RPG where a plot point hinged on your hair style?)  They also have more integrated systems like hunger and social needs.  They have personalities, they get sad, and sometimes they go crazy.  The dwarves live in a randomly generated world, so your game isn’t like my game, and even my second game won’t be like my first.

Dwarf Fortress has a very dedicated core following, and one of the reasons is that it lives right at the edge of apophenia, the experience of seeing meaningful patterns emerge from random data.  At the core of Dwarf Fortress is a collection of rules governing behavior.  A dwarf without food will eventually starve.  A dwarf without personal interaction may eventually go crazy.  Dwarves are scared of wolves.  Dwarves exist in a fractally generated world, a world that feels real because it mirrors patterns in nature.  Therefore, as more and more rules get layered on, and more and more people play more and more games and get better and better at creating experimental mazes for these digital rats to play in, stories begin to appear, or so we perceive.

Two of the most famous stories to come out of Dwarf Fortress games are Boatmurdered, the tale of an epic game played out by members of the Something Awful forums in 2007, and Bronzemurder, a beautiful infographic-style tale of a dwarf fortress and a terrible monster.  Go read it, it’s great.

Dwarf Fortress didn’t generate these stories, though.  People played the game, sometimes hundreds or thousands of times, and while gazing into the mandala of the game, they nudged and pulled the threads of the world and created stories from the events that occurred there.  Dwarf Fortress isn’t a windup toy, it’s a god-game, and the player’s impact on the game world is far from negligible.  The stories generated there are as much created by the players as by the game.

I Fight For the Users

While my wife was out at PyLadies last night, I coincidentally watched TRON: Legacy.  It occurred to me as I was thinking about writing this post that it’s a movie about this same possibility: the dream of a world inside a computer, a world created by a brilliant programmer, a world that once set in motion can create stories, unexpected events, and enthralling narrative.  The creator steps aside and no longer controls the game from the top down.  The creator becomes a god among men, watching things unfold from their level.

Tron: Legacy - Quorra

In TRON: Legacy, the magic of digital life comes in the form of Quorra, the last of the ISOs, isomorphic algorithms that appeared spontaneously from the wasteland of the computer.  Digital DNA, digital life.  Enough rules, enough circuitry, enough care, and magic happens.  That premise is exciting, and to programmers it’s intoxicating.  For those of us in the digital generation, that’s the dream we live with.  That’s what we keep trying to make happen wherever we go and whatever project we work on, be it big data or software bots.

But the lone programmer, no matter how brilliant and no matter how long they work, can only produce so much code.  Stories from one person only grow so far, only change so much, and rarely surprise and enthrall.  Dwarf Fortress as a dwarf isn’t a game most people would play.  It’s hard to see the overall story, and the game isn’t good at presenting it.  But if there were more players…

EVE Online: More Interesting to Read About Than to Play

If it’s possible (albeit insanely difficult) for stories to appear in a single-player game, it must be easier for them to manifest in a multiplayer game, right?  Not necessarily.  Games like World of Warcraft have largely fixed, planned-out stories.  It comes back to the challenge BioShock faced: complex systems are exciting to designers, but players want immediate story gratification.  Complex systems take dedication to understand, dedication most players don’t have.  When new multiplayer games are announced they sometimes hint at players making a real impact on the world, but those systems usually fail to live up to the hype.  The latest game to promise this is The Elder Scrolls Online.  We’ll see if they can do it.

One game that does this and thrives is EVE Online.  EVE is a massively multiplayer online space combat simulation, one that spans an entire universe.  It’s possible to play EVE as a loner, but it’s also possible to align yourself with a faction and have your small efforts merge with those of hundreds or even thousands of others: to build armadas and giant dreadnought ships, to control entire solar systems and even galaxies.  The designers and administrators of EVE take a largely hands-off approach.  They don’t want to kill the golden goose, so they design the game for balanced conflict and let the players sort it out.

Every once in a while something epic happens in EVE: a massive fraud, an invasion a faction planned for months, or a random accident that leads to a game-rebalancing war.  There are battlefield reports, and once the space dust settles, people start to put together a history, and an accessible storyline appears.  Here are a few great EVE stories.  More people probably enjoy the reports of epic battles in EVE through these stories than actually play the game.  To quote a MetaFilter comment thread: “This game sounds stressing as hell if you really play it and not just dither around. Fascinating to read about, however, almost like news from a parallel universe.”

You could say that EVE is a computer program for generating stories, and in fact they’ve even made a deal to do a TV show based on player stories from the EVE universe.  Except again we find that EVE isn’t the thing generating the stories; EVE is just a place where the stories happen.  To a player only experiencing the events inside the game it may seem mysterious and amazing, and it certainly seems that way to those of us who read about the events afterwards, but it’s really just a sandbox.  People play pretend with enforceable rules, but you can’t separate a story that happens inside of EVE from the real-life stories that happen outside of it: the scheming that happens on IRC or in forums, the personal vendettas, the flexible allegiances, and the real-world money that flows through the system.  There’s no way to watch something occur inside of EVE and, even with perfect clarity on everything that happened inside, know for sure what really caused it.  If you take away the players, the legions of dedicated fans scheming and plotting, you just have an empty universe.

Facebook and the Timeline of Truth

I think a lot of web developers secretly wanted to be game designers.  Becoming a game designer is difficult: there aren’t many jobs, and the hours are terrible.  Instead we build web sites, but we’re building systems too, and we want to tell stories.

I joined Facebook back in April of 2006.  I had an @swt.edu address from Southwest Texas State (now Texas State University) from an extremely brief stint (under a day) as an IT staff member, so I got in a few months before they opened it to everyone.  Getting into a new, exclusive social network is a bit like finding a new simulation.  We hope the software can tell us new stories, that it can make some sense of the data it has.  With Facebook, the promise was that if it collected enough information about us, it could tell us that magical story.  That’s what Timeline was supposed to do: give Facebook enough photos, enough checkins, enough friend connections, enough tagged posts, and it would be able to tell the story of our lives.

Facebook Timeline

In the end, though, Timeline doesn’t tell you a real story.  It reminds you of stories you’ve heard and experienced, but Facebook is only a dumb algorithm working with imperfect data.  It’s smart enough to target ads, but it can’t understand meaning, and it can’t remix the data in really compelling ways.  It can’t be Radiolab.  Most of the time I just want to turn off the prioritization it comes up with.  Its attempts at story are so bad I’d rather use my own organic cognitive story filters.

With every new Facebook feature announcement, with Google+ or the next thing that processes all your activity, the promise is that the system will get better at telling those stories.  We want to believe it will happen.  We want to believe that a couple thousand web developers and a couple billion dollars could create a story machine, but I’m not sure they can.  I was reading an article about HP’s R&D budget the other day that said Facebook invests 27.5% of revenue in R&D, a larger percentage than any other company tracked.  You can bet a good chunk of that is going towards the search for story, in some form or another.

Weaving a Web

I’d be remiss if I didn’t mention Weavrs at this point, since they are essentially digital actors that derive stories from the mess of social media.  Weavrs are designed specifically for apophenia: they produce content one step up from random and rely on our desire for patterns to throw away the things that don’t fit.  We project stories onto them, and for a project with the limited resources it had, it’s exceedingly good at encouraging that.

My weavr twin is posting about HP Moonshot servers.  That’s almost eerie, but it’s also posting about hockey tickets.  The story makes sense if I’m picky about the things I include, but it isn’t an internally consistent narrative.  The narrative is impressed on it by the people who see it, like reading digital tea leaves.  Your story of my weavr is different from mine.

With enough resources and time, weavrs might become a real story machine.  That’s a moonshot program, though, and I don’t know who’s going to step forward and make that happen.  Investment follows money, and right now the money is racing towards big data.

Autonomy: Billions and Billions

The lure of story, the promise of meaning from the chaos of data, isn’t limited to games or the social web.  It’s the romantic beating heart of big data.  It’s the stories about Target knowing you’re pregnant before you do.  It’s what lured HP to spend $8.8 billion more than it was worth to acquire Autonomy.

Autonomy’s main product is called the Intelligent Data Operating Layer, or IDOL (symbology, ahoy!).  They call processing information with it Meaning-Based Computing.  From what I’ve heard it’s certainly good at what it does, but while it promises Meaning from Data, and that promise separated HP from 9 Instagrams or 2,500 Flickrs, there has to be some apophenia at work here.  Just as watching solar-system battles inside of EVE gives you only a piece of the story, and playing hundreds of games of Dwarf Fortress merely yields games worth telling stories about, the system data is never the entire picture.

I really like Stephen Wolfram.  Stephen believes in the fundamental computability of everything.  While I love reading his blog posts, and I’m interested in and admire his ideas, I have to wonder how far the hyperbole is from actual execution.  Given enough computable facts and enough understanding of the structure of narrative, a perfect Wolfram|Alpha should be able to tell me stories about the real world.  But it can’t.  They aren’t even trying to approach that.  Wolfram|Alpha isn’t creating Radiolab.  They want answers, not stories.  You know what tells stories?  Dirty, messy, all-too-human Wikipedia.

A Different Kind of Magic

My friend Matt Sanders works for a Bay Area company called Librato.  Librato is a big data startup, having pivoted from some other work to running a service that collects vast amounts of metrics and provides dashboards on top of them.  With Librato Metrics you can feed in data points, set alert triggers, create graphs, and watch activity.  It’s big data without the prediction.  It promises no magic, but relies on our own.  It optimizes data for processing by human eyeballs.

The 3 pounds of grey matter between your ears is still the best computer we have, running the best software for deriving stories and making sense of data.  Librato works because it doesn’t try to be what it can’t.  Google Analytics tries to offer Intelligence Events, but more often than not it can’t tell you anything more helpful than that visits from Germany are up 34%.  You would think that by combining traffic-source analysis with content changes and deep data understanding, Google would be able to tell you why visits are up from Germany, but most of the time that basic percentage is the best it can offer.  It still takes those 3 pounds of meat to pull the data together and interpret it into a story.  While computers may be generating articles on company reports or sports games, they’re not creating Radiolab.

Wrapping Up

I think there’s still a lot of room for innovation here.  The Archive Project I dreamed of long ago is essentially a system for telling stories and discovering meta-stories.  Maybe someone will finally build it.  Maybe the next Dwarf Fortress will be a world that runs persistently in the cloud, a world where our games interact with other people’s games, where crowdsourced Hair Complexity snowballs until you can get lost in the story if you want to: a game where, if you turn off a random path and follow it down to the river, you’ll find a fisherman who will tell you a tale interesting enough to make you sit in your car for 20 minutes, enthralled by a narrative.

Maybe the framing of a story is what big data needs to become personally relevant.  Maybe that’s its magic trick.  Maybe narrative is the next great big data frontier.

Future Past

iPad

I sometimes wonder about the generation of kids growing up today, in this big data, analytic-driven, always-on world.  I wonder how they will embrace it, like we embraced computers and connectivity.  I wonder if they’ll have the ability to hear the prognostications of the computer, to listen to the story from the machine, and consider it a kind of truth.  To internalize it, but also keep it separate.  To know the machine knows a truth, but not necessarily the absolute truth.  Maybe that will be their power, the thing they can do that those of us from the generation before can’t. Maybe that is where the dream finally comes true.