Tuesday, August 4, 2015

Online Course Review: Coursera and Stanford University's Mining Massive Datasets

It's been quite a while since I last posted—and eight months since I finished the class I'm reviewing. As it happens, I finally got a job as a data scientist in late January, and work has kept me busy. That job will be the subject of my next post, but right now, we're talking about Mining Massive Datasets, offered by Coursera and three professors from Stanford University, Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

The seven-week course covers the same ground as the trio's book Mining of Massive Datasets, though in much less detail. (That link, incidentally, is to the e-book; if you really want the hardcover, you're welcome to follow this sponsored link to Amazon.) It's now been offered twice on Coursera, with a third iteration set to start on September 15th. Oddly, I never intended to take this class: a friend of mine was interested in it, and I signed up so that we could take it together, but initially I had passed on it. I had taken a look at the schedule, seen a few topics that I had covered before in other courses, and decided that I wouldn't get a lot out of it.

What I missed, in my hasty scan of the course description, was that Mining Massive Datasets is not the typical data science course that shows students how to put useful algorithms into practice through code. Instead, this course is about the algorithms themselves: how they work, why they work at scale, and how they've been modified to improve performance or cover different situations. There is a lot of math: this is not a course for someone without a solid background in calculus and linear algebra (it's not like you need to remember how to integrate esoteric functions—but you do need to understand the basics). There are not, on the other hand, any programs to write: many of the exercises absolutely can't be done without writing short scripts or using a statistical language from the command line, but the professors don't require any specific language, and the code isn't graded, only the answers. The point is not to create functioning implementations of the algorithms in question, but rather to understand the nuts and bolts of how they work. The exercises are challenging, and sometimes require consulting the e-book, especially when the complexity of a topic makes it hard to grasp in a short lecture.

Although the bulk of the course is devoted to algorithms, Week 1 provides an excellent description of how HDFS and MapReduce work—without ever giving details on Hadoop or the various languages used to write mappers and reducers. In fact, I came away from Mining Massive Datasets with a far better conceptual grasp of distributed file systems than I got from the Udacity course devoted entirely to the subject. (See my review of that course here.)
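For readers who haven't seen the MapReduce model, here is a minimal single-machine sketch of the idea in Python. It is not code from the course or the book, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own illustrative choices. A real framework such as Hadoop would run the mappers and reducers on many machines and perform the shuffle over the network; this toy version only shows the three conceptual steps on a word-count problem.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data about data"]
    print(word_count(docs))
```

The point of the split is that each mapper call looks at only one document and each reducer call looks at only one key, which is why the work can, in principle, be spread across many machines.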

It should be said that Jeff Ullman is an excruciatingly monotonous lecturer who sounds like he's reading everything directly from notes; you likely wouldn't be at all surprised if he suddenly called out, "Bueller...Bueller...." In addition, his explanations are not as clear as those of Leskovec and Rajaraman. In fairness, though, Ullman tends to cover the most complex topics in the course, and I was always able to figure things out by consulting the book (which covers the material in greater depth, anyway).

In summary, I strongly recommend Mining Massive Datasets for anyone who wants to understand the nitty-gritty of algorithm design for big data. The course is not, however, for the faint of heart. You could make a very successful career in data science without ever taking it or anything like it, but taking it will certainly make you better at the profession.
