Friday, February 15, 2013

Stanford's Introduction to Databases vs. Big Data University's SQL Fundamentals I: A Comparison and Review of Online Courses

It's been too long since I last posted here. To put things simply, I found that I was spending more time maintaining the blog as a resource for others to learn data science than I was spending actually learning data science myself. Now that I've got more experience with online courses under my belt, I'd like to share my insights.

When I first became interested in data science, I began to take Big Data University's SQL Fundamentals I course. This course uses IBM's free DB2 Express-C platform, and content and links on the DB2 webpages, as well as the branding in some of the older course material, indicate a connection of some sort between Big Data University and IBM itself, though I couldn't find any statement of the nature of this connection. Posts in the Big Data forums mention that SQL Fundamentals I was originally a true course, with a schedule and interaction between teachers and students; it's now a self-paced course that makes use of video lectures by a variety of instructors, exercises (downloaded in PDF form), required reading (from free e-books produced by the DB2 community), and a final exam. It covers not only the basics of SQL (including queries and database modification), but theoretical modules as well, specifically, relational algebra and relational design theory.

Stanford University's Class2Go offers only a small number of courses (three, at present), but these include a 10-week Introduction to Databases, taught by Professor Jennifer Widom, which was originally offered in Fall Quarter of 2011. It's being offered again now, in Winter Quarter of 2013. Rather than partnering with a third-party provider of massive open online courses (MOOCs), such as Coursera or edX, Stanford has opted to go it alone, hosting the course on its recently established Class2Go platform (interestingly, Stanford professor Andrew Ng's popular Machine Learning course is still on Coursera, which Ng co-founded). Introduction to Databases uses video lectures, interactive online quizzes and exercises, and exams; supplemental readings are suggested but not required. This review will address the parts of the course that cover the same subjects as the Big Data University course: parts of the introduction, relational algebra (most of week 2), SQL (week 3), and relational design theory (week 4); the 10-week course also covers XML, UML, OLAP, NoSQL, and some advanced SQL topics, such as triggers, views, and authorizations.

First of all, full disclosure: I didn't get past "Getting Started" and "Lesson 1" in the Big Data University course before I started taking the Stanford one. That may not sound like a lot, but it includes most of the reading in the course (seven of the eight chapters, and the most difficult ones), and I got bogged down with that, though I did finish all of it. That means, ironically enough, that I never completed the actual SQL portions of the Big Data course, though I've since examined some of the lectures and exercises for those portions. I've been through only the first four weeks of the Stanford course, but I wanted to publish this review in time for readers to join the course late and still be able to get something out of it.

On the whole, I think the Stanford course is the better one, mostly because the lectures contain more material, and the exercises are more demanding. Specifically, the Stanford course features lectures that go into greater depth, and exercises in relational algebra and writing SQL queries that require a lot more thought than the Big Data exercises on the same subjects; there's more emphasis on the logic being applied, rather than merely learning rules and syntax, but, as is often the case, struggling with difficult problems helps to solidify memory of rules and syntax. It also helps that the Stanford course is more interactive: I haven't made any use of the virtual office hours provided by the course's teaching assistant, and both courses have forums, but the Stanford course has a few nice extra features, such as short quizzes during lectures, and automated online exercises and quizzes that allow you to check whether your answers are right or wrong (often many times over) without revealing the correct answers and thereby preventing you from working them out for yourself. The Big Data University course does cover some syntactical nuances of SQL that the Stanford course misses. Moreover, as someone with more training in statistics than in computers, I already have a decent grasp on the logic used in relational algebra and SQL queries, because it's quite similar to what a statistician uses in recoding variables and filtering cases. Nonetheless, I think that the Stanford course, because its exercises ask more of a student, does a better job of teaching rules and syntax.
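To make the comparison with statistical work concrete, here is a minimal sketch of filtering cases and recoding a variable in SQL. It isn't taken from either course: the `respondents` table, its columns, and the age cutoffs are all made up for illustration. It uses Python's built-in `sqlite3` module, so there's nothing to install.

```python
import sqlite3

# In-memory SQLite database; the table and its data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, age INTEGER, income REAL)"
)
conn.executemany(
    "INSERT INTO respondents (id, age, income) VALUES (?, ?, ?)",
    [(1, 17, 0.0), (2, 34, 52000.0), (3, 68, 31000.0), (4, 45, 88000.0)],
)

# WHERE filters cases (drop minors); CASE recodes age into categories.
query = """
SELECT id,
       CASE WHEN age < 40 THEN 'under 40'
            WHEN age < 65 THEN '40 to 64'
            ELSE '65 and over'
       END AS age_group
FROM respondents
WHERE age >= 18
ORDER BY id
"""
rows = conn.execute(query).fetchall()
print(rows)
# [(2, 'under 40'), (3, '65 and over'), (4, '40 to 64')]
```

The WHERE clause plays the role of a case filter, and the CASE expression does the same job as a recode statement in a statistics package.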

The Stanford course also handles theoretical subjects better. In the Big Data University course, "Lesson 1" includes relational algebra and relational design theory, but the lectures cover only the basics of relational design theory, and skip relational algebra altogether. The rest of the material is relegated to the readings, and while I would normally prefer this, since my reading speed is much faster than the speed of a recorded lecture, these readings, while clearly the result of loving hard work, are not well written. The primary problem seems to be that most if not all of the authors are non-native speakers (one thing you learn grading papers is that native speakers and non-native speakers tend to make entirely different errors), and it's extremely difficult to write coherent text in a language other than your native tongue. Indeed, the reading on relational calculus (which, in fairness, is a subject not covered by the Stanford course) was so difficult to follow that I never did glean even the most basic principles from it, and I'm a person who got 800s on the math and logic sections of the GRE. My criticism, by the way, doesn't apply to the recorded lectures, which are quite understandable even though some of the lecturers are also authors of the written materials, and all of the lecturers appear to be non-native speakers. The Stanford course is not without its own weaknesses: judging by both my own experience and posts in the course's forums, the lectures on relational design theory, especially the sections on decomposition and normalization, simply don't go into enough detail to allow students to grasp the subjects in question and complete the exercises. The ideas are all there, but not always spelled out. Nonetheless, the Stanford course covers even this material better than does the Big Data course.
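For readers who haven't met decomposition yet, here is a rough sketch of the core idea, under assumptions of my own rather than anything drawn from either course's materials. The `enrollment` relation, its attribute names, and the functional dependency course -> instructor are all hypothetical. The sketch splits the relation into two projections and checks whether joining them back together reproduces the original rows.

```python
# Hypothetical relation Enrollment(student, course, instructor),
# assuming each course has exactly one instructor (course -> instructor).
enrollment = {
    ("ann", "databases", "smith"),
    ("bob", "databases", "smith"),
    ("ann", "statistics", "jones"),
}

def project(rows, indexes):
    """Relational projection: keep only the given columns, dropping duplicates."""
    return {tuple(row[i] for i in indexes) for row in rows}

student_course = project(enrollment, (0, 1))
course_instructor = project(enrollment, (1, 2))
student_instructor = project(enrollment, (0, 2))

# Split on course: the shared attribute determines instructor,
# so the natural join gives back exactly the original relation.
rejoined = {
    (s, c, i)
    for (s, c) in student_course
    for (c2, i) in course_instructor
    if c == c2
}
lossless = rejoined == enrollment

# Split on student instead: student determines neither course nor
# instructor, so the join invents rows that were never in the data.
bad_rejoined = {
    (s, c, i)
    for (s, c) in student_course
    for (s2, i) in student_instructor
    if s == s2
}
spurious = bad_rejoined - enrollment

print(lossless)          # True
print(sorted(spurious))  # the two invented rows for ann
```

The first split is the kind of decomposition normalization is after: nothing is lost, and the instructor for each course is stored only once. The second shows what goes wrong when the shared attribute doesn't determine anything.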

The main drawback of the Stanford course, obviously, is that it's only offered at specific times. As of this writing, the course is in week 5, and it looks like you can still register (the registration page is still up, but I didn't create a fake account just to make sure that it works) and try to catch up. Given the amount of work each week (several hours, with weeks 2-4 being especially tough), this could be difficult for those lacking time or dedication, but Professor Widom does stress that the course is suitable for "a la carte" learning, picking and choosing the topics of interest. The course materials were available online after the close of the last instance (Look, I used database jargon!) of the class, in Fall 2011, but they were taken down and re-used for this one; hopefully, the materials will again be offered online for those who need to learn about databases before the next round of the course is offered.

Finally, one strong point of the Big Data University course that bears mentioning is that the lectures cover downloading and installing a specific SQL platform, DB2 Express-C. By contrast, the Stanford course relies on a web front-end superimposed on SQLite. That means that students don't have to worry about installing any software (they're invited to install software and download the exercise databases if they like, but there's no requirement to do so, and I've been quite successful in doing everything online), but it is nice that I learned the basics of DB2 in the Big Data course. The flip side of that, of course, is that DB2 is only one of many SQL platforms, though it's admittedly one of the most popular.

UPDATE: Widom has announced that after the current iteration of Introduction to Databases concludes, all of the class materials (including interactive exercises) will remain online.
