MIT AI for Code and Science Workshop - IAP 2023

AI for code and science is a very active area of current research that bridges machine learning, programming languages, and software engineering. A substantial portion of recent work in this domain has found success by blending symbolic and neural techniques.

In this 2-day tutorial session, students will get practical experience with artificial intelligence tools for developing code targeted at science applications. Students will learn how to combine neural network methods with symbolic program synthesis and how to apply these techniques in science. Students wishing to take part should have some programming experience.

As part of the session, we will provide a practical overview of systems and tools for getting started with AI for code and science applications. Students will also hear talks from researchers in the space. Hands-on activities may include using popular transformer-based models such as CodeBERT, CodeT5, and Codex. We will touch on recent ideas that can be applied to improve or adapt each of these models to computer science problems such as program repair. This tutorial will be hands-on, with time for participants to experiment with working code, try to solve real benchmark cases, and get feedback on ideas they may want to pursue in the future.

Workshops

Background and Introductory Remarks

Video link: https://youtu.be/QsYmu5eYXpE

Speaker: Omar Costilla-Reyes, MIT CSAIL

Personal Webpage

Neurosymbolic Programming for Science

Neurosymbolic Programming (NP) techniques have the potential to accelerate scientific discovery. These models combine neural and symbolic components to learn complex patterns and representations from data, using high-level concepts or known constraints. NP techniques can interface with symbolic domain knowledge from scientists, such as prior knowledge and experimental context, to produce interpretable outputs. We identify opportunities and challenges between current NP models and scientific workflows, with real-world examples from across the natural and social sciences.

Workshop 1

Video link: https://youtu.be/SGLRsnv9-E0

Speaker: Minghao Guo, MIT CSAIL

Personal Webpage

Data-Efficient Graph Grammar Learning for Molecular Generation

The problem of molecular generation has received significant attention recently. Existing methods are typically based on deep neural networks and require training on large datasets with tens of thousands of samples. In practice, however, the size of class-specific chemical datasets is usually limited (e.g., dozens of samples) due to labor-intensive experimentation and data collection. Another major challenge is to generate only physically synthesizable molecules. This is a non-trivial task for neural network-based generative models since the relevant chemical knowledge can only be extracted and generalized from the limited training data. In this tutorial, we explore a data-efficient neurosymbolic generative model that can be learned from datasets that are orders of magnitude smaller than common benchmarks. At the heart of this method is a learnable graph grammar that generates molecules from a sequence of production rules. Additional chemical knowledge can be incorporated into the model by further grammar optimization.
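To make the idea of "generating molecules from a sequence of production rules" concrete, here is a minimal sketch. It is not the graph grammar from the tutorial: it assumes a much simpler, hypothetical string grammar over SMILES with a single nonterminal `X`, and the rule set, the `weights` argument, and the function names are all invented for illustration. The weights stand in for the learnable part of the grammar.

```python
import random

# Hypothetical toy grammar over SMILES strings; "X" is the only nonterminal.
# This is an illustrative assumption, not the grammar learned in the tutorial.
RULES = [
    ("X", "CX"),     # extend the chain with a carbon atom
    ("X", "C(C)X"),  # add a carbon that carries a methyl branch
    ("X", "O"),      # terminate with an oxygen
    ("X", "C"),      # terminate with a carbon
]
TERMINATING = 3  # index of a rule that removes the nonterminal


def apply_rules(rule_ids, start="X"):
    """Rewrite the start symbol by applying production rules in order."""
    s = start
    for i in rule_ids:
        lhs, rhs = RULES[i]
        if lhs not in s:
            raise ValueError(f"rule {i} is not applicable to {s!r}")
        s = s.replace(lhs, rhs, 1)
    return s


def sample_molecule(weights, rng, max_steps=8):
    """Sample rules according to `weights` until no nonterminal remains."""
    rule_ids = []
    s = "X"
    while "X" in s:
        if len(rule_ids) >= max_steps:
            i = TERMINATING  # force termination so sampling always ends
        else:
            i = rng.choices(range(len(RULES)), weights=weights)[0]
        rule_ids.append(i)
        s = apply_rules([i], start=s)
    return s, rule_ids


ethanol = apply_rules([0, 0, 2])  # X -> CX -> CCX -> CCO
smiles, derivation = sample_molecule([1, 1, 1, 1], random.Random(0))
```

In this sketch a molecule is fully determined by its rule sequence, so adjusting the rule weights changes which molecules are likely to be generated without touching the rules themselves.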

Workshop 2

Video link: https://youtu.be/q6tjKXmhiMs

Speaker: Miles Cranmer, Princeton

Personal Webpage

An Introduction to Symbolic Regression with PySR and SymbolicRegression.jl

PySR is an open-source library for practical symbolic regression, a type of machine learning that discovers human-interpretable symbolic models in the form of simple mathematical expressions. PySR is built on a high-performance distributed backend, SymbolicRegression.jl, which offers a flexible search algorithm and interfaces with several deep learning packages. In this tutorial I will describe the nuts and bolts of the search algorithm and how PySR may be used in machine learning and scientific workflows. I will review existing applications of the software to science (https://astroautomata.com/PySR/papers/), and then present an interactive coding tutorial where we will go through several example symbolic regression problems with different levels of customization. Following this, we will look at using PySR as a distillation tool for translating deep neural networks into an interpretable scientific language, and go through additional examples.
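For readers new to symbolic regression, the sketch below shows the core idea in plain Python: search a space of candidate expressions and keep the one that best fits the data. It is a deliberately naive exhaustive search over a tiny, hypothetical expression family of the form `f(x) op g(x)`, not PySR's search algorithm or API; the operator tables and function names are assumptions made for this example.

```python
import itertools
import math

# Hypothetical building blocks for candidate expressions.
UNARY = {
    "sin": math.sin,
    "cos": math.cos,
    "square": lambda v: v * v,
    "id": lambda v: v,
}
BINARY = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}


def candidates():
    """Yield (name, function) pairs for every expression f(x) op g(x)."""
    for (fn, f), (on, op), (gn, g) in itertools.product(
        UNARY.items(), BINARY.items(), UNARY.items()
    ):
        # Bind f, op, g as defaults so each lambda keeps its own operators.
        yield f"{fn}(x) {on} {gn}(x)", (lambda x, f=f, op=op, g=g: op(f(x), g(x)))


def fit(xs, ys):
    """Return (mse, name) of the candidate with the lowest mean squared error."""
    best = None
    for name, expr in candidates():
        mse = sum((expr(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if best is None or mse < best[0]:
            best = (mse, name)
    return best


# Synthetic data generated from a known law, y = x * cos(x).
xs = [0.1 * i for i in range(1, 30)]
ys = [x * math.cos(x) for x in xs]
best_mse, best_name = fit(xs, ys)
```

Exhaustive enumeration only works for very small expression spaces; practical tools search far larger spaces and typically also weigh accuracy against expression complexity.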

Workshop 3

Video link: https://youtu.be/odyylffr290

Speaker: Jose Cambronero, Microsoft

Personal Website

Learning to Automatically Fix Compiler Errors in C

In this tutorial, we will introduce participants to the automated repair of compiler errors. We will focus our efforts on a collection of C programs written by students in an introductory programming class. We will explore different neural approaches to fixing such compiler errors, including large pre-trained language models and smaller fine-tuned models. By the end of this tutorial, participants will have practical experience with multiple repair approaches, pointers towards extensions/improvements of the approaches surveyed, and a foundation to explore automatically repairing such errors in their own research.

Workshop 4

Video link: https://youtu.be/dxZr-I2rCgQ

Speaker: Shashank Srikant, MIT CSAIL

Personal Webpage

Generating Code that Activates Our Brains

In this tutorial, we will introduce how our brains respond during code comprehension. We will then explore how a program can be automatically modified so that the modified program is predicted to elicit high responses in specific regions of our brains. The system we will build to achieve this will introduce and utilize backpropagation and the Gumbel-softmax trick. We will hack through some popular code models available on Hugging Face to build this system.
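As background for the Gumbel-softmax trick mentioned above, here is a minimal pure-Python sketch of the sampling step. It perturbs the logits of a categorical distribution with Gumbel noise and applies a temperature-scaled softmax, giving a "soft" one-hot sample. How the tutorial applies this to program edits is not described here; the function name and the example logits are assumptions for illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample from the categorical given by `logits`."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 0.0, 0.0]  # hypothetical scores for three candidate tokens
soft_sample = gumbel_softmax(logits, temperature=0.5, rng=rng)
```

Lower temperatures push the sample closer to a hard one-hot vector, while the index of its largest entry is distributed according to the softmax of the original logits. Because the output is a smooth function of the logits, the same arithmetic written in an automatic differentiation framework lets gradients flow through a discrete choice, which is what makes it compatible with backpropagation.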