[ICSE14] Patch Verification via Multiversion Control Flow Graphs: One Year Retrospective

It is ICSE time again. Have fun in Italy, everyone!  Here, I’d like to share some of my thoughts on our ICSE14 paper as well as the valuable discussions I have had with my colleagues Guowei Yang, Cristian Cadar,  Eric Bodden, Christian Kastner, Hridesh Rajan and Yang Liu over the year.

In this paper, we developed a program representation MVICFG and static analysis techniques to address the questions of how to determine commonalities and variabilities of program properties over similar code and multiple versions, and why it is useful. Here, the “program properties” used are bugs, and “similar code” are software revisions and releases.

Through this research, we found that 1) a patch can contain bugs (also reported in Yin et al.’s 2011 FSE paper), 2) a patch cannot always fix all the affected versions, 3) documentation on which versions are impacted can be incomplete, and MVICFGs can be used to automatically compute such information, and 4) leveraging cached results computed for previous versions, we are able to verify changes in seconds, which is practical for continuous integration in industry.

Guowei and I have discussed the differences between his ICSE 2009 paper, regression model checking, and our paper. The former first computes all the statements that are impacted by a change and then performs model checking on the change impacted statements. The latter is to first identify all the paths affected by the change and then applied demand-driven analysis on the changed paths. Demand-driven analysis computes the dependency information and determines the properties at the same time.  Since demand-driven analysis terminates when the properties are determined, i.e., it does not necessarily cover all the change impacted statements and paths, and the model checking is performed on the entire set of affected statements, I’d say our approach still can be faster for computing many of the program properties.

In terms of future work,  Cristian has mentioned that it will be useful to apply KLEE to such representations to better solve some of the patch testing problems they had. Eric and Christian both mentioned that it may be interesting to use it to explore the challenges caused by the combinations of changes and configurations. Hridesh has discussed with me on whether it is feasible to use MVICFGs to represent program versions in Boa. Last but not least, Liu Yang’s group is interested in using the MVICFG to model malware variants.

To further improve the MVICFG,  we are still thinking about how to better represent the changes introduced in a new version. In this paper, we took an old version and a new version (when all the changes have already been made) and tried to align the statements across versions using function level textual diffs. The problems are twofold. We ignore the process of changes, and thus the changes we represent in an MVICFG  are not always the changes developers intended to make. To solve this problem, we need to integrate the information collected through code change process. There actually exist such code editors to do that documented in Susan Horwitz’s papers.  Another problem we faced is how we can better represent semantic changes in MVICFGs rather than syntactic changes. For instance, if the code performs refactoring, no changes should be shown on the MVICFG.  In addition to thinking about how to represent changes, we are considering how to develop a language independent version of the MVICFG  to make it beneficial to more types of analyses.


[ICSE’13] Segmented Symbolic Analysis: One Year Retrospective

After one year of finishing the ICSE’13 paper Segmented Symbolic Analysis, I’d like to share some of the excitements I have had when I wrote this paper. Essentially, this paper reflects the three visions I have had about program analysis.

1. Program analysis on segments. Traditionally, functions are important units to perform program analysis on. For example, intra-procedural analysis is a type of program analysis that only focuses on the information within a function, while inter-procedural analysis considers the program information in the context of calling relations. An important reason is that the implementation of the analysis is heavily reliant on compilers to get static information such as types or scopes, or to produce executables for dynamic analysis. In this paper, we developed a type of program analysis that breaks function boundaries and flexibly performs on selected code segments based on demand. This year, we did great amount of work on analyzing code repositories. We found that many revisions are not compilable or not very easy to be compiled, which largely restricts the use of the powerful semantic analysis for legacy code. We need to make semantic program analyses to be applicable to any piece of code rather than compilable units. This paper suggests it is potentially feasible.

2. Data analysis with program analysis: In this paper, we use linear regression to infer transfer functions useful for program analysis. Here, the data analysis is not used to infer declarative program properties, such as what are the relations of variables at a program point. Rather, I consider it as an initial step to synthesize instructions, steps of computation from the data.

3. Hybrid static and dynamic techniques: This idea is not new. The interesting direction is to invent various ways to combine static and dynamic techniques for maximum precision and scalability.

Visual Studio: Parsing XML files

Using System.XML package

Two steps:
1) in the .cpp file, add #using <System.XML.dll>, then using namespace System::XML
2) Add path of System.XML.DLL, e.g., ..\..\..\..\Windows\Microsoft.NET\Framework\v2.0.50727, to /Project/Property Pages/C/C++/General/Resolve #using References

Static Code Profiler

A useful tool to profile code.  The tool and tests can be downloaded here

Input:  c/c++/java files
Output: classification of statements based on their operators and data types of variables

Example output:
Statement classification based on computation
17 – the number of arithmetic
13 – the number of conditions (including loop conditions)
9 – the number of loops
10 – the number of goto, break, continue, return
74 – the number of calls

Statement classification based on data types
171 – the number of statements with only scalar (int, long, char, Boolean, short) data type
0 – the number of statements with float point – single precision
94 – the number of statements with double
0 – the number of statements with containers (stack, array, list, queue)
127 – the number of statements with user defined data types

Visual Studio: Debugging .cpps files

Four steps:
1. open Windows cmd
2. set environment by typing vcvars32
3. cl /Zi hello.cpp
4. open visual studio, menu open project, open hello.exe