🪴 jaden lorenc

Search

Search IconIcon to open search

Visualize a Codebase

Last updated Aug 22, 2023 Edit Source

#project #python

# Summary:

It’s hard to understand codebases. I’d like an automatic tool for it, that can show it to me in a visual structure.

# Tools

# possible outline

  1. Get PyCG call graphs for the project.
  2. Create dot files from call graph.
  3. Grab function-code pairs, as well as control flow data, from ast.
  4. Include the relevant code into the functions in the dot files.
  5. Create subgraphs in the dot files, corresponding to control flow logic within the functions.
  6. Remap function calls from between function-function to between subgraph function call-function.
  7. Run functions and parts of functions through OpenAI API to get summaries, and add the summaries to the nodes in the dot file.
  8. Create visualization from the dot file with javascript package.
  9. One of two things:
    • Create tree-like structure to allow for hiding/revealing of nodes, code sections, and summaries.
    • Somehow create on-click handlers for elements on the visualization to allow for more hands-on hiding and revealing of elements?

# Generated project plan:

Your idea sounds fantastic! Let’s break down the steps to achieve this:

  1. Parse the Python Codebase: Use PyCG to parse the python codebase. This will allow you to generate the required static analysis data.

  2. Generate Graph Data: Convert the data you get from PyCG into a format that’s suitable for visualization. You would likely want a node-link structure. Each function, class, or module can be a node, and the links can represent calls, inheritances, etc.

  3. Visualize with Graphviz: Use Graphviz, especially the DOT language, to create an initial visualization of the graph data. This serves as your first prototype.

  4. Interactive Features:

    • Zoom and Pan: Use a tool or framework like D3.js to make the visualization interactive. D3.js allows you to zoom, pan, and manipulate the SVG elements easily.
    • Collapse and Hide Sections: This can be implemented by adding interactivity where certain nodes or groups of nodes can be collapsed or expanded. Again, D3.js is perfect for this.
    • 2D/3D Space: If you want a 3D representation, consider using Three.js.
  5. Code and Summaries:

    • Displaying Code: When a user clicks or hovers over a node, display the associated code snippet in a pop-up or a side panel.
    • AI-Augmented Summaries: Use GPT-4 API to summarize the code sections. When a user interacts with a node, the AI can provide a brief, human-readable summary of what that function/class/module does.
  6. Automatic Collapsing:

    • Use metrics such as function length, complexity, or the number of internal function calls as parameters. Functions or modules with low complexity could be collapsed by default, or you could use the GPT-4 model to predict which sections are less relevant and collapse them.
  7. User Interface: Create an intuitive UI where users can interact with the graph, navigate through it, and get insights about the codebase.

  8. Continuous Updates:

    • File Watcher: Implement a file watcher to detect changes in the codebase. When a file changes, you can re-analyze and update the visualization in real-time.
  9. Additional Features:

    • Search & Highlight: Allow users to search for specific functions, classes, or variables and highlight them in the graph.
    • Dependencies: Show external module dependencies and how they are used in the codebase.
  10. Deployment & Scaling: Consider whether this tool will be web-based, a standalone application, or a plugin for an IDE. Also, consider scalability if you want to analyze large codebases.

Building this would be a significant project, but the outcome could be an incredibly useful tool for developers. The combination of visualizing code relationships and having AI-powered insights could redefine the way developers explore and understand codebases. Best of luck with your project!

# implementation

I asked about how the output of pycg could be used, as well as pulling out code to send to LLMs:

The data snippet you’ve shared is a representation of how functions/modules (the keys) relate to other functions/modules (the values in the list) in the form of a call graph.

To focus on functions defined within a specific directory:

  1. Filter by Prefix: One approach is to use a naming convention that includes the module’s path, then filter keys based on that prefix.

    1
    2
    
    directory_prefix = "entity_extraction_module.relationship_extraction"
    filtered_data = {key: value for key, value in pycg_data.items() if key.startswith(directory_prefix)}
    

    This way, you only retain function calls that are inside the given directory/module. However, note that if external modules/functions are called from within your target directory, they’ll still appear in the lists but won’t have their own keys.

  2. Collect Function Definitions: To retrieve the actual content of the functions, you’d need to use another tool or library that’s able to parse the Python code and extract function definitions. A library like ast (Abstract Syntax Tree) that’s built into Python can be quite useful for this.

    Here’s a simple way to grab the content of functions using the ast module:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    
    import ast
    
    class FunctionVisitor(ast.NodeVisitor):
        def __init__(self):
            self.function_data = {}
    
        def visit_FunctionDef(self, node):
            source_code = ast.get_source_segment(code, node)  # code is the source code of the file as a string
            self.function_data[node.name] = source_code
            self.generic_visit(node)
    
    with open('your_script_path.py', 'r') as file:
        code = file.read()
        tree = ast.parse(code)
        visitor = FunctionVisitor()
        visitor.visit(tree)
        # Now, visitor.function_data contains function names as keys and their source code as values.
    

    For the above to work, you need to have the entire codebase as strings. If you have a directory, loop through each file and process them one by one.

  3. Integration with GPT-4: Once you have extracted the function content, you can pass it to GPT-4 for further processing, summarization, or any other operation you’re interested in.

Always remember to handle any potential exceptions or issues, like file access errors or AST parsing errors. And depending on the size of the codebase, extracting and processing function definitions might take some time, so consider implementing some caching or other optimization techniques as necessary.