Project Structure
An overview of the architecture principles that help produce modular, maintainable projects, and of the tools that make them easier to adopt.
Scroll through the slides below, or click here to display them in full screen.
Introduction
Structuring a project makes it possible, at a glance, to distinguish the code itself from its ancillary parts, such as dependencies to manage, documentation, etc. High-quality scripts alone are not enough to ensure the overall quality of a data science project.
A clear and methodical organization of code and data makes the project easier to understand and facilitates updating and evolving the code. It’s much easier to improve a production pipeline when its components are clearly and distinctly separated rather than all mixed together.
As in the previous chapter, the goal of a good project structure is to enhance both the readability and maintainability of the code. Also like before, the organization of a project combines formal rules driven by the language (e.g., how the language handles interdependence between scripts) with arbitrary conventions that may evolve over time.
The goal is to remain pragmatic in how we structure the project, adapting it to its intended purpose. Depending on whether the final deliverable is an API or a web app, the technical solution we implement will require different script structures. However, some universal principles can still apply, and we should allow ourselves flexibility if the project’s outputs change over time.
The first and non-negotiable step is to use Git (see the dedicated chapter). With Git, bad practices that could hinder the project’s future development become very obvious and can be fixed early on.
The key principles are as follows:
- Favor scripts over notebooks — a specific concern for data science projects
- Organize your project modularly
- Adopt community standards for project structure
- (Self)-document your project
These principles are a direct continuation of those covered in the previous chapter.
Demonstration by Example
Here’s an example of a project structure that might bring back memories:
├── report.qmd
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── report.pdf
├── partial data.csv
├── script.R
└── script_final.py
Source: eliocamp.github.io
Such a project structure makes the project difficult to understand. Several key questions arise:
- What are the input data to the pipeline?
- In what order are the intermediate data generated?
- What is the purpose of the graphical outputs?
- Are all the scripts actually used in this project?
By structuring the folder using simple rules — for example, organizing it into inputs and outputs folders — we can significantly improve the project’s readability.
├── README.md
├── .gitignore
├── data
│ ├── raw
│ │ ├── data.csv
│ │ └── data2.csv
│ └── derived
│ └── partial data.csv
├── src
│ ├── script.py
│ ├── script_final.py
│ └── report.qmd
└── output
├── fig1.png
├── figure 2 (copy).png
├── figure10.png
├── correlation.png
└── report.pdf
Since Git is a prerequisite, every project includes a .gitignore file (this is especially important when working with data that must not end up on GitHub or GitLab).
A project also includes a README.md file at the root; we will come back to this later.
A project using continuous integration will also include specific files:
- if you’re using GitLab, the instructions are stored in the .gitlab-ci.yml file;
- if you’re using GitHub, this happens in the .github/workflows directory.
By simply changing the file names, the project structure becomes much more readable:
├── README.md
├── .gitignore
├── data
│ ├── raw
│ │ ├── dpe_logement_202103.csv
│ │ └── dpe_logement_202003.csv
│ └── derived
│ └── dpe_logement_merged_preprocessed.csv
├── src
│ ├── preprocessing.py
│ ├── generate_plots.py
│ └── report.qmd
└── output
├── histogram_energy_diagnostic.png
├── barplot_consumption_pcs.png
├── correlation_matrix.png
└── report.pdf
Now, the type of input data to the pipeline is clear, and the relationship between scripts, intermediate data, and outputs is transparent.
1️⃣ Notebooks Show Their Limits for Production
Jupyter notebooks are very useful for tinkering, experimenting, and communicating. They are thus a good entry point at the beginning of a project (for experimentation) and at the end (for communication).
However, they come with a number of long-term drawbacks that can make the code written in a notebook difficult—or even impossible—to maintain. Here are a few examples:
- All objects (functions, classes, and data) are defined and available in the same file. Any change to a function requires finding its location in the code, editing, and rerunning one or more cells.
- When experimenting, we write code in cells. In a notebook, there is no “margin” to jot down code like in a physical notebook. So we create new cells, not necessarily in order. When rerunning the notebook, this can cause hard-to-debug errors (since the logical execution order is not obvious).
- Notebooks encourage copy-pasting of cells and tweaking code rather than defining reusable functions.
- It is nearly impossible to version control notebooks effectively with Git. Since notebooks are large JSON files behind the scenes, they look more like data than source code. Git cannot easily identify which code blocks have changed.
- Moving notebooks into production is cumbersome, whereas well-written scripts are much easier to productionize (see next parts of the course).
- Jupyter lacks the extensions that enforce good practices (linters, etc.), whereas an IDE such as VSCode is well equipped for this.
- Risk of leaking sensitive data, since notebook outputs (e.g., the result of a head command) are written to disk by default.
In summary, their main drawbacks are:
- Limited reproducibility
- Not suited for automation
- Poor version control
These issues are particularly tied to the challenges of data science:
- The early stages of a data science project are exploratory, and notebooks provide a great interface for this. However, stability becomes more important in later phases.
- Data processing code is often developed non-linearly: you load data, transform it, produce outputs (e.g., summary tables), then go back to modify sources or join with other datasets. Although this exploratory phase is nonlinear, making the pipeline linear and reproducible later requires significant discipline.
The recommendations in this course aim to make long-term maintenance of data science projects as lightweight as possible by promoting code that can be reused by others (or by yourself in the future). The best practice is to use self-contained Python scripts (with respect to dependencies) encapsulated within a more or less formal processing pipeline. Depending on the project and infrastructure, this might be a single Python script or a formal pipeline; the level of formalism should be adjusted to the time available.
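To make this concrete, here is a minimal sketch of such a pipeline entry point; the module and function names (src.preprocessing, build_clean_dataset, plot_correlation_matrix) are hypothetical and only meant to mirror the example project structure above:

```python
# main.py - hypothetical entry point chaining the pipeline steps
from src.preprocessing import build_clean_dataset
from src.generate_plots import plot_correlation_matrix


def main():
    # Step 1: read the raw files and write the derived dataset
    df = build_clean_dataset(
        inputs=[
            "data/raw/dpe_logement_202103.csv",
            "data/raw/dpe_logement_202003.csv",
        ],
        output="data/derived/dpe_logement_merged_preprocessed.csv",
    )
    # Step 2: produce one of the graphical outputs
    plot_correlation_matrix(df, output_path="output/correlation_matrix.png")


if __name__ == "__main__":
    main()
```

A single command such as python main.py then replays the whole pipeline in order, which is precisely what a collection of notebook cells makes difficult.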
2️⃣ Fostering a Modular Structure
In the previous chapter, we recommended using functions. Grouping several functions into a single file is called a module.
Modularity is a fundamental programming principle that involves dividing a program into several independent modules or scripts, each with a specific purpose. As previously mentioned, structuring a project into modules makes the code more readable, maintainable, and reusable. Python provides a flexible and powerful import system, which allows control over variable scope, name conflicts, and dependencies between modules¹.
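As an illustration of this import system, here is a minimal sketch with two hypothetical modules that both define a function called clean; because each module has its own namespace, the two names do not conflict:

```python
# src/preprocessing.py (hypothetical module)
def clean(df):
    """Drop rows with missing values from a DataFrame."""
    return df.dropna()


# src/report_utils.py (another hypothetical module)
def clean(text):
    """Strip leading and trailing whitespace from a string."""
    return text.strip()


# main.py: each function is reached through its module's namespace
from src import preprocessing, report_utils

title = report_utils.clean("  Energy diagnostics  ")  # string version
# preprocessing.clean(df) would apply the DataFrame version
```

For from src import preprocessing to work, the src folder must be importable as a package, typically by adding an __init__.py file or by installing the project (see the package section below).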
Separating Code, Data, and Execution Environment Storage
Separating the storage of code, data, and the execution environment is important for several reasons:
- Data Security: by separating data from code, it is harder to accidentally access sensitive information.
- Consistency and Portability: an isolated environment ensures that the code runs reproducibly, regardless of the host machine.
- Modularity and Flexibility: each component (code, data, environment) can be adapted or updated independently (a minimal sketch of this separation follows the list).
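One concrete way of keeping the data location out of the code, in the spirit of the list above, is to read it from the environment. Below is a minimal sketch; the DATA_DIR variable name is an assumption, not a standard:

```python
import os

import pandas as pd

# The data location is injected by the environment rather than hard-coded,
# so the same script runs unchanged on another machine or storage backend.
data_dir = os.environ.get("DATA_DIR", "data/raw")  # DATA_DIR is a hypothetical variable
df = pd.read_csv(os.path.join(data_dir, "dpe_logement_202103.csv"))
```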
The next chapter will focus on dependency management. It will show how to link the environment and code to improve project portability.
Sensitive Configurations: Secrets and Tokens
Running code may depend on personal parameters (authentication tokens, passwords…). They should never appear in shared source code.
✅ Best practice: store these configurations in a separate file that is not versioned (listed in .gitignore), in YAML format, which is more readable than JSON.
Example secrets.yaml file:

```yaml
token:
  api_insee: "toto"
  api_github: "tokengh"
pwd:
  base_pg: "monmotdepasse"
```
Reading the file in Python:

```python
import yaml

with open('secrets.yaml') as f:
    secrets = yaml.safe_load(f)

# using the secret
jeton_insee = secrets['token']['api_insee']
```
This mechanism turns the file into a Python dictionary that is easy to navigate.
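Since this file must never be committed, it should be listed in the project’s .gitignore, for instance:

```
# .gitignore
secrets.yaml
```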
Unit Tests
Unit tests are automated tests that check that a unit of code, such as a function or a method, behaves as expected. The goal is to make sure that each unit of code works correctly before it is integrated into the rest of the program.
Unit tests are useful when working on a sizeable code base or when sharing code with other people, because they help ensure that changes do not introduce new errors.
In Python, the unittest package can be used to write unit tests. Here is an example taken from this site:
```python
# file: test_str.py
import unittest

class ChaineDeCaractereTest(unittest.TestCase):
    def test_reversed(self):
        resultat = reversed("abcd")
        self.assertEqual("dcba", "".join(resultat))

    def test_sorted(self):
        resultat = sorted("dbca")
        self.assertEqual(['a', 'b', 'c', 'd'], resultat)

    def test_upper(self):
        resultat = "hello".upper()
        self.assertEqual("HELLO", resultat)

if __name__ == '__main__':
    unittest.main()
```
To check that the tests pass, run this script from the command line:

```bash
python3 test_str.py
```

```
...
----------------------------------------------------------------------
Ran 3 tests in 0.000s

OK
```
If you write unit tests, it is important to maintain them! Spending time writing unit tests that are not maintained, and which therefore no longer return relevant diagnostics, is wasted time.
3️⃣ Adopt Community Standards
Turning Your Project Into a Python Package
A package is the finalized structure of a self-contained Python project. It provides a formal way to ensure the reproducibility of a project because:
- the package handles dependencies consistently
- the package offers built-in documentation structure
- the package facilitates code reuse
- the package enables scalability—you can reuse a package across projects
- the package simplifies debugging since it’s easier to pinpoint errors in a package
- …
In Python, packages are relatively easy to set up if you follow good project structuring practices. From the previously discussed modular structure, it’s a short step to a package: simply add a pyproject.toml file to control how the package is built (see here).
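As an illustration only, a minimal pyproject.toml using the setuptools build backend could look like the sketch below; the package name and dependency list are placeholders, not part of the example project:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "my_project"        # hypothetical package name
version = "0.1.0"
description = "Processing pipeline for the DPE housing data"
dependencies = [
    "pandas",
    "pyyaml",
]
```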
There are several tools for installing a package locally from a file structure. The two most common are:
The package bridges the gap between modular and portable code, a topic we’ll revisit in the next chapter.
4️⃣ Documenting Your Project
The first principle, illustrated in the example, is to favor self-documentation through meaningful names for folders and files.
The README.md file, located at the root of the project, serves as both the identity card and the showcase of the project. On platforms such as GitHub and GitLab, it is displayed by default on the project’s homepage, so it is the first thing a visitor sees; that very brief moment can be crucial for how the project’s value is perceived.
Ideally, the README.md includes the following elements (a minimal skeleton is sketched after this list):
- A description of the context and objectives of the project
- An explanation of how it works
- A contribution guide if the project welcomes input as part of an open-source initiative
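As a sketch, a README.md covering these three elements might be organized as follows; the section titles are indicative only:

```markdown
# Project name

One or two sentences describing the context and objectives of the project.

## How it works

How to install the dependencies and run the pipeline (for instance `python main.py`).

## Contributing

How to report issues and propose changes (branches, pull requests, coding conventions).
```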
A few examples of complete README.md files in R:
Footnotes
1. In this regard, Python is much more reliable than R. In R, if two scripts use functions with the same name but from different packages, there will be a conflict. In Python, each module is imported within its own namespace.