2.2.1. Operations on file search and mapping

Most analysis operates on data stored in the file system, so the ability to manipulate these files accurately (primarily, to load them) is an essential skill. Loading the correct files ensures that the intended data is used for analysis, which reduces the risk of errors and inconsistencies and improves the overall efficiency of the process.

Implementing file operations requires a good understanding of file structure and format, as well as the size and complexity of the files being analyzed. Large or complex files can pose challenges in terms of processing power, memory usage, and analysis time.

2.2.1.1. Applying generator function

For persistent inflow data, memory cost should be taken into consideration. Code 2.12 demonstrates two different ways of returning an iterable container. Compared with iter2, iter1 accumulates every case in memory and holds the whole list until the caller releases the returned result, which can readily exhaust memory if the number of cases is large enough.

Code 2.12 generator function
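# Placeholder data source (an assumption for illustration); any iterable works.
cases = [object() for _ in range(5)]

def iter1():
    # Eager: collect every case in a list before returning.
    res = []
    for case in cases:
        res.append(case)
    return res

def iter2():
    # Lazy: a generator function that yields one case at a time.
    for case in cases:
        yield case

def iter3():
    # Lazy: a generator expression, equivalent to iter2.
    return (_ for _ in cases)

# All three implementations produce the same objects in the same order.
print(all(id(v1) == id(v2) for v1, v2 in zip(iter1(), iter2())))  # True
print(all(id(v1) == id(v2) for v1, v2 in zip(iter1(), iter3())))  # True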

Notably, a parenthesized comprehension in Python is a generator expression rather than a tuple comprehension, and it is evaluated lazily: a generator object is returned. In the example above, iter3 can therefore be seen as an equivalent implementation of iter2.
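
The difference between eager and lazy containers can be observed directly. The following is a minimal sketch, assuming a hypothetical `cases` range as a stand-in for a large data stream; the names `as_list` and `as_generator` are illustrative only and are not part of any library.

```python
import sys

cases = range(100_000)  # hypothetical stand-in for a large stream of cases


def as_list():
    # Eager: every element is stored before the function returns.
    return [case for case in cases]


def as_generator():
    # Lazy: elements are produced one at a time, on demand.
    return (case for case in cases)


eager = as_list()
lazy = as_generator()

# The list's footprint grows with the number of elements; the generator
# object stays small regardless of how many items it will yield.
print(sys.getsizeof(eager) > sys.getsizeof(lazy))  # True

# Both containers deliver the same values, but a generator is single-use.
print(sum(lazy) == sum(eager))  # True
print(next(lazy, None))         # None, because the generator is exhausted
```

The trade-off is that a generator can be traversed only once; if the data must be revisited, it has to be regenerated or stored.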

2.2.1.2. File searcher and folder relocation

For generality, the search and filtering interfaces are designed as higher-order functions. Code 2.13 shows two different pipelines: one searches for image files with a jpg or jpeg suffix, and the other finds empty files.

Code 2.13 file searcher
from info.me import io, Unit
import os
p = Unit(mappings=[io.search_from_root])
# p1: files whose names end with jpg or jpeg.
p1 = p.shadow(search_condition=lambda x: x[-3:] == 'jpg' or x[-4:] == 'jpeg')
# p2: files whose size on disk is zero bytes.
p2 = p.shadow(search_condition=lambda x: os.stat(x).st_size == 0)
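
The higher-order design can be illustrated without the library. The following is a minimal standard-library sketch, not the implementation of `io.search_from_root`: the function `search_files` and the predicates `is_jpeg` and `is_empty` are hypothetical names introduced here, and the temporary directory exists only to make the example self-contained.

```python
import os
import tempfile


def search_files(root, condition):
    """Lazily yield every file path under root for which condition(path) is true."""
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if condition(path):
                yield path


def is_jpeg(path):
    # Matches both the jpg and jpeg suffixes, ignoring case.
    return path.lower().endswith(('.jpg', '.jpeg'))


def is_empty(path):
    # A file is empty when its size on disk is zero bytes.
    return os.path.getsize(path) == 0


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    samples = [('a.jpg', b'x'), (os.path.join('sub', 'b.jpeg'), b''), ('c.txt', b'')]
    for name, content in samples:
        with open(os.path.join(root, name), 'wb') as fh:
            fh.write(content)

    # The same searcher serves both pipelines; only the predicate changes.
    jpegs = sorted(os.path.relpath(p, root) for p in search_files(root, is_jpeg))
    empties = sorted(os.path.relpath(p, root) for p in search_files(root, is_empty))
    print(jpegs)
    print(empties)
```

Passing the condition as an argument keeps the traversal logic in one place, so a new search only requires a new predicate.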

Occasionally, files in the same folder need uniform treatment (for example, when folder names serve as labels for data with different attributes). In this circumstance, folder relocation avoids redundant per-file operations. Following p1 in Code 2.13, the example in Code 2.14 reads images located in different folders that are named after the file format suffix (as shown in the example on the right).

Both searchers can run on either a desktop file system or the Hadoop Distributed File System. Refer to their documentation for details.

Code 2.14 folder relocation
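# p3 searches for bmp files, complementing the jpg/jpeg searcher p1 from Code 2.13.
p3 = p.shadow(search_condition=lambda x: x[-3:] == 'bmp')
for f in io.leaf_folders(data='root_path'):  # 'root_path' is a placeholder
    # Choose the searcher from the folder name; other folders yield no images.
    imgs = p1(data=f) if f[-3:] == 'JPG' else p3(data=f) if f[-3:] == 'BMP' else []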
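
The folder-driven dispatch can also be sketched with the standard library alone. This is an illustration under stated assumptions, not the behavior of `io.leaf_folders`: here a leaf folder is assumed to be one with no subfolders, and `leaf_folders`, `files_with_suffix`, and `suffix_by_label` are hypothetical names.

```python
import os
import tempfile


def leaf_folders(root):
    """Yield folders under root that contain no subfolders (assumed definition)."""
    for folder, subfolders, _names in os.walk(root):
        if not subfolders:
            yield folder


def files_with_suffix(folder, suffix):
    """List files directly inside folder whose names end with suffix."""
    return sorted(n for n in os.listdir(folder) if n.lower().endswith(suffix))


# Map a folder label to the file suffix expected inside that folder.
suffix_by_label = {'JPG': '.jpg', 'BMP': '.bmp'}

with tempfile.TemporaryDirectory() as root:
    for label, name in [('JPG', '1.jpg'), ('BMP', '2.bmp'), ('TXT', '3.txt')]:
        os.makedirs(os.path.join(root, label))
        open(os.path.join(root, label, name), 'wb').close()

    found = {}
    for folder in leaf_folders(root):
        label = os.path.basename(folder)
        suffix = suffix_by_label.get(label)
        # Folders with an unknown label fall back to an empty result.
        found[label] = files_with_suffix(folder, suffix) if suffix else []
    print(found)
```

A lookup table replaces the chained conditional expression, which keeps the dispatch readable as more folder labels are added.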

2.2.1.3. Universal data loader

The appeal of functional programming lies in using functions to define behavior, which scales well across different datasets. With simple composition and definition, a pipeline can load any type of data:

Code 2.15 meta data loader
from info.me import io, Unit
from PIL import Image
import nibabel as nib
import numpy as np
p = Unit(mappings=[io.search_from_root, io.generic_filter])
# Each loader pairs a suffix test (search_condition) with a reader (apply_map).
bmp_loader = p.shadow(search_condition=lambda x: x[-3:] == 'bmp', apply_map=lambda x: np.array(Image.open(x)))
jpg_loader = p.shadow(search_condition=lambda x: x[-3:] == 'jpg', apply_map=lambda x: np.array(Image.open(x)))
# For NIfTI volumes, the last axis of the loaded array is moved to the front.
nii_array = p.shadow(search_condition=lambda x: x[-3:] == 'nii' or x[-6:] == 'nii.gz',
                     apply_map=lambda x: nib.load(x).get_fdata().transpose((2, 0, 1)))
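
The same composition of a search condition with a reader function can be sketched using only the standard library. This is a hedged illustration rather than the `Unit` mechanism itself: `make_loader`, `read_text`, and `read_size` are hypothetical names, and plain text and binary files stand in for image formats so that no third-party package is needed.

```python
import os
import tempfile


def make_loader(search_condition, apply_map):
    """Build a loader: search under a root, then map each match to loaded data."""
    def loader(root):
        for folder, _subfolders, names in os.walk(root):
            for name in sorted(names):
                path = os.path.join(folder, name)
                if search_condition(path):
                    # Loading happens lazily, one file per iteration.
                    yield apply_map(path)
    return loader


def read_text(path):
    with open(path, encoding='utf-8') as fh:
        return fh.read()


def read_size(path):
    return os.path.getsize(path)


# Two loaders built from the same factory; only the two functions differ.
txt_loader = make_loader(lambda p: p.endswith('.txt'), read_text)
size_loader = make_loader(lambda p: p.endswith('.bin'), read_size)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'note.txt'), 'w', encoding='utf-8') as fh:
        fh.write('hello')
    with open(os.path.join(root, 'blob.bin'), 'wb') as fh:
        fh.write(b'\x00' * 16)

    texts = list(txt_loader(root))
    sizes = list(size_loader(root))
    print(texts, sizes)
```

Supporting a new data type then amounts to supplying one predicate and one reader, with no change to the traversal code.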

Authors: Chen Zhang

Version: 0.0.5

Created on: Feb 18, 2024