2.2.1. Operations on file search and mapping
Most analysis operates on data stored in the file system, so the ability to manipulate these files accurately (primarily to load them) is an essential skill. Loading the correct files ensures that the intended data is used for analysis, which reduces the risk of errors and inconsistencies and improves the overall efficiency of the process.
Implementing file operations well requires a good understanding of the file structure and format, as well as the size and complexity of the files being analyzed. Large or complex files can pose challenges in terms of processing power, memory usage, and the time required for analysis.
2.2.1.1. Applying generator function
For continuously arriving data, memory consumption should be taken into consideration. Code 2.12
demonstrates two ways to return an iterable container. Compared with iter2, iter1 accumulates
every case in memory and holds them until the returned list is released, which can readily exhaust
memory if the number of cases is large enough.
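cases = ['case_a', 'case_b', 'case_c']  # placeholder data; any iterable of cases works

def iter1():
    # eager: collects every case in a list before returning
    res = []
    for case in cases:
        res.append(case)
    return res

def iter2():
    # lazy: yields one case at a time
    for case in cases:
        yield case

def iter3():
    # lazy: a generator expression, equivalent to iter2
    return (_ for _ in cases)

print(all([id(v1) == id(v2) for v1, v2 in zip(iter1(), iter2())]))  # True
print(all([id(v1) == id(v2) for v1, v2 in zip(iter1(), iter3())]))  # True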
Notably, a comprehension written in parentheses is not a tuple comprehension: Python treats it as a generator
expression, which is evaluated lazily and returns a generator object. In the example above, iter3 can therefore be seen as an equivalent implementation of iter2.
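The difference between the eager and the lazy style can be made visible with a small measurement. The following is a minimal sketch that is independent of the library; the function names eager_squares and lazy_squares are hypothetical and exist only for illustration.

```python
import sys
from typing import Iterator, List


def eager_squares(n: int) -> List[int]:
    """Build the complete list in memory before returning it."""
    return [i * i for i in range(n)]


def lazy_squares(n: int) -> Iterator[int]:
    """Yield one value at a time; nothing is computed until iteration starts."""
    for i in range(n):
        yield i * i


n = 100_000
eager = eager_squares(n)
lazy = lazy_squares(n)
gen_expr = (i * i for i in range(n))  # generator expression, also lazy

# The list container grows with n, while a generator object stays small.
eager_size = sys.getsizeof(eager)
lazy_size = sys.getsizeof(lazy)
print(eager_size > lazy_size)  # True

# All three produce the same values.
same_values = sum(eager) == sum(lazy) == sum(gen_expr)
print(same_values)  # True

# A generator can be consumed only once; a second pass yields nothing.
leftover = list(lazy)
print(leftover)  # []
```

Note that a generator is exhausted after a single pass, so a list remains the better choice when the same cases must be traversed several times and they fit in memory.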
2.2.1.2. File searcher and folder relocation
For generality, the search and filtering interfaces are designed as higher-order functions. Code 2.13 shows two different pipelines: one searches for image files with a jpg or jpeg suffix, and the other finds empty files.
from info.me import io, Unit
import os
p = Unit(mappings=[io.search_from_root])
p1 = p.shadow(search_condition=lambda x: x[-3:] == 'jpg' or x[-4:] == 'jpeg')
p2 = p.shadow(search_condition=lambda x: os.stat(x).st_size == 0)
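To illustrate the higher-order idea without relying on the library, the sketch below passes a predicate to a plain search function built on os.walk. The name search_files is hypothetical, and this is not a description of how io.search_from_root is implemented; it only shows the same pattern of supplying the search condition as a function.

```python
import os
import tempfile
from typing import Callable, Iterator


def search_files(root: str, condition: Callable[[str], bool]) -> Iterator[str]:
    """Lazily yield every file path under `root` for which `condition(path)` is true."""
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if condition(path):
                yield path


# Demonstration on a temporary folder holding a few placeholder files.
with tempfile.TemporaryDirectory() as root:
    for name, content in [('a.jpg', b'x'), ('b.jpeg', b'x'), ('c.txt', b''), ('d.bmp', b'x')]:
        with open(os.path.join(root, name), 'wb') as fh:
            fh.write(content)

    # Same search function, two different behaviors defined by the predicate.
    images = sorted(
        os.path.basename(p)
        for p in search_files(root, lambda x: x.lower().endswith(('.jpg', '.jpeg')))
    )
    empties = sorted(
        os.path.basename(p)
        for p in search_files(root, lambda x: os.stat(x).st_size == 0)
    )

print(images)   # ['a.jpg', 'b.jpeg']
print(empties)  # ['c.txt']
```

Matching on the full suffix including the dot, as done here with endswith, avoids accidental hits on names that merely end in the same letters.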
Occasionally, the files in the same folder need uniform treatment (for example, when folder names serve as labels
for data with different attributes). In this circumstance, folder relocation avoids redundant operations on
individual files. Following p1 in Code 2.13, the example implementation in Code 2.14
reads images located in different folders that are named after the file format suffix (as shown in
the example on the right).
Both searchers can run on either a desktop file system or the Hadoop Distributed File System (HDFS). Refer to their documentation for details.
p3 = p.shadow(search_condition=lambda x: x[-3:] == 'bmp')
for f in io.leaf_folders(data='root_path'):
    imgs = p1(data=f) if f[-3:] == 'JPG' else p3(data=f) if f[-3:] == 'BMP' else []
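The folder-driven dispatch can also be sketched in plain Python. The version below assumes that a leaf folder is a folder without subfolders, which may differ from what io.leaf_folders actually does; the names leaf_folders, SUFFIXES and files_by_folder_label are hypothetical. It replaces the chained conditional expression with a lookup table keyed by folder name.

```python
import os
import tempfile
from typing import Dict, Iterator, List, Tuple

# Hypothetical mapping from folder label to the accepted file suffixes.
SUFFIXES: Dict[str, Tuple[str, ...]] = {'JPG': ('.jpg', '.jpeg'), 'BMP': ('.bmp',)}


def leaf_folders(root: str) -> Iterator[str]:
    """Yield every folder under `root` that contains no subfolders (assumed definition)."""
    for folder, subfolders, _files in os.walk(root):
        if not subfolders:
            yield folder


def files_by_folder_label(root: str) -> Dict[str, List[str]]:
    """Group file names by leaf folder, keeping only suffixes registered for that label."""
    found = {}
    for folder in leaf_folders(root):
        label = os.path.basename(folder)
        suffixes = SUFFIXES.get(label, ())  # unknown labels match nothing
        found[label] = sorted(
            name for name in os.listdir(folder) if name.lower().endswith(suffixes)
        )
    return found


# Demonstration on a temporary folder tree with placeholder files.
with tempfile.TemporaryDirectory() as root:
    for sub, name in [('JPG', 'a.jpg'), ('BMP', 'b.bmp'), ('BMP', 'note.txt')]:
        os.makedirs(os.path.join(root, 'images', sub), exist_ok=True)
        with open(os.path.join(root, 'images', sub, name), 'wb') as fh:
            fh.write(b'x')
    grouped = files_by_folder_label(root)

print(grouped == {'JPG': ['a.jpg'], 'BMP': ['b.bmp']})  # True
```

A lookup table keeps the dispatch readable when more folder labels are added, whereas a chained conditional grows by one branch per label.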
2.2.1.3. Universal data loader
The charm of functional programming lies in using functions to define behaviors, which scales well across different datasets. With simple definitions and combinations, a pipeline can load any type of data:
from info.me import io, Unit
from PIL import Image
import nibabel as nib
import numpy as np
p = Unit(mappings=[io.search_from_root, io.generic_filter])
bmp_loader = p.shadow(search_condition=lambda x: x[-3:] == 'bmp', apply_map=lambda x: np.array(Image.open(x)))
jpg_loader = p.shadow(search_condition=lambda x: x[-3:] == 'jpg', apply_map=lambda x: np.array(Image.open(x)))
nii_array = p.shadow(search_condition=lambda x: x[-3:] == 'nii' or x[-6:] == 'nii.gz',
                     apply_map=lambda x: nib.load(x).get_fdata().transpose((2, 0, 1)))
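The combination of a search condition and a mapping function can be illustrated with the standard library alone. The sketch below is not the library's implementation; make_loader and read_text are hypothetical names, and plain text files stand in for images so that the example runs without PIL or nibabel.

```python
import os
import tempfile
from typing import Any, Callable, Iterator


def make_loader(condition: Callable[[str], bool],
                apply_map: Callable[[str], Any]) -> Callable[[str], Iterator[Any]]:
    """Return a loader that lazily applies `apply_map` to every file matching `condition`."""
    def loader(root: str) -> Iterator[Any]:
        for folder, _subfolders, files in os.walk(root):
            for name in sorted(files):  # sorted for a reproducible order within a folder
                path = os.path.join(folder, name)
                if condition(path):
                    yield apply_map(path)
    return loader


def read_text(path: str) -> str:
    """Read a whole text file as a string."""
    with open(path, encoding='utf-8') as fh:
        return fh.read()


# Two loaders built from the same factory; only the passed-in functions differ.
txt_loader = make_loader(lambda x: x.endswith('.txt'), read_text)
size_loader = make_loader(lambda x: True, os.path.getsize)

with tempfile.TemporaryDirectory() as root:
    for name, content in [('a.txt', 'alpha'), ('b.txt', 'beta'), ('c.bin', 'xyz')]:
        with open(os.path.join(root, name), 'w', encoding='utf-8') as fh:
            fh.write(content)
    texts = list(txt_loader(root))
    sizes = list(size_loader(root))

print(texts)  # ['alpha', 'beta']
print(sizes)  # [5, 4, 3]
```

Supporting a new data type then only requires a new pair of functions, while the traversal logic stays untouched.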
- Authors: Chen Zhang
- Version: 0.0.5
- Created on: Feb 18, 2024